Skip to content

Commit

Permalink
docs: Create feature-flag-best-practices.md (#4804)
Browse files Browse the repository at this point in the history
<!-- Thanks for creating a PR! To make it easier for reviewers and
everyone else to understand what your changes relate to, please add some
relevant content to the headings below. Feel free to ignore or delete
sections that you don't think are relevant. Thank you! ❤️ -->

## About the changes
<!-- Describe the changes introduced. What are they and why are they
being introduced? Feel free to also add screenshots or steps to view the
changes if they're visual. -->

<!-- Does it close an issue? Multiple? -->
Closes #

<!-- (For internal contributors): Does it relate to an issue on public
roadmap? -->
<!--
Relates to [roadmap](https://github.com/orgs/Unleash/projects/10) item:
#
-->

### Important files
<!-- PRs can contain a lot of changes, but not all changes are equally
important. Where should a reviewer start looking to get an overview of
the changes? Are any files particularly important? -->


## Discussion points
<!-- Anything about the PR you'd like to discuss before it gets merged?
Got any questions or doubts? -->

---------

Co-authored-by: Ivar Conradi Østhus <[email protected]>
  • Loading branch information
ardeche07 and ivarconr authored Sep 22, 2023
1 parent 91edae3 commit 4f5f1f3
Show file tree
Hide file tree
Showing 13 changed files with 367 additions and 2 deletions.
21 changes: 21 additions & 0 deletions website/docs/topics/feature-flags/availability-over-consistency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
---
title: 6. Design for failure. Favor availability over consistency.
---

Your feature flag system should not be able to take down your main application under any circumstance, including network disruptions. Follow these patterns to achieve fault tolerance for your feature flag system.

**Zero dependencies**: Your application's availability should have zero dependencies on the availability of your feature flag system. Robust feature flag systems avoid relying on real-time flag evaluations because the unavailability of the feature flag system will cause application downtime, outages, degraded performance, or even a complete failure of your application.

**Graceful degradation**: If the system goes down, it should not disrupt the user experience or cause unexpected behavior. Feature flagging should gracefully degrade in the absence of the Feature Flag Control service, ensuring that users can continue to use the application without disruption.

**Resilient Architecture Patterns**:

- **Bootstrapping SDKs with Data**: Feature flagging SDKs used within your application should be designed to work with locally cached data, even when the network connection to the Feature Flag Control service is unavailable. The SDKs can bootstrap with the last known feature flag configuration or default values to ensure uninterrupted functionality.

- **Local Cache**: Maintaining a local cache of feature flag configuration helps reduce network round trips and dependency on external services. The local cache can be periodically synchronized with the central Feature Flag Control service when it's available. This approach minimizes the impact of network failures or service downtime on your application.

- **Evaluate Locally**: Whenever possible, the SDKs or application components should be able to evaluate feature flags locally without relying on external services. This ensures that feature flag evaluations continue even when the feature flagging service is temporarily unavailable.

- **Availability Over Consistency**: As the CAP theorem teaches us, in distributed systems, prioritizing availability over strict consistency can be a crucial design choice. This means that, in the face of network partitions or downtime of external services, your application should favor maintaining its availability rather than enforcing perfectly consistent feature flag configuration caches. Eventually consistent systems can tolerate temporary inconsistencies in flag evaluations without compromising availability. In CAP theorem parlance, a feature flagging system should aim for AP over CP.

By implementing these resilient architecture patterns, your feature flagging system can continue to function effectively even in the presence of downtime or network disruptions in the feature flagging service. This ensures that your main application remains stable, available, and resilient to potential issues in the feature flagging infrastructure, ultimately leading to a better user experience and improved reliability.
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
---
title: 9. Choose open by default. Democratize feature flag access.
---

Allowing engineers, product owners, and even technical support to have open access to a feature flagging system is essential for effective development, debugging, and decision-making. These groups should have access to the system, along with access to the codebase and visibility into configuration changes:

1. **Debugging and Issue Resolution**:

- **Code Access**: Engineers should have access to the codebase where feature flags are implemented. This access enables them to quickly diagnose and fix issues related to feature flags when they arise. Without code access, debugging becomes cumbersome, and troubleshooting becomes slower, potentially leading to extended downtimes or performance problems.

2. **Visibility into Configuration**:

- **Configuration Transparency**: Engineers, product owners, and even technical support should be able to view the feature toggle configuration. This transparency provides insights into which features are currently active, what conditions trigger them, and how they impact the application's behavior. It helps understand the system's state and behavior, which is crucial for making informed decisions.

- **Change History**: Access to a history of changes made to feature flags, including who made the changes and when, is invaluable. This audit trail allows teams to track changes to the system's behavior over time. It aids in accountability and can be instrumental in troubleshooting when unexpected behavior arises after a change.

- **Correlating Changes with Metrics**: Engineers and product owners often need to correlate feature flag changes with production application metrics. This correlation helps them understand how feature flags affect user behavior, performance, and system health. It's essential for making data-driven decisions about feature rollouts, optimizations, or rollbacks.

3. **Collaboration**:

- **Efficient Communication**: Open access fosters efficient communication between engineers and the rest of the organization. When it's open by default, everyone can see the feature flagging system and its changes, and have more productive discussions about feature releases, experiments, and their impact on the user experience.

4. **Empowering Product Decisions**:

- **Product Owner Involvement**: Product owners play a critical role in defining feature flags' behavior and rollout strategies based on user needs and business goals. Allowing them to access the feature flagging system empowers them to make real-time decisions about feature releases, rollbacks, or adjustments without depending solely on engineering resources.

5. **Security and Compliance**:

- **Security Audits**: Users of a feature flag system should be part of corporate access control groups such as SSO. Sometimes, additional controls are necessary, such as feature flag approvals using the four-eyes principle.

Access control and visibility into feature flag changes are essential for security and compliance purposes. It helps track and audit who has made changes to the system, which can be crucial in maintaining data integrity and adhering to regulatory requirements.
35 changes: 35 additions & 0 deletions website/docs/topics/feature-flags/enable-traceability.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
---
title: 11. Enable traceability. Make it easy to understand flag evaluation.
---

Developer experience is a critical factor to consider when implementing a feature flag solution. A positive developer experience enhances the efficiency of the development process and contributes to the overall success and effectiveness of feature flagging. One crucial aspect of developer experience is ensuring the testability of the SDK and providing tools for developers to understand how and why feature flags are evaluated. This is important because:

1. **Ease of Testing and Debugging:**

- **Faster Development Cycles:** A feature flagging solution with a testable SDK allows developers to quickly test and iterate on new features. They can easily turn flags on or off, simulate different conditions, and observe the results without needing extensive code changes or redeployments.

- **Rapid Issue Resolution:** When issues or unexpected behavior arise, a testable SDK enables developers to pinpoint the problem more efficiently. They can examine the flag configurations, log feature flag decisions, and troubleshoot issues more precisely.

2. **Visibility into Flag Behaviour:**

- **Understanding User Experience:** Developers need tools to see and understand how feature flags affect the user experience. This visibility helps them gauge the impact of flag changes and make informed decisions about when to roll out features to different user segments. Debugging a feature flag with multiple inputs simultaneously makes it easy for developers to compare the results and quickly figure out how a feature flag evaluates in different scenarios with multiple input values.

- **Enhanced Collaboration:** Feature flagging often involves cross-functional teams, including developers, product managers, and QA testers. Providing tools with a clear view of flag behavior fosters effective collaboration and communication among team members.

3. **Transparency and Confidence:**

- **Confidence in Flag Decisions:** A transparent feature flagging solution empowers developers to make data-driven decisions. They can see why a particular flag evaluates to a certain value, which is crucial for making informed choices about feature rollouts and experimentation.

- **Reduced Risk:** When developers clearly understand of why flags evaluate the way they do, they are less likely to make unintentional mistakes that could lead to unexpected issues in production.

4. **Effective Monitoring and Metrics:**

- **Tracking Performance:** A testable SDK should provide developers with the ability to monitor the performance of feature flags in real time. This includes tracking metrics related to flag evaluations, user engagement, and the impact of flag changes.

- **Data-Driven Decisions:** Developers can use this data to evaluate the success of new features, conduct A/B tests, and make informed decisions about optimizations.

- **Usage metrics:** A feature flag system should provide insight on an aggregated level about the usage of feature flags. This is helpful for developers so that they can easily assess that everything works as expected.

5. **Documentation and Training:**

- **Onboarding and Training:** The entire feature flag solution, including API, UI, and the SDKs, requires clear and comprehensive documentation, along with easy-to-understand examples, in order to simplify the onboarding process for new developers. It also supports the ongoing training of new team members, ensuring that everyone can effectively use the feature flagging solution.
25 changes: 25 additions & 0 deletions website/docs/topics/feature-flags/evaluate-flags-close-to-user.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
---
title: 3. Evaluate flags as close to the user as possible. Reduce latency.
---

Feature flags should be evaluated as close to your users as possible, and the evaluation should always happen server side as discussed in Principle 2. In addition to security and privacy benefits, performing evaluation as close as possible to your users has multiple benefits:

1. **Performance Efficiency**:

a. **Reduced Latency**: Network roundtrips introduce latency, which can slow down your application's response time. Local evaluation eliminates the need for these roundtrips, resulting in faster feature flag decisions. Users will experience a more responsive application, which can be critical for maintaining a positive user experience.

b. **Offline Functionality**: Applications often need to function offline or in low-connectivity environments. Local evaluation ensures that feature flags can still be used and decisions can be made without relying on a network connection. This is especially important for mobile apps or services in remote locations.

2. **Cost Savings**:

a. **Reduced Bandwidth Costs**: Local evaluation reduces the amount of data transferred between your application and the feature flag service. This can lead to significant cost savings, particularly if you have a large user base or high traffic volume.

3. **Offline Development and Testing**:

a. **Development and Testing**: Local evaluation is crucial for local development and testing environments where a network connection to the feature flag service might not be readily available. Developers can work on feature flag-related code without needing constant access to the service, streamlining the development process.

4. **Resilience**:

a. **Service Outages**: If the feature flag service experiences downtime or outages, local evaluation allows your application to continue functioning without interruptions. This is important for maintaining service reliability and ensuring your application remains available even when the service is down.

In summary, this principle emphasizes the importance of optimizing performance while protecting end-user privacy by evaluating feature flags as close to the end user as possible. Done right, this also leads to a highly available feature flag system that scales with your applications.
35 changes: 35 additions & 0 deletions website/docs/topics/feature-flags/feature-flag-best-practices.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
---
title: 11 Principles for building and scaling feature flag systems
---

Feature flags, sometimes called feature toggles or feature switches, are a software development technique that allows engineering teams to decouple the release of new functionality from software deployments. With feature flags, developers can turn specific features or code segments on or off at runtime, without the need for a code deployment or rollback. Organizations who adopt feature flags see improvements in all key operational metrics for DevOps: Lead time to changes, mean-time-to-recovery, deployment frequency, and change failure rate.

There are 11 principles for building a large-scale feature flag system. These principles have their roots in distributed systems architecture and pay particular attention to security, privacy, and scale that is required by most enterprise systems. If you follow these principles, your feature flag system is less likely to break under load and will be easier to evolve and maintain.

These principles are:

1. [Enable run-time control. Control flags dynamically, not using config files.](./runtime-control.md)
2. [Never expose PII. Follow the principle of least privilege.](./never-expose-pii.md)
3. [Evaluate flags as close to the user as possible. Reduce latency.](./evaluate-flags-close-to-user.md)
4. [Scale Horizontally. Decouple reading and writing flags.](./scale-horizontally.md)
5. [Limit payloads. Feature flag payload should be as small as possible.](./limit-payloads.md)
6. [Design for failure. Favor availability over consistency.](./availability-over-consistency.md)
7. [Make feature flags short-lived. Do not confuse flags with application configuration.](./short-lived-feature-flags.md)
8. [Use unique names across all applications. Enforce naming conventions.](./unique-names.md)
9. [Choose open by default. Democratize feature flag access.](./democratize-feature-flag-access.md)
10. [Do no harm. Prioritize consistent user experience.](./prioritize-ux.md)
11. [Enable traceability. Make it easy to understand flag evaluation](./enable-traceability.md)

## Background

Feature flags have become a central part of the DevOps toolbox along with Git, CI/CD and microservices. You can write modern software without all of these things, but it sure is a lot harder, and a lot less fun.

And just like the wrong Git repo design can cause interminable headaches, getting the details wrong when first building a feature flag system can be very costly.

This set of principles for building a large-scale feature management platform is the result of thousands of hours of work building and scaling Unleash, an open-source feature management solution used by thousands of organizations.

Before Unleash was a community and a company, it was an internal project, started by [one dev](https://github.com/ivarconr), for one company. As the community behind Unleash grew, patterns and anti-patterns of large-scale feature flag systems emerged. Our community quickly discovered that these are important principles for anyone who wanted to avoid spending weekends debugging the production system that is supposed to make debugging in production easier.

“Large scale” means the ability to support millions of flags served to end-users with minimal latency or impact on application uptime or performance. That is the type of system most large enterprises are building today and the type of feature flag system that this guide focuses on.

Our motivation for writing these principles is to share what we’ve learned building a large-scale feature flag solution with other architects and engineers solving similar challenges. Unleash is open-source, and so are these principles. Have something to contribute? [Open a PR](https://github.com/Unleash/unleash/pulls) or [discussion](https://github.com/orgs/Unleash/discussions) on our Github.
38 changes: 38 additions & 0 deletions website/docs/topics/feature-flags/limit-payloads.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
title: 5. Limit payloads. Feature flag payload should be as small as possible.
---

Minimizing the size of feature flag payloads is a critical aspect of maintaining the efficiency and performance of a feature flag system. The configuration of your feature flags can vary in size depending on the complexity of your targeting rules. For instance, if you have a targeting engine that determines whether a feature flag should be active or inactive based on individual user IDs, you might be tempted to include all these user IDs within the configuration payload. While this approach may work fine for a small user base, it can become unwieldy when dealing with a large number of users.

If you find yourself facing this challenge, your instinct might be to store this extensive user information directly in the feature flagging system. However, this can also run into scaling problems. A more efficient approach is to categorize these users into logical groupings at a different layer and then use these group identifiers when you evaluate flags within your feature flagging system. For example, you can group users based on their subscription plan or geographical location. Find a suitable parameter for grouping users, and employ those group parameters as targeting rules in your feature flagging solution.

Imposing limitations on payloads is crucial for scaling a feature flag system:

1. **Reduced Network Load**:

- Large payloads, especially for feature flag evaluations, can lead to increased network traffic between the application and the feature flagging service. This can overwhelm the network and cause bottlenecks, leading to slow response times and degraded system performance. Limiting payloads helps reduce the amount of data transferred over the network, alleviating this burden. Even small numbers become large when multiplied by millions.

2. **Faster Evaluation**:

- Smaller payloads reduce latency which means quicker transmission and evaluation. Speed is essential when evaluating feature flags, especially for real-time decisions that impact user experiences. Limiting payloads ensures evaluations occur faster, allowing your application to respond promptly to feature flag changes.

3. **Improved Memory Efficiency**:

- Feature flagging systems often store flag configurations in memory for quick access during runtime. Larger payloads consume more memory, potentially causing memory exhaustion and system crashes. By limiting payloads, you ensure that the system remains memory-efficient, reducing the risk of resource-related issues.

4. **Scalability**:

- Scalability is a critical concern for modern applications, especially those experiencing rapid growth. Feature flagging solutions need to scale horizontally to accommodate increased workloads. Smaller payloads require fewer resources for processing, making it easier to scale your system horizontally.

5. **Lower Infrastructure Costs**:

- When payloads are limited, the infrastructure required to support the feature flagging system can be smaller and less costly. This saves on infrastructure expenses and simplifies the management and maintenance of the system.


6. **Reliability**:

- A feature flagging system that consistently delivers small, manageable payloads is more likely to be reliable. It reduces the risk of network failures, timeouts, and other issues when handling large data transfers. Reliability is paramount for mission-critical applications.

7. **Ease of Monitoring and Debugging**:

- Smaller payloads are easier to monitor and debug. When issues arise, it's simpler to trace problems and identify their root causes when dealing with smaller, more manageable data sets.
Loading

0 comments on commit 4f5f1f3

Please sign in to comment.