How Cloudflare's 'Code Orange: Fail Small' Project Strengthened Network Reliability

Cloudflare recently completed a major engineering initiative called Code Orange: Fail Small, aimed at making its global network more resilient, secure, and reliable. Over two quarters, the team focused on preventing the kind of configuration errors that led to outages in November and December 2025. The result is a series of improvements that make the network safer for all customers. Below, we answer key questions about the project and its impact.

1. What is Code Orange: Fail Small, and why did Cloudflare undertake it?

Code Orange: Fail Small was an intensive engineering effort focused on increasing the resilience of Cloudflare's infrastructure. The project was prompted by two global outages, in November and December 2025, that affected customer traffic. The core idea was to ensure that any failure in the network would be limited in scope and impact, hence the name “Fail Small.” By rethinking how configuration changes are deployed, how failures are contained, and how incidents are managed, Cloudflare aimed to dramatically reduce the likelihood of large-scale outages. The project is now complete, though continuous improvement remains a priority.

Source: blog.cloudflare.com

2. What were the primary goals of the project?

The project targeted several key areas:

Safer configuration changes – Ensuring that updates to network settings are rolled out gradually and monitored in real time.
Reducing the impact of failure – Containing problems to avoid widespread disruption.
Revising “break glass” procedures and incident management – Making emergency access and response more reliable.
Preventing drift and regressions – Implementing measures to maintain improvements over time.
Strengthening customer communication – Providing clearer updates during outages.

These goals were designed to make the network not only more robust but also more transparent to users.

3. How did Cloudflare make configuration changes safer?

Previously, internal configuration changes could reach the whole network almost instantly, so a single faulty change could cause widespread issues. Now, Cloudflare uses health-mediated deployment for all high-risk configuration pipelines. This means changes are rolled out progressively, with real-time health monitoring at each stage. If a problem is detected, the deployment is automatically rolled back before it spreads to the rest of the network. The approach mirrors the safe deployment practices already used for software releases, but is now applied to configuration data. A new internal system called Snapstone was built to unify this process across different teams and configuration types.
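The staged rollout described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Cloudflare's implementation: the stage percentages, the Fleet class, and the is_healthy callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stage plan: cumulative percentage of the fleet updated at each step.
STAGE_PERCENTS = (1, 10, 50, 100)


@dataclass
class Fleet:
    """Toy stand-in for a server fleet: maps server name to active config version."""
    servers: Dict[str, str]


def progressive_rollout(
    fleet: Fleet,
    new_version: str,
    is_healthy: Callable[[List[str]], bool],
    stage_percents: Sequence[int] = STAGE_PERCENTS,
) -> bool:
    """Apply new_version stage by stage; revert every updated server on failure.

    is_healthy receives the names of all servers updated so far and is an
    assumed hook into whatever health signals are available.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = dict(fleet.servers)   # snapshot used for rollback
    ordered = sorted(fleet.servers)  # deterministic rollout order
    updated: List[str] = []
    for percent in stage_percents:
        target = max(1, len(ordered) * percent // 100)
        batch = ordered[len(updated):target]
        for name in batch:
            fleet.servers[name] = new_version
        updated.extend(batch)
        if not is_healthy(updated):
            # Health check failed: restore the previous version everywhere we touched.
            for name in updated:
                fleet.servers[name] = previous[name]
            return False
    return True
```

With a fleet of 100 servers and a health check that fails at the second stage, only 10 servers ever see the new version, and all of them are reverted.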

4. What is Snapstone, and how does it work?

Snapstone is a custom-built internal component that enables health-mediated deployment for configuration changes. It bundles configuration updates into packages and releases them gradually while monitoring health signals. If a change degrades performance, crosses alert thresholds, or produces errors, Snapstone automatically triggers a rollback. Its key innovation is flexibility: teams can define any unit of configuration that needs health mediation, whether a data file (like the one that caused the November outage) or a control flag in the global config system (like the one involved in the December outage). Before Snapstone, each team had to build its own progressive rollout system, leading to inconsistency. Now, a unified tool makes safe deployment the default across the network.
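To make the "any unit of configuration" idea concrete, here is a minimal Python sketch of how a data file and a control flag could share one health-mediated interface. Snapstone's actual API is not public in this article, so every name below (ConfigUnit, UnitKind, Package, max_error_rate) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class UnitKind(Enum):
    """Illustrative kinds of configuration a team might register."""
    DATA_FILE = "data_file"
    CONTROL_FLAG = "control_flag"


@dataclass(frozen=True)
class ConfigUnit:
    """One unit of configuration plus the health threshold that guards it."""
    name: str
    kind: UnitKind
    payload: bytes
    max_error_rate: float  # assumed rollback threshold, as a fraction of requests


@dataclass
class Package:
    """A bundle of units that is released and rolled back together."""
    units: List[ConfigUnit]

    def violations(self, observed_error_rates: Dict[str, float]) -> List[str]:
        """Return names of units whose observed error rate exceeds their threshold."""
        return [
            unit.name
            for unit in self.units
            if observed_error_rates.get(unit.name, 0.0) > unit.max_error_rate
        ]


def should_roll_back(package: Package, observed_error_rates: Dict[str, float]) -> bool:
    """A package is rolled back as soon as any of its units breaches its threshold."""
    return bool(package.violations(observed_error_rates))
```

The point of the design is that the rollout machinery never needs to know whether it is shipping a file or a flag; it only sees a named unit and the threshold that guards it.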

5. How does health-mediated deployment prevent outages?

Health-mediated deployment stops bad changes from reaching the entire network at once. Instead of pushing a configuration update to all servers simultaneously, the change is released to a small subset of servers first. Observability tools track key metrics like latency, error rates, and throughput. If those metrics degrade, the rollout halts and reverts automatically. This limits the blast radius to a minimal number of servers. For example, if a configuration change accidentally disables a critical security rule, health mediation catches it before the rule is removed network-wide. The methodology also improves incident response: because changes are rolled out in stages, the team can identify the offending change faster and understand its impact.
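One way to picture the health check itself is as a comparison between servers that have the change and servers that do not. The Python sketch below assumes that framing; the metric names come from the paragraph above, while the tolerance values and the Metrics and Tolerances types are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    p99_latency_ms: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    throughput_rps: float


@dataclass(frozen=True)
class Tolerances:
    """Hypothetical limits on how far the canary may drift from the baseline."""
    max_latency_increase: float = 0.20   # relative: up to 20% slower
    max_error_rate_delta: float = 0.005  # absolute increase in error rate
    max_throughput_drop: float = 0.10    # relative: up to 10% fewer requests


def degraded_signals(
    baseline: Metrics, canary: Metrics, tol: Tolerances = Tolerances()
) -> List[str]:
    """Return the names of the signals that degraded beyond tolerance.

    An empty list means the rollout may proceed to the next stage;
    a non-empty list means it should halt and revert.
    """
    reasons: List[str] = []
    if canary.p99_latency_ms > baseline.p99_latency_ms * (1 + tol.max_latency_increase):
        reasons.append("latency")
    if canary.error_rate - baseline.error_rate > tol.max_error_rate_delta:
        reasons.append("error_rate")
    if canary.throughput_rps < baseline.throughput_rps * (1 - tol.max_throughput_drop):
        reasons.append("throughput")
    return reasons
```

Returning the list of degraded signals, rather than a bare boolean, also supports the incident-response benefit mentioned above: the halted rollout records which metric tripped and at which stage.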

6. How did Cloudflare improve incident management and customer communication?

Beyond technical safeguards, Cloudflare revised its “break glass” procedures, the emergency steps engineers take to override normal controls during critical failures. These procedures were tightened to ensure they are safe and auditable. Incident management processes were also updated to allow faster detection, diagnosis, and resolution. On the communication side, Cloudflare now provides earlier and more detailed notifications to customers during an outage. This includes clear explanations of the cause, expected impact, and steps being taken. By improving transparency, Cloudflare aims to reduce uncertainty and help customers better manage their own services during network events.

7. What measures prevent the project's improvements from degrading over time?

To prevent drift and regressions, Cloudflare introduced automated testing and continuous monitoring for the new deployment pipelines. Changes to health-mediated deployment tools are themselves subject to safe rollout. Additionally, incident post-mortems now include explicit checks to ensure that any new configuration pipelines adopt Snapstone. Code reviews and training sessions reinforce the importance of the new procedures. By embedding these checks into the development lifecycle, Cloudflare aims to hold future teams to the same standards, so that the network does not slowly become less resilient as new features are added.
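A check of this kind could be automated as a simple inventory audit. The Python sketch below is an assumption about what such a check might look like, not a description of Cloudflare's tooling; the Pipeline record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical inventory record for one configuration pipeline."""
    name: str
    high_risk: bool
    health_mediated: bool


def find_drift(pipelines: Iterable[Pipeline]) -> List[str]:
    """Return the names of high-risk pipelines that bypass health mediation."""
    return sorted(
        p.name for p in pipelines if p.high_risk and not p.health_mediated
    )
```

Run as part of continuous integration, a non-empty result from find_drift would fail the build, so a new high-risk pipeline cannot ship without adopting health-mediated deployment.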

8. What does this mean for Cloudflare customers going forward?

For customers, the most tangible benefit is increased confidence in the network’s stability. The changes mean that most configuration updates will be rolled out gradually with automated safeguards, reducing the chance of a single error affecting all traffic. In the event of a problem, impact is contained and response times are faster. Customers can also expect clearer communication during incidents. While no system can guarantee zero disruptions, the Code Orange: Fail Small project significantly lowers the probability of major outages. Moving forward, Cloudflare will continue to iterate on these processes, with network resilience as an ongoing priority throughout the product lifecycle.