Canary Deployment: What It Is and How To Use It

Mark Henke | May 1, 2018

Deploying to production can be risky. Despite all the mitigation strategies we put in place—QA specialists, automated test suites, monitoring and alerts, code reviews, static analysis—systems are getting more complex every day. This makes it more likely that a new feature can ripple into other areas of your app, causing unforeseen bugs and eroding the trust customers have in you.

Taking their cue from the miners of old, developers created the idea of canary deployments: releasing a new feature to just a subset of your users or systems. Rollout.io calls this gradual rollout. If we enable a feature within just part of our system, we can monitor any problems it creates. This lets us keep general customer trust high while freeing us to focus on innovation, delivering excellent new features to our customers.

History in Mining

The term “canary deployment” comes from an old coal mining technique. These mines often contained carbon monoxide and other dangerous gases that could kill the miners. Canaries are sensitive to airborne toxins, so miners would use them as early detectors. The birds would often fall victim to these gases before the gases reached the miners. This helped ensure the miners’ safety: one bird dying or falling ill could save multiple humans’ lives. In the same sense, the first part of our system to which we release a new feature acts as our canary. It detects potential bugs and disruption without affecting every other system running.

OK, But How Do I Make This Magic Happen?

The idea itself is straightforward, but there are a lot of nuances in how we should approach deploying these features. Often, we must know ahead of time that we’ll be canary releasing.

Does the Feature Need It?

Canary deployments have a cost. They add noise to your codebase that slows down development. The feature’s release will need to be maintained over a noticeable period of time, so this eats a bit into your team’s capacity. If you want to put a feature in a canary deployment, you need to be able to justify these costs.

Does this feature touch multiple areas of the application? Is this feature highly visible to the customers? Does it have a large impact on the customer base? Is it a relatively complex feature compared to others in the application? These types of questions can help you determine if canary deployment will be worth it.

It probably won’t be worth it to canary deploy a new field on the customer admin screen. But it might be worth it if you’re adding a major uplift to customer shopping carts.

What Will Be Your Canary?

It is important to know what things in your system you can use to partition features. There are commonly two areas that make great canaries: users and instances.

By User

Most applications have some concept of user. And most applications also make it easy to get certain pieces of information about the user, such as age, gender, and geographic location. You can query this information when running a feature to see if you should show it to that user.

You could partition by geographical region, showing only your Chinese customers a new feature. Or you could even partition on pure percentage, showing only 5% of users the new feature and watching whether your error counts spike or your responsiveness slows down. Try to choose a partition where trust is high or where the loss of customer trust will have a low impact. Perhaps sales in your Bulgarian market are small enough that a bad release won’t hurt the bottom line too much.
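A percentage partition works best when each user lands on the same side of the line on every request. Here is a minimal sketch of one common way to do that, hashing the user ID into a bucket from 0 to 99. The names bucket_for and in_canary, and the feature key and region codes, are made up for illustration and don’t come from any particular tool.

```python
import hashlib
from typing import Optional, Set


def bucket_for(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for one feature."""
    # Hashing the feature name together with the user ID keeps buckets
    # independent across features, so the same users are not always first.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(
    user_id: str,
    feature: str,
    percent: int,
    region: Optional[str] = None,
    canary_regions: Optional[Set[str]] = None,
) -> bool:
    """Return True if this user should see the feature."""
    # Optional region filter: when given, only users in those regions qualify.
    if canary_regions is not None and region not in canary_regions:
        return False
    # Users in buckets below the threshold are in the canary group.
    return bucket_for(user_id, feature) < percent


# Hypothetical call: show "new-cart" to roughly 5% of Bulgarian users.
show_it = in_canary("user-42", "new-cart", percent=5,
                    region="BG", canary_regions={"BG"})
```

Because the bucket never changes for a given user and feature, raising percent from 5 to 20 only adds users. Nobody who already has the feature loses it.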

Another idea is to create an early adopters program, letting people opt into new features. Doing this ensures that customers expect some level of disruption and will be more willing to overlook problems. Video game companies have been doing this for years.

By Instance

Separating by users is an easy way to start canary deployment. But if your system is large enough, you can consider using your application and service instances as canaries. If you have multiple instances of your application, you can configure a subset of them to have the new feature. This can be especially useful if you have multiple regional data centers. However, this is often less flexible than partitioning by user.

A good partition is a sliding scale or a set of discrete values. You want to avoid partitions that are only on/off so that you can better correlate impacts as you scale up the feature in your system.
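For instances, a sliding scale could be as simple as picking a percentage of your instance list. This is only a sketch under the assumption that every instance has a stable ID. The function name and the ID format are hypothetical.

```python
from typing import Iterable, Set


def canary_instances(instance_ids: Iterable[str], percent: int) -> Set[str]:
    """Pick a deterministic subset of instances to run the new feature."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Sorting makes the choice repeatable across calls and across machines.
    ordered = sorted(instance_ids)
    # Round up so any nonzero percent enables at least one instance.
    count = (len(ordered) * percent + 99) // 100
    return set(ordered[:count])


# Hypothetical fleet of ten instances, starting the canary on 10% of them.
fleet = [f"app-{i:02d}" for i in range(10)]
first_wave = canary_instances(fleet, 10)
```

Since the instances are always taken from the front of the same sorted list, each step up the scale keeps the earlier canaries and adds new ones.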

What Infrastructure Do I Need?

If you want to implement the ability to canary deploy in your system, there are a lot of options. The system needs to be able to partition the feature in some capacity, based on what you know will be your canary. You also want to ensure you can change this partition at runtime. This can be homegrown, meaning you can just slap in a database table and a class to take in your user context. You can use your load balancers to route traffic based on regional or user headers in the requests. Or you can save some development time and purchase tooling that will make it easy to set up canaries.
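To give a feel for the homegrown route, here is a rough sketch of a database table plus a class that takes a user context. It uses Python’s built-in sqlite3 module to stay self-contained. The table name, columns, and class names are assumptions for illustration, and a real system would likely sit on whatever database you already run.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    """The bits of user information the partition needs (hypothetical)."""
    user_id: str
    region: str


class CanaryFlags:
    """Stores which regions have which features, changeable at runtime."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS canary_flags ("
            "feature TEXT NOT NULL, region TEXT NOT NULL, "
            "PRIMARY KEY (feature, region))"
        )

    def enable(self, feature: str, region: str) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO canary_flags (feature, region) VALUES (?, ?)",
            (feature, region),
        )
        self._conn.commit()

    def disable(self, feature: str, region: str) -> None:
        self._conn.execute(
            "DELETE FROM canary_flags WHERE feature = ? AND region = ?",
            (feature, region),
        )
        self._conn.commit()

    def is_enabled(self, feature: str, user: UserContext) -> bool:
        # Read the table on every check so changes apply without a redeploy.
        row = self._conn.execute(
            "SELECT 1 FROM canary_flags WHERE feature = ? AND region = ?",
            (feature, user.region),
        ).fetchone()
        return row is not None


flags = CanaryFlags(sqlite3.connect(":memory:"))
flags.enable("new-cart", "BG")
```

Checking the table on every request is the simplest way to get runtime changes. If that proves too slow, a short-lived cache in front of the query is a natural next step.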

How Do I Know if Something Goes Wrong?

Canary deployments will only be useful to you if you can track their impact on your system. You’ll want to have some level of monitoring or analytics in place in your application. These analytics must correlate to how you’re partitioning your features. For example, if you’re partitioning by users in a region, you should be able to see traffic volume and latency by region. Some useful analytics are latency, internal error count, traffic volume, memory usage, and CPU usage.
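Once your metrics are split along the same lines as your partition, the check itself can be a simple comparison between the canary group and everyone else. The sketch below assumes you can count requests, errors, and latency per group. The class name and the thresholds (twice the error rate, 1.5 times the latency, at least 100 requests) are placeholder assumptions you would tune for your own system.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    """Running totals for one partition, such as 'canary' or 'baseline'."""
    requests: int = 0
    errors: int = 0
    total_latency_ms: float = 0.0

    def record(self, latency_ms: float, is_error: bool) -> None:
        self.requests += 1
        self.errors += int(is_error)
        self.total_latency_ms += latency_ms

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def mean_latency_ms(self) -> float:
        return self.total_latency_ms / self.requests if self.requests else 0.0


def canary_looks_unhealthy(
    canary: PartitionStats,
    baseline: PartitionStats,
    max_error_ratio: float = 2.0,    # assumed threshold
    max_latency_ratio: float = 1.5,  # assumed threshold
    min_requests: int = 100,         # assumed minimum sample size
) -> bool:
    """Flag the canary if it is clearly worse than the baseline."""
    if canary.requests < min_requests or baseline.requests < min_requests:
        return False  # too little traffic to judge either way
    # The small floor keeps a zero-error baseline from flagging a single error.
    error_limit = max(baseline.error_rate, 0.001) * max_error_ratio
    latency_limit = baseline.mean_latency_ms * max_latency_ratio
    return (canary.error_rate > error_limit
            or canary.mean_latency_ms > latency_limit)
```

A check like this is crude next to what a monitoring product offers, but it shows why the metrics have to be recorded per partition. Without the split there is nothing to compare.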

Fortunately, it’s easy these days to wire in analytics and monitoring. Google Analytics lets you slap JavaScript on a page header. You can grab open source options for no upfront cost, or you can get great capabilities through purchasing commercial products. If you’re on a cloud platform, many of these metrics are built in. It’s usually not worth building it yourself, but you may want to tweak an existing package according to your needs.

When Do I Release the Feature to Everyone?

As I mentioned earlier, canary-deployed features need to be maintained over time. Eventually, we want to remove the partition completely and let everyone use the feature.

Before you start, have a roadmap for how you will release the feature, even if it’s a generic roadmap you use for all your canary-deployed features. This will give the team a clear, visible end date. They won’t be caught off guard when disruptions happen in the system and they have to triage them. Eventually, you can kill the canary and remove the noise from your code or configuration.

The roadmap should have a timeline of not only when it will end but also how you plan to scale the feature. For example, maybe your roadmap is that you’re going to roll out a new product line first to China, then to India, then to all of Asia, and then to the world. Most importantly, it should have a rollback plan that your team members clearly understand and can handle.
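A roadmap like this can live in code as an ordered list of stages, with rollback being a step back to the previous stage. The following is a sketch under that assumption. The stage names and region codes are hypothetical, and the “asia” set is deliberately short and incomplete.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    regions: FrozenSet[str] = frozenset()
    everyone: bool = False


class Rollout:
    """Walks a feature through its stages, one step at a time."""

    def __init__(self, stages: Sequence[Stage]) -> None:
        self._stages = list(stages)
        self._index = -1  # -1 means the feature is off everywhere

    @property
    def current(self) -> str:
        return "off" if self._index < 0 else self._stages[self._index].name

    def advance(self) -> None:
        # Move to the next stage; stay put once the last stage is reached.
        if self._index < len(self._stages) - 1:
            self._index += 1

    def roll_back(self) -> None:
        # Step back one stage; from the first stage this turns the feature off.
        if self._index >= 0:
            self._index -= 1

    def is_enabled(self, region: str) -> bool:
        if self._index < 0:
            return False
        stage = self._stages[self._index]
        return stage.everyone or region in stage.regions


ROADMAP = [
    Stage("china", frozenset({"CN"})),
    Stage("china+india", frozenset({"CN", "IN"})),
    Stage("asia", frozenset({"CN", "IN", "JP", "KR"})),  # illustrative subset
    Stage("world", everyone=True),
]
```

Writing the stages down this way makes the plan visible to the whole team, and the rollback step is a single call that anyone on call can make.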

Focus on Achieving Excellence, Not Avoiding Risk

If you implement canary deployments for your features, you’ll feel a significant mental weight lift off of you. You’ll find yourself thinking less about production outages and disruptions. Instead, you’ll think more about how to push that next exciting feature to your customers.

About Rollout.io

Rollout is an advanced feature management solution that gives engineering and product teams feature control, post-deployment. It’s the most effective way to roll out new features to the right audience while protecting customers from failure and improving KPIs.