Industry Thoughts
Why full context matters for safer deploys
Mike Cugini
As we’ve mentioned before, CD’s static nature is one of its fatal flaws. Static pipelines treat “delivery” as a discrete event without considering the ever-changing contexts surrounding it — whether it’s infrastructure changes, changes coming from other teams’ deployment pipelines, or changes in user behavior. CD usually misses these signals.
CD is not intelligent enough to look for various sources of change and make sure it’s safe to deploy in the first place. And in the event of a failure, CD only gives us a limited window around software changes, so it’s not easy to find the context that led to something going wrong. Let’s take a look at three common scenarios where the static nature of CD, and its lack of context, causes failures in production.
Which of these has your team faced?
Infrastructure changes went bad.
At a large SaaS company, my team built an internal deployment system. At first, the deployment system only understood and handled syncing code changes—essentially packages being rsynced to remote hosts. Over time, the Infrastructure team started to adopt tools like Puppet or Chef to ensure other constraints were met, but these tools were introduced out of band, with no synchronization between the systems.
The deployment system had no context of the infra changes that were happening. There was no clear way to say, “this deployment relies on a change being made by Puppet/Chef,” and ensure it happened before deployment time.
Without the context of these infrastructure changes, deployments would fail unless humans carefully coordinated them. These failures were difficult to debug because the deployment system could not surface the missing infra changes. To fix the problem, we moved the required service setup out of systems like Chef and into a single config file for each service.
More and more things ended up defined in the service config, which gave CD more context, but at the cost of additional complexity and an explosion of configuration length.
(We talk more about how to solve this in Imperative configs are out; Declarative configs are in.)
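To make this concrete, here is a minimal sketch of what such a per-service config and its preflight check might look like. It is written in Python purely for illustration; the service name, field names, and helper functions are hypothetical, not the actual schema or system we used.

# Hypothetical per-service config: the deploy artifact plus the
# infrastructure prerequisites the deployment system needs to verify.
SERVICE_CONFIG = {
    "service": "billing-api",
    "artifact": "billing-api-2.14.3.tar.gz",
    "hosts": ["billing-01.dc1", "billing-02.dc1"],
    # Setup that previously happened out of band via Puppet/Chef,
    # now declared inline so it can be checked before syncing code.
    "prerequisites": {
        "packages": ["openssl", "libpq5"],
        "directories": ["/var/log/billing", "/srv/billing/cache"],
        "system_user": "billing",
    },
}

def prerequisites_met(host: str, prereqs: dict) -> bool:
    # Stub: a real implementation would inspect the host (or ask the
    # config-management system) whether each prerequisite is in place.
    return True

def preflight(config: dict) -> bool:
    # Refuse to deploy unless every host satisfies the declared prerequisites.
    return all(
        prerequisites_met(host, config["prerequisites"])
        for host in config["hosts"]
    )

Every new class of infrastructure dependency means another section in this file, which is exactly the configuration growth described above.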
Siloed pipelines
Later, when I was on a platform team, we supported many engineering teams with varied requirements. Each team had to come up with a deployment pipeline that made sense for its particular service, but most did not have the operational experience to determine this on their own.
We attempted to provide basic guardrail pipelines and guides on setting up good deployment hygiene. However, we could not predict the specific needs or circumstances these teams would encounter with their particular service.
Teams were encouraged to deploy into multiple data centers (distinct failure domains). With a static pipeline, this often meant choosing one region to act as a canary before moving on. With many teams operating independently, the environments in each data center could have unexpected differences; for example, a change might be rolled out in one data center but not yet in the others. This meant the simplistic single-data-center canary pattern was often inadequate to catch all issues.
Teams could set up more complex pipelines to canary across multiple data centers, but what happens if one data center succeeds and another fails? A static list of pipeline steps does not lend itself to coordinating multi-region deployments or dependencies crossing team boundaries.
Teams could further divide their deployments and run multiple independent pipelines (or even pipelines that trigger other pipelines). Still, the configuration's complexity expands rapidly, making it easier to make mistakes and miss edge cases. Teams have to know ahead of time what situations to expect, and when pipelines do fail, it becomes hard to see what actually went wrong.
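To see why, here is a rough sketch of a static multi-data-center canary pipeline, written as a Python list purely for illustration; the step shapes, region names, and runner are hypothetical rather than any real pipeline syntax.

# Hypothetical static pipeline: an ordered list of steps, run top to bottom.
PIPELINE = [
    {"step": "deploy", "region": "dc1", "scope": "canary"},
    {"step": "wait", "minutes": 60},  # bake time
    {"step": "deploy", "region": "dc1", "scope": "full"},
    {"step": "deploy", "region": "dc2", "scope": "canary"},
    {"step": "wait", "minutes": 60},
    {"step": "deploy", "region": "dc2", "scope": "full"},
    {"step": "deploy", "region": "dc3", "scope": "full"},
]

def execute(step: dict) -> bool:
    # Stand-in for the real step runner; returns whether the step succeeded.
    print("running", step)
    return True

def run(pipeline: list) -> None:
    # A static runner has only two outcomes per step: continue or abort.
    # There is no way to express "dc2's canary failed, so hold dc2 but let
    # dc1 and dc3 proceed," or "wait for another team's pipeline to finish."
    for step in pipeline:
        if not execute(step):
            raise RuntimeError(f"pipeline aborted at {step}")

The ordered list works until regions diverge; after that, every new situation means more steps or more pipelines, and the configuration keeps growing.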
Typically the only recourse to a failure is to roll back or roll forward; either way means starting over. Pipelines may also be spread over multiple days for safety, which can considerably delay changes. We intended to increase velocity, but we had the opposite effect: pipeline complexity kept growing, and velocity kept dropping.
Am I healthy?
I helped a team that ran many low queries-per-second (QPS) services. The low traffic made it difficult to detect issues around deployment time; problems often did not surface for hours or days.
The biggest issue here? This was part of a billing system, where it was critical for data to be correct.
When problematic code made its way to production, incorrect data could be recorded, requiring manual cleanup at the end of the quarter (when the issue would finally be detected). Other issues included billing silently breaking for users in certain parts of the world; because of the low number of requests in those regions, the problem would go undetected for days.
With a static pipeline, you have a discrete push event: the pipeline runs for some pre-determined amount of time and either succeeds or fails. So either you set a long push duration, giving problems more opportunity to surface but delaying the rollout of your changes, or you set a short duration, getting your change out sooner but potentially missing issues at deploy time. CD has limited data available during that push event, so if little or no traffic came in during that window, failures might not be detected. Detection gets pushed to other monitoring systems after the fact, and on-call teams are on the hook to deal with it, backtracking to figure out what happened. In an incident scenario, a human has to evaluate the state of the service (deployment state, metrics, errors) before determining how to remediate; by then, the context of the change that caused the issue has been lost.
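As a back-of-the-envelope illustration (the numbers and the gate itself are hypothetical, not any specific CD system), here is how a fixed bake-time health gate passes vacuously when a low-QPS service sees no traffic during the window:

def bake_gate(requests_seen: int, errors_seen: int, max_error_rate: float = 0.05) -> bool:
    # Hypothetical bake-time gate: fail the push if the error rate
    # observed during the bake window is too high.
    if requests_seen == 0:
        # No traffic during the window: the gate passes vacuously,
        # which is exactly how low-QPS failures slip through.
        return True
    return errors_seen / requests_seen <= max_error_rate

# A region serving roughly one request per hour will usually see zero
# requests in a 30-minute bake window, so a push that breaks every
# request still "passes" and the breakage surfaces days later.
print(bake_gate(requests_seen=0, errors_seen=0))    # True: vacuous pass
print(bake_gate(requests_seen=40, errors_seen=40))  # False: enough traffic to catch it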
Prodvana’s Dynamic Delivery Platform avoids these issues through continuous evaluation.
How does Prodvana avoid these issues? Our Dynamic Delivery Platform constantly evaluates the state of your systems, so it can detect changes outside the context of a discrete deployment event. It is aware of infrastructure changes, it knows what is being deployed across siloed applications and data centers, and it takes in alerts and metrics from monitoring systems, detecting changes even after new code has been deployed.
Dynamic Delivery uses this intelligence to determine if it’s safe to deploy your change (or continue to deploy your change). And if problems occur, it has the context of the changes made most recently, cutting down the time it takes to identify the root cause.
Sign up here to learn more about Prodvana’s Dynamic Delivery Platform and discuss your needs!