The Peril of Missed Schema Migrations
Andrew Fong
How often has a missed schema migration caused your application code to break in production?
If that has ever happened to you, this blog post is for you.
The Situation
Prodvana has a backend database that stores persistent information like every other application since computers evolved from punch cards. Our database contains a schema that describes our data model for deployments, desired states, and other relevant information. Our API server knows how to read from the database and return responses to clients, such as the UI, command line tools, and more.
Our team develops locally, and engineers can modify their local databases as the APIs and the data model evolve. For example, as we build RBAC functionality, we must add rights tables and other associated user information required for RBAC.
In this highly iterative, tight loop, changing the API and the database simultaneously is trivial, which creates the illusion of an atomic operation.
This breaks down when you must roll out your code and schema change in a specific order.
In the early days of Prodvana, we had a convention that the schema change was consistently applied to our databases before we rolled out code.
This is a fairly common convention across engineering teams and generally becomes part of your cultural norms as it did with us.
Until we forgot.
The Failure
We forgot to apply the schema change before the API server was deployed, and we caused an outage on our internal instance of Prodvana. Not awful, but we realized we needed to prevent this from happening again. We decided to validate that migrations had been applied rather than perform them automatically, since migrations were relatively infrequent and required close database monitoring.
Solution
The solutions to this problem can be categorized as process-oriented, fragmented automation, and managed delivery.
Process Oriented
The “Did you file your TPS Report?” approach is where many organizations start. I’ve seen this approach at AOL and YouTube, where TPMs, release managers, and engineers participated in change management committees and had formal documents. These documents were followed, with an observer checking each step.
At YouTube, hundreds of engineers were online in IRC validating site changes while a TPM checked off steps one by one. Even so, steps were missed, humans changed operations mid-flight, and the process was exceedingly chaotic despite a highly elaborate checklist.
Fragmented Automation
With fragmented automation, you codify rules in various places across different teams. You might try to chain pipelines in your CD system to ensure your database pipeline runs before necessary code deployments. But what are the necessary code deployments, and how do you remember to ensure this happens for all new pipelines?
Another solution I’ve seen a startup team use is to have the web servers themselves acquire a lock and apply a schema migration as part of their restart. This is excessively complicated and very dangerous. For the startup team that did this, it caused multi-hour outages due to the system behaving non-deterministically.
These fragmented automation solutions may seem reasonable when you’re first starting, but they will result in considerable debugging and error handling when things go wrong.
Prodvana’s Managed Delivery
Our solution was to continue to eat our dog food and use the Protections aspect of Managed Delivery.
Protections allow us to continuously validate any aspect of the system and create dependencies between services. Additionally, because of Prodvana's data model, every new associated service uses the Protection — no more trying to remember if everyone is using the proper rules.
Our Protection validates that no migration is in progress and that a minimum schema version has been satisfied.
To use our new code in Prodvana, you create a Protection definition and then reference it. This is accomplished in 23 configuration lines and will apply to 100% of your services.
The Protection definition defines what to execute to validate that the schema is correct:
You then reference the name of the protection in your service configuration with this snippet:
We can see the “spanner-migrated” protection passed!
Clicking through to the details of the Protection, we can see the check details, which appear like those of any other service running in the system.
The output shows that we require schema version 192, which matches the version the database is currently at, and the “Dirty: false” indicates that no migration is in progress.
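The check described above boils down to two conditions: no migration is in progress, and the applied schema version is at least the required one. Here is a minimal sketch of that logic in Python. It is not Prodvana's actual check or configuration: the names `SchemaState` and `check_schema` are hypothetical, and it assumes the database exposes a migration-tracking record with a version number and a dirty flag.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SchemaState:
    """Hypothetical snapshot of a migration-tracking record."""
    version: int  # latest schema version applied to the database
    dirty: bool   # True while a migration is in progress (or was interrupted)


def check_schema(state: SchemaState, required_version: int) -> Tuple[bool, str]:
    """Return (passed, message) for a schema gate.

    The gate passes only when no migration is in progress and the
    applied version is at least the required minimum.
    """
    if state.dirty:
        return False, f"migration in progress at version {state.version}"
    if state.version < required_version:
        return False, (
            f"schema version {state.version} is below "
            f"required version {required_version}"
        )
    return True, f"schema version {state.version} satisfies {required_version}"


if __name__ == "__main__":
    passed, message = check_schema(SchemaState(version=192, dirty=False), 192)
    print(passed, message)
```

A real check would read the state from the database and exit non-zero on failure so that the deployment system can treat it as a failed validation.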
However, it gets better! If you accidentally deploy while a migration is in progress, Prodvana will automatically start the release once the migration completes!
Protections give Reliability Engineers and other members of your engineering team an extremely fast way to codify rules and focus on correctness.
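To illustrate the idea of a release that waits on a check instead of failing outright, here is a rough sketch of a polling gate. It is an assumption about how such behavior could be modeled, not a description of how Prodvana implements it. The function name `wait_for_gate` and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_for_gate(
    check: Callable[[], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Re-run `check` until it passes or the timeout elapses.

    Returns True as soon as the check passes (the release may proceed)
    and False if the deadline is reached first.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulate a migration that finishes on the third poll.
    results = iter([False, False, True])
    print(wait_for_gate(lambda: next(results), poll_seconds=0.01))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays.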
Much of what Production Readiness Reviews seek to accomplish can be built into Protections, driving clarity and creating insane leverage for your reliability and platform teams as you scale your services.
If you’d like to learn more, schedule a demo!