Technical

Feb 28, 2024

Prodvana Architecture Part 3: The Convergence Engine

Naphat Sanguansin

Most deployment systems are based on pipelines, i.e., a static list of commands to deploy a service. If you recall, in Prodvana's Architecture Part 1, one requirement of the system is to respond to real-time changes. The dynamic nature of today's system architectures means pipelines are insufficient.

Instead, a more appropriate algorithm continuously re-evaluates the desired state against the current state of the world and then computes the next action to take, if any. This approach is a convergence loop. Recall that in Part 2, we outlined how Prodvana produces desired states.

Once the Prodvana Compiler has produced a desired state, the system deploys the desired state. As a reminder, a desired state for a Service contains:

  • a list of Release Channels the Service should be deployed to

  • for each Release Channel

    • the Runtime-level configuration for the Service (e.g., Kubernetes configuration)

    • the Runtime on which to run the Service

    • the rules for deploying the Service in this Release Channel (e.g., only deploy prod after staging has been stable at this same version)
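As a rough illustration, the structure listed above could be modeled with two small dataclasses. The class and field names here (ReleaseChannelState, ServiceDesiredState, runtime_config, deploy_rules) are assumptions made for this sketch, not Prodvana's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ReleaseChannelState:
    """Desired state of a Service in one Release Channel (hypothetical shape)."""
    name: str                       # e.g. "staging" or "production"
    runtime: str                    # the Runtime on which to run the Service
    runtime_config: dict[str, Any]  # Runtime-level configuration, e.g. Kubernetes config
    deploy_rules: list[str] = field(default_factory=list)  # rules for this channel


@dataclass
class ServiceDesiredState:
    """Desired state of a Service: the Release Channels it should be deployed to."""
    service: str
    release_channels: list[ReleaseChannelState] = field(default_factory=list)


# Example: one Service, two Release Channels, production gated on staging.
desired = ServiceDesiredState(
    service="web",
    release_channels=[
        ReleaseChannelState("staging", "k8s-staging", {"image": "web:v2"}),
        ReleaseChannelState(
            "production",
            "k8s-prod",
            {"image": "web:v2"},
            deploy_rules=["staging stable at the same version"],
        ),
    ],
)
```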

For more information about how the desired state is produced, see the previous post about the Prodvana Compiler in this series.

Prodvana Convergence Engine

The Prodvana Convergence Engine takes desired states from the Prodvana Compiler and current states from the Prodvana Runtime. It continuously makes decisions to converge the current states to their corresponding desired states.

Internal Convergence Engine Terminology

  • Entity - An object tracked by the Convergence Engine

  • Current state - State the entity is currently at

  • Desired state - State that the Convergence Engine is ultimately trying to converge an entity to

  • Target state - A snapshot of the desired state, updated when the Convergence Engine makes a decision

  • Delivery decision - The act of setting a target state from the desired state

Entities

Entities are the most low-level building blocks in the Prodvana Convergence Engine. They are interfaces for different objects the Convergence Engine cares about. Every entity must implement three functions:

  • fetch - what is the current state of the entity?

  • hasWork(target, current) - given the target state and current state, does any work need to happen?

  • apply(target) - take the target state and do any work necessary to get to that desired state.

Entities are implemented as a Directed Acyclic Graph (DAG), where an entity may have one or more children. fetch may look at children's states when computing the entity’s state.
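The three-function interface and the parent/child relationship can be sketched as an abstract base class. This is a minimal sketch under assumptions: the names Entity, Leaf, and Rollup are hypothetical, and has_work is simply the Python spelling of hasWork.

```python
from abc import ABC, abstractmethod
from typing import Any


class Entity(ABC):
    """Hypothetical base class for the fetch / hasWork / apply interface."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])  # outgoing edges in the DAG

    @abstractmethod
    def fetch(self) -> Any:
        """Return the current state of the entity."""

    @abstractmethod
    def has_work(self, target, current) -> bool:
        """Return True if work is needed to move from current to target."""

    @abstractmethod
    def apply(self, target) -> None:
        """Do whatever work is needed to reach the target state."""


class Leaf(Entity):
    """Leaf entity that reports a fixed state and never has work."""

    def __init__(self, name, state):
        super().__init__(name)
        self._state = state

    def fetch(self):
        return self._state

    def has_work(self, target, current):
        return False

    def apply(self, target):
        pass


class Rollup(Entity):
    """Parent entity whose fetch looks at its children's states."""

    def fetch(self):
        return {child.name: child.fetch() for child in self.children}

    def has_work(self, target, current):
        return False

    def apply(self, target):
        pass


parent = Rollup("service", [Leaf("a", "stable"), Leaf("b", "degraded")])
state = parent.fetch()  # {"a": "stable", "b": "degraded"}
```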

Additionally, each entity can have pre and post-conditions as defined by the compiled Service Configuration. Pre and post-conditions are implemented similarly to Services. However, output from them is not returned in the Service's fetch.

We define three types of entities:

RuntimeObject

RuntimeObject represents an object materialized in the underlying Runtime, e.g., a Kubernetes Deployment or ECS Service.

  • fetch - Communicate with the underlying Runtime and return the state of the object (e.g., the current version(s) active, replica count, status).

  • hasWork - Always return false.

  • apply - Do nothing (see ServiceInstance below).

  • Has no child objects or pre/post-conditions.

ServiceInstance 

ServiceInstance represents a Service in a Release Channel.

  • fetch - Roll up current states from RuntimeObject children.

  • hasWork - If the target version does not match the current version or the current version is not at a stable status.

  • apply - Pass the compiled Service Configuration to the Prodvana Runtime and have it make the changes.

    • NOTE: the reason apply happens in ServiceInstance instead of RuntimeObjects is to allow for complex scenarios like applying RuntimeObjects in a specific order.

  • Children: RuntimeObject.

  • Pre and post-conditions: as defined by the compiled Service Configuration.
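The hasWork rule for a ServiceInstance can be written as a single boolean expression. In this sketch, the InstanceState type and the "STABLE" status value are assumptions chosen for illustration.

```python
from dataclasses import dataclass

STABLE = "STABLE"  # hypothetical status value for a healthy, settled version


@dataclass
class InstanceState:
    version: str
    status: str


def service_instance_has_work(target: InstanceState, current: InstanceState) -> bool:
    """Work is needed if the versions differ or the current version is not stable."""
    return target.version != current.version or current.status != STABLE
```

Note that a Service Instance already running the target version still has work if it is not stable, which is what lets the apply loop keep retrying a deployment that has not settled.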

Service

Service represents a Service in Prodvana.

  • fetch - Roll up current states from ServiceInstance children.

  • hasWork - Always return false.

  • apply - Do nothing.

  • Children: ServiceInstance objects.

  • Pre and post-conditions: none.

Entities do not know about their parent’s convergence status. They receive instructions and return results. Also, entities are infinitely flexible and new entities can be defined as needed for more sophisticated workloads. We expand on this later in this post (See: Enabling Sophisticated Workflows with More Entity Types).


Convergence Manager and Entity State Manager

The Convergence Engine comprises two primary components: Convergence Manager and Entity State Manager. 

Entity State Manager is responsible for continuously fetching current states and applying target states for every entity. For every entity, it runs two loops forever:

# fetch loop
while True:
  state = entity.fetch()
  self.save_current_state(entity, state)

# apply loop
while True:
  target_state = self.get_target_state(entity)
  current_state = self.get_current_state(entity)
  if entity.hasWork(target_state, current_state):
    entity.apply(target_state)

# Sleep intervals and backoffs omitted for clarity

The Entity State Manager knows nothing about pre or post-conditions. It always attempts to move an entity to its target state.
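For readers who want something they can run, here is a minimal in-memory sketch of the two loops. It makes several assumptions: state is kept in plain dictionaries, the fetch and apply steps are interleaved in one bounded loop (instead of two independent infinite loops) so the example terminates, and FakeInstance is a stand-in entity invented for the demo.

```python
import time


class EntityStateManager:
    """Toy Entity State Manager that stores states in memory."""

    def __init__(self):
        self._current = {}
        self._target = {}

    def set_target_state(self, entity, state):
        self._target[entity.name] = state

    def get_target_state(self, entity):
        return self._target.get(entity.name)

    def get_current_state(self, entity):
        return self._current.get(entity.name)

    def fetch_once(self, entity):
        # One iteration of the fetch loop.
        self._current[entity.name] = entity.fetch()

    def apply_once(self, entity):
        # One iteration of the apply loop; skipped until a target exists.
        target = self.get_target_state(entity)
        current = self.get_current_state(entity)
        if target is not None and entity.has_work(target, current):
            entity.apply(target)

    def run(self, entity, iterations, interval=0.0):
        for _ in range(iterations):
            self.fetch_once(entity)
            self.apply_once(entity)
            time.sleep(interval)


class FakeInstance:
    """Stand-in entity whose state is just a version string."""

    name = "web-staging"

    def __init__(self):
        self.version = "v1"
        self.applied = 0

    def fetch(self):
        return self.version

    def has_work(self, target, current):
        return target != current

    def apply(self, target):
        self.version = target
        self.applied += 1


manager = EntityStateManager()
instance = FakeInstance()
manager.set_target_state(instance, "v2")
manager.run(instance, iterations=3)
```

After the first iteration the entity is at the target version, so the remaining iterations fetch but do not apply again.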

Convergence Manager is responsible for setting the desired state as the target state once preconditions have been met. 

while True:
  for entity in entities:
    current_state = entity_state_manager.get_current_state(entity)
    desired_state = self.get_desired_state(entity)
    status, target_state = self.compute_status_and_target_state(entity.preconditions, entity.postconditions, desired_state, current_state)
    self.save_convergence_status(entity, status)
    entity_state_manager.set_target_state(entity, target_state)

To contrast desired and target states, consider a Service with two Release Channels: staging and production. Production has a precondition of staging being stable. The target state for the staging Service Instance will immediately be set to the desired state.

On the other hand, the production Service Instance's target state will remain at the previously set desired state until staging has converged. Once staging converges, the target state will be set to the desired state, causing production to deploy.
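The staging and production example can be sketched as a small pure function. This is one possible shape, not Prodvana's implementation: the Status values are hypothetical, postconditions are left out for brevity, preconditions are modeled as zero-argument callables, and the previous target is passed in explicitly so it can be held while preconditions are unmet.

```python
from enum import Enum


class Status(Enum):
    WAITING_PRECONDITIONS = "waiting_preconditions"
    CONVERGING = "converging"
    CONVERGED = "converged"


def compute_status_and_target_state(preconditions, desired, current, previous_target):
    """Make a delivery decision: hold the old target or snapshot the desired state."""
    if not all(check() for check in preconditions):
        # Preconditions unmet: keep the previously set target state.
        return Status.WAITING_PRECONDITIONS, previous_target
    status = Status.CONVERGED if current == desired else Status.CONVERGING
    return status, desired


# Production waits for staging to reach v2 before its target moves.
staging = {"version": "v1"}
production_preconditions = [lambda: staging["version"] == "v2"]

before = compute_status_and_target_state(production_preconditions, "v2", "v1", "v1")
staging["version"] = "v2"  # staging converges
after = compute_status_and_target_state(production_preconditions, "v2", "v1", "v1")
```

Because the function only reads states it is handed, it makes no network calls, which fits the requirement that the Convergence Manager run as often as possible.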

The two components are separated for two reasons. 

First is a separation of concerns. A Convergence Engine is an exceedingly complex system. Splitting responsibilities between two modules allows us to keep both modules relatively simple.

The second is performance. The Convergence Manager needs to run as often as possible so that delivery decisions can happen quickly as new information comes in. Therefore, it should not make network calls or have external dependencies. The Entity State Manager does not have this requirement, as it operates on each entity individually, so it can take as long as it needs to make network calls.

Key Takeaways:

  • With the Convergence Manager, the Entity State Manager, and all entities working together, we can now model a release where one Release Channel goes after another.

  • No history is being tracked in any component. Instead, the Convergence Engine always takes the current state and computes the appropriate next step for each entity. This allows Prodvana to adapt the deployment plan as needed (e.g., pausing a deployment due to an alert firing or skipping a deployment if a Service Instance is already at the correct version).
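To illustrate the second takeaway, the sketch below picks the next Release Channel to deploy purely from current states, with no record of what ran before. The function name next_channel_to_deploy and the dictionary representation are assumptions for this example.

```python
def next_channel_to_deploy(channels, current_versions, desired_version):
    """Return the first channel, in order, not yet at the desired version, or None."""
    for channel in channels:
        if current_versions.get(channel) != desired_version:
            return channel
    return None


channels = ["staging", "production"]

# A deployment that failed after staging resumes at production, not from the start.
resumed = next_channel_to_deploy(
    channels, {"staging": "v2", "production": "v1"}, "v2"
)

# A Service already at the correct version everywhere needs no deployment.
nothing_to_do = next_channel_to_deploy(
    channels, {"staging": "v2", "production": "v2"}, "v2"
)
```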

Enabling Sophisticated Workflows with More Entity Types

The entity interface allows us to model more sophisticated releases by implementing new entities.

Approvals

Our users often want an approval step before deploying to production Release Channels. We implement this with the Approval entity type.

  • Approval

    • fetch - Check if approval has been submitted.

    • hasWork - Always return false.

    • apply - Do nothing.

    • Has no child objects or pre/post-conditions.

An Approval entity waits for approval to be submitted out of band from the convergence loop itself. The approval may be submitted via the Prodvana UI or API.
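A minimal sketch of such an entity follows. The submit method and the "pending"/"approved" state strings are hypothetical; submit stands in for whatever the UI or API handler would call.

```python
class Approval:
    """Toy Approval entity: state changes only through an out-of-band submit()."""

    def __init__(self, name):
        self.name = name
        self._approved = False

    def submit(self):
        # Called outside the convergence loop, e.g. by a UI or API handler.
        self._approved = True

    def fetch(self):
        return "approved" if self._approved else "pending"

    def has_work(self, target, current):
        return False  # the engine never acts on an approval

    def apply(self, target):
        pass  # nothing to do


approval = Approval("production-approval")
state_before = approval.fetch()
approval.submit()
state_after = approval.fetch()
```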

Protections

A Protection is a business rule that can be used to gate convergence (precondition) or fail the convergence after a deployment is done (postcondition). These rules can check the internal states of the deployment or call out to any other system.

For example, a Protection can be used to check if alerts are firing or if a database schema migration has been run.

A Protection is implemented with the Protection entity type.

  • Protection

    • fetch - Check if convergence should be paused (today implemented as a Kubernetes Job that exits 0 or 1).

    • hasWork - Always return false.

    • apply - Do nothing.

    • Has no child objects or pre/post-conditions.
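As a sketch of the fetch behavior, the example below maps a process exit code to a pass/fail state. It substitutes a local subprocess for the Kubernetes Job, and it assumes that exit code 0 means the Protection passes and any other code means convergence should pause; the source does not specify which code means what.

```python
import subprocess
import sys


class Protection:
    """Toy Protection entity backed by a command's exit code."""

    def __init__(self, name, command):
        self.name = name
        self.command = command  # argv list; stands in for a Kubernetes Job

    def fetch(self):
        # Assumption: exit 0 means passing, anything else means failing.
        result = subprocess.run(self.command, check=False)
        return "passing" if result.returncode == 0 else "failing"

    def has_work(self, target, current):
        return False

    def apply(self, target):
        pass


no_alerts = Protection("no-alerts", [sys.executable, "-c", "raise SystemExit(0)"])
alerts_firing = Protection("no-alerts", [sys.executable, "-c", "raise SystemExit(1)"])
```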


Convergence Extensions

A Convergence Extension is a user-defined command that runs as part of convergence.

Examples include running a database migration, building a docker image, and sending Slack notifications before/after deployment. A Convergence Extension is implemented with the ConvergenceExtension entity type.

  • ConvergenceExtension

    • fetch - Check if apply has run.

    • hasWork - Return true if apply has not run.

    • apply - Run user-defined command (today implemented as a Kubernetes Job).

    • Has no child objects or pre/post-conditions.
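Unlike Approval and Protection, a ConvergenceExtension does have work: it must run once. The sketch below models that with a boolean state. The command is a plain Python callable standing in for the Kubernetes Job, which is an assumption made to keep the example self-contained.

```python
class ConvergenceExtension:
    """Toy ConvergenceExtension: runs a user-defined command exactly once."""

    def __init__(self, name, command):
        self.name = name
        self.command = command  # zero-argument callable; stands in for a Job
        self._has_run = False

    def fetch(self):
        return self._has_run  # has apply run?

    def has_work(self, target, current):
        return not current  # work remains until the command has run

    def apply(self, target):
        self.command()
        self._has_run = True


calls = []
migration = ConvergenceExtension("db-migration", lambda: calls.append("migrated"))

# Three passes of an apply loop only run the command on the first pass.
for _ in range(3):
    current = migration.fetch()
    if migration.has_work(True, current):
        migration.apply(True)
```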

Going Beyond a Single Release with Maestro

The Convergence Engine, as previously described, makes one key simplification: there is only ever one active release at a time for a Service. However, one of the key features of Prodvana is the ability to queue up releases and have them go out concurrently.

For example, consider a system with a single staging environment and two production environments, with the promotion between environments taking a week. Instead of releasing sequentially, which would mean a release every two weeks, one can start a new release on staging after the previous release has been promoted to the first production environment, allowing for weekly releases. 
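The arithmetic in this example can be checked with a few lines. The function below is only a back-of-the-envelope model with hypothetical names: it assumes every promotion takes the same amount of time and that a new release can start on staging as soon as the previous one has been promoted once.

```python
def release_interval_weeks(promotions, weeks_per_promotion, overlapped):
    """Weeks between consecutive releases starting on staging."""
    if overlapped:
        # Staging frees up as soon as the previous release is promoted once.
        return weeks_per_promotion
    # Sequential: wait for the previous release to finish every promotion.
    return promotions * weeks_per_promotion


# One staging and two production environments means two promotions.
sequential = release_interval_weeks(2, 1, overlapped=False)
concurrent = release_interval_weeks(2, 1, overlapped=True)
```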

To support this functionality, we have to modify the interface of the Prodvana Convergence Engine to take a release instead of a desired state. Consequently, this change results in a massive system simplification.

Releases and desired states share the same structure. The difference is how they are used:

  • A release captures the user intent for a set of Service versions to go out.

  • A desired state is the current set of versions to which the Convergence Engine will converge the Services.

When there is only one release, a release and a desired state are the same. However, when multiple releases are in flight for the same Service, the desired state is computed from a combination of the pending releases.

Because the base case of a single release reduces to a desired state, we can abstract the differences between releases and desired states behind a single component and keep the rest of the Convergence Engine unchanged.

This component, Maestro Manager, compiles a list of releases into the effective desired state for a Service. A full technical deep dive on Maestro Manager will come later.

Learnings and Challenges

  • We built the Prodvana Convergence Engine without leveraging a workflow framework like Temporal, because modeling convergence as a static workflow is not straightforward. Instead of executing a series of commands, the Convergence Engine must take the current states and compute the next set of actions. 

  • The Convergence Engine’s interface to the Compiler started as a synchronous API call. This needed to be rewritten as an async operation with retries and state recovery, as our desired states got large enough that we started exceeding our database’s (GCP Spanner) transaction size limit.

  • We opted for simplicity over sophistication wherever possible. For example, the Entity State Manager's fetch loop could have been written as an event-based loop instead of a poll loop. However, polling was more straightforward to implement and was more than enough to scale to our users (thousands of entities, tens of thousands across all users).

Results

  • The Prodvana Convergence Engine has prevented and automatically rolled back over 20% of issues for users.

  • The Prodvana Convergence Engine has allowed us to model behaviors that would have been hard or impossible in a pipeline-based system. For example:

    • If the deployment fails midway through, restarting the convergence means that previous progress was not lost.

    • The convergence can wait for conditions to be true, like a database schema migration finishing, before the code can go out.

Next Step: Executing on a Runtime

The Prodvana Convergence Engine makes deployment decisions, causing apply commands to be sent to the Prodvana Runtime. That is where changes are materialized in our users’ infrastructure.

Continue reading about Prodvana's architecture in Part 4: Runtime Interface.

Intelligent Deployments Now.

Intelligent Software Deployment. Eliminate Overhead with Clairvoyance, Self Healing, and Managed Delivery.

© 2023 ✣ All rights reserved.

Prodvana Inc.
