Spinnaker provides unique building blocks to create tailor-made, highly collaborative continuous delivery pipelines. Join them at Spinnaker Summit.
A lot of questions we get from customers are really about Clouddriver: how to scale it, how to diagnose errors or performance issues. We're sharing an overview of the service (no code, I promise) and some tips to operate Clouddriver at scale in the hope it will help the Spinnaker community. This is the first in a series of posts on Clouddriver.
When deploying your app, Clouddriver will create server groups, change load balancers, and inform the rest of the services of what's out there. It is the service that discovers the state of the world and changes it.
Clouddriver works by polling your cloud infrastructure on a regular interval and storing the result in a shared cache (more on that later).
It is used by the following services:
Clouddriver itself initiates communication with:
Clouddriver defines cloud providers (such as AWS, Azure, GCP, CloudFoundry, Oracle, DC/OS, Kubernetes, Docker). Each provider can have accounts (such as a Kubernetes cluster or an AWS account).
There are two main functional areas in Clouddriver: caching and mutating operations.
Caching agents query your cloud infrastructure for resources and store the results in a cache store. Each provider has its own set of caching agents that are instantiated per account and sometimes per region. Each caching agent is specialized in one type of resource such as server groups, load balancers, security groups, instances, etc.
In reality, the number of caching agents varies greatly between providers and with your Clouddriver configuration.
For instance, AWS might have between 16 and 20 agents per region, performing tasks such as caching the status of IAM roles, instances, and VPCs as well as some agents operating globally for tasks such as cleaning up detached instances. And Kubernetes (v2) might have a few agents per cluster, caching things like custom resources and Kubernetes manifests. We'll go over some of these specifics in a later post.
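To make the idea of a caching agent concrete, here is a minimal Python sketch. It is not Clouddriver's code (which is written for the JVM): `CacheStore`, `CachingAgent`, `fetch`, and the naming scheme are all hypothetical, and evicting entries that an agent no longer sees is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Key = Tuple[str, str]  # (resource type, resource id)


class CacheStore:
    """Hypothetical in-memory cache store; not Clouddriver's actual API."""

    def __init__(self) -> None:
        self._items: Dict[Key, dict] = {}
        self._owned: Dict[str, Set[Key]] = {}  # agent name -> keys it last wrote

    def replace(self, agent_name: str, resource_type: str, resources: List[dict]) -> None:
        new_keys = {(resource_type, r["id"]) for r in resources}
        # Assumption: drop entries this agent wrote earlier but no longer sees.
        for stale in self._owned.get(agent_name, set()) - new_keys:
            self._items.pop(stale, None)
        for r in resources:
            self._items[(resource_type, r["id"])] = r
        self._owned[agent_name] = new_keys

    def ids(self, resource_type: str) -> List[str]:
        return sorted(i for (t, i) in self._items if t == resource_type)


@dataclass
class CachingAgent:
    """One agent per provider, account, region, and resource type."""
    provider: str
    account: str
    region: str
    resource_type: str
    fetch: Callable[[], List[dict]]  # stands in for a cloud API call

    @property
    def name(self) -> str:
        return "/".join([self.provider, self.account, self.region, self.resource_type])

    def run(self, store: CacheStore) -> int:
        # Poll the cloud, then store the result in the cache.
        resources = self.fetch()
        store.replace(self.name, self.resource_type, resources)
        return len(resources)


# Fake cloud state standing in for a provider API.
cloud = [{"id": "app-v001"}, {"id": "app-v002"}]
store = CacheStore()
agent = CachingAgent("aws", "prod", "us-west-2", "serverGroups", lambda: list(cloud))

agent.run(store)
first_poll = store.ids("serverGroups")

cloud.pop(0)  # a server group disappears from the cloud
agent.run(store)
second_poll = store.ids("serverGroups")
```

The second poll no longer lists the deleted server group, which is the behavior the rest of Spinnaker relies on when it reads the cache instead of calling the cloud directly.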
The cache store is where Clouddriver stores cloud resources. It comes in different flavors:
All these stores - with the exception of the in-memory store - work across multiple Clouddriver instances.
The agent scheduler is in charge of running caching agents at regular intervals across all Clouddriver instances. There are 5 types of schedulers:
Note that the cache store does not dictate the type of agent scheduler. For instance, you could use the SQL cache store along with the Redis-backed scheduler.
If you read Clouddriver source code, you'll see references to cats (aka Cache All The Stuff), which is the framework that manages agent scheduler + agents + cache store.
Now that we have all the primitives, the startup sequence should be intuitive: Clouddriver inspects its configuration and instantiates the cache store and the agent scheduler. For each provider enabled, agents are instantiated per account/region and added to the scheduler.
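The startup sequence can be sketched roughly as follows. This is an illustration only: the configuration keys (`providers`, `enabled`, `accounts`, `regions`, `pollIntervalSeconds`), the 30-second default, the `AGENT_TYPES` table, and the `AgentScheduler` class are all made up for the example and do not mirror Clouddriver's real configuration or classes.

```python
from typing import Dict, List

# Hypothetical resource types per provider; real providers define many more agents.
AGENT_TYPES: Dict[str, List[str]] = {
    "aws": ["serverGroups", "loadBalancers", "securityGroups", "instances"],
    "kubernetes": ["manifests"],
}


class AgentScheduler:
    """Hypothetical scheduler that only records which agents it would run."""

    def __init__(self, interval_seconds: int) -> None:
        self.interval_seconds = interval_seconds
        self.agents: List[str] = []

    def schedule(self, agent_name: str) -> None:
        self.agents.append(agent_name)


def start(config: dict) -> AgentScheduler:
    # Inspect the configuration and instantiate the scheduler.
    scheduler = AgentScheduler(config.get("pollIntervalSeconds", 30))
    for provider, settings in config["providers"].items():
        if not settings.get("enabled", False):
            continue  # disabled providers contribute no agents
        for account in settings.get("accounts", []):
            # Accounts without regions (e.g. a Kubernetes cluster) get one placeholder.
            for region in account.get("regions", ["*"]):
                for resource_type in AGENT_TYPES.get(provider, []):
                    scheduler.schedule(
                        f"{provider}/{account['name']}/{region}/{resource_type}"
                    )
    return scheduler


config = {
    "providers": {
        "aws": {
            "enabled": True,
            "accounts": [{"name": "prod", "regions": ["us-west-2", "us-east-1"]}],
        },
        "kubernetes": {"enabled": True, "accounts": [{"name": "dev-cluster"}]},
        "docker": {"enabled": False, "accounts": [{"name": "registry"}]},
    }
}
scheduler = start(config)
```

The point of the sketch is the multiplication: the number of agents grows with providers × accounts × regions × resource types, which is why agent counts vary so much between installations.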
When the scheduler runs:
Operations in detail
Clouddriver has the concept of atomic operations - a single unit of work. Spinnaker pipeline tasks trigger these operations to mutate cloud resources.
There are more than 200 atomic operations available in Clouddriver, such as creating a server group, terminating EC2 instances, or deploying Kubernetes manifests.
Operation statuses are saved in a task repository that can be backed by Redis, SQL, or memory, or by a "dual" repository used to migrate from one store to the next seamlessly.
Note that atomic operations that are sent together are immediately executed together in the same thread.
Atomic operations vary greatly in their complexity. They generally try to be atomic, but they are not always (e.g. deploying multiple Kubernetes manifests). We won't cover the implementation of atomic operations here, but if you're interested, check out Clouddriver's code.
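As a rough mental model of how operations and the task repository fit together, consider the Python sketch below. The class and method names (`Task`, `InMemoryTaskRepository`, `operate`, `run_operations`) are hypothetical stand-ins, not Clouddriver's API, and the operation does not call any cloud. Only the status strings are taken from a real task history.

```python
from typing import Dict, List


class Task:
    """Hypothetical task record holding the status history of a run."""

    def __init__(self, task_id: str) -> None:
        self.id = task_id
        self.history: List[str] = []
        self.completed = False
        self.failed = False

    def update_status(self, status: str) -> None:
        self.history.append(status)


class InMemoryTaskRepository:
    """Hypothetical in-memory task repository (the simplest backing store)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def create(self) -> Task:
        task = Task(str(len(self._tasks) + 1))
        self._tasks[task.id] = task
        return task

    def get(self, task_id: str) -> Task:
        return self._tasks[task_id]


class DisableAsgAtomicOperation:
    """Stand-in operation: it only records a status, it calls no cloud API."""

    def __init__(self, region: str, asg: str) -> None:
        self.region = region
        self.asg = asg

    def operate(self, task: Task) -> None:
        task.update_status(
            f"Initializing Disable ASG operation for [{self.region}:{self.asg}]..."
        )


def run_operations(repo: InMemoryTaskRepository, operations: list) -> Task:
    # Operations sent together run one after another in the calling thread.
    task = repo.create()
    task.update_status("Initializing Orchestration Task...")
    try:
        for op in operations:
            task.update_status(f"Processing op: {type(op).__name__}")
            op.operate(task)
    except Exception as exc:
        task.failed = True
        task.update_status(f"Operation failed: {exc}")
    finally:
        task.completed = True
    return task


repo = InMemoryTaskRepository()
task = run_operations(repo, [DisableAsgAtomicOperation("us-west-2", "deploy-preprod-v015")])
```

Because every status change is appended to the task, a client can poll the repository by task id and follow progress without knowing anything about the operation itself.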
From a user perspective, Clouddriver tasks are not very visible. You can however spot these tasks in the source link of stages:
Each stage and task contains the history of Clouddriver executions under the kato.tasks key:
"status": "Initializing Orchestration Task..."
"status": "Processing op: DisableAsgAtomicOperation"
"status": "Initializing Disable ASG operation for [us-west-2:deploy-preprod-v015]..."
The history contains tasks repeated with their status changes as well as any output. It's quite useful to understand what Spinnaker is actually doing under the hood and troubleshoot potential issues.
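If you want to pull that history out of a stage's JSON programmatically, something like the following Python sketch can help. The exact shape is an assumption: it supposes that the stage has a `context` object, that `kato.tasks` is a list of tasks, and that each task carries a `history` list of entries with a `status` field. Check the source of one of your own stages before relying on it; the sample document here is hand-written for the example.

```python
import json
from typing import List


def kato_statuses(stage: dict) -> List[str]:
    """Return status messages in order, collapsing consecutive repeats."""
    statuses: List[str] = []
    # Assumed layout: stage["context"]["kato.tasks"][i]["history"][j]["status"]
    for task in stage.get("context", {}).get("kato.tasks", []):
        for entry in task.get("history", []):
            status = entry.get("status")
            if status and (not statuses or statuses[-1] != status):
                statuses.append(status)
    return statuses


# Hand-written sample; tasks repeat as their status changes.
sample = json.loads("""
{
  "context": {
    "kato.tasks": [
      {"history": [
        {"status": "Initializing Orchestration Task..."}
      ]},
      {"history": [
        {"status": "Initializing Orchestration Task..."},
        {"status": "Processing op: DisableAsgAtomicOperation"},
        {"status": "Initializing Disable ASG operation for [us-west-2:deploy-preprod-v015]..."}
      ]}
    ]
  }
}
""")

statuses = kato_statuses(sample)
```

Collapsing consecutive duplicates keeps the output readable, since the same task typically appears several times with a growing history.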
We now have the main pieces of the puzzle:
However, most cloud mutating operations are not synchronous. For instance, when Clouddriver sends a request to AWS to launch a new EC2 instance, the API call returns successfully but the instance takes a while before it's ready. Even in Kubernetes, a manifest is accepted right away, but it can take a few seconds before the resource is considered ready. This is when Spinnaker uses on-demand caching agents.
On-demand caching agents are - as their name implies - created on demand by the client (Orca) in tasks such as Force Cache Refresh or Wait for Up Instances. They are used to ensure cache freshness and know when a resource is created or effectively deleted.
The main gotcha is that when using a cache store that works across multiple Clouddrivers (like Redis), Clouddriver will wait for the next regular caching agent of the same type to run before declaring the cache consistent. It gives the cache store one more chance to replicate its state (to other replicas in the case of Redis).
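The rule can be expressed in a few lines of Python. This is a simplified model, not Clouddriver's implementation: `FreshnessTracker` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, Tuple


class FreshnessTracker:
    """Hypothetical model of when an on-demand refresh counts as consistent."""

    def __init__(self, shared_store: bool) -> None:
        # shared_store: the cache store works across multiple Clouddriver instances.
        self.shared_store = shared_store
        self._requested_at: Dict[Tuple[str, str], float] = {}
        self._last_regular_run: Dict[str, float] = {}

    def on_demand_refresh(self, agent_type: str, resource: str, now: float) -> None:
        self._requested_at[(agent_type, resource)] = now

    def regular_run_completed(self, agent_type: str, now: float) -> None:
        self._last_regular_run[agent_type] = now

    def is_consistent(self, agent_type: str, resource: str) -> bool:
        requested_at = self._requested_at.get((agent_type, resource))
        if requested_at is None:
            return True  # no on-demand refresh is outstanding
        if not self.shared_store:
            return True  # a single local store has nothing to replicate
        # Shared store: wait for a regular run of the same agent type
        # that finished after the on-demand refresh.
        last_run = self._last_regular_run.get(agent_type, float("-inf"))
        return last_run > requested_at


shared = FreshnessTracker(shared_store=True)
shared.regular_run_completed("serverGroups", now=5.0)
shared.on_demand_refresh("serverGroups", "app-v002", now=10.0)
before_next_run = shared.is_consistent("serverGroups", "app-v002")

shared.regular_run_completed("loadBalancers", now=11.0)  # different type: no effect
after_other_type = shared.is_consistent("serverGroups", "app-v002")

shared.regular_run_completed("serverGroups", now=12.0)
after_next_run = shared.is_consistent("serverGroups", "app-v002")

local = FreshnessTracker(shared_store=False)
local.on_demand_refresh("serverGroups", "app-v002", now=10.0)
local_result = local.is_consistent("serverGroups", "app-v002")
```

In practice this means that a Force Cache Refresh task can take up to one full polling interval of the matching agent type to complete, even when the cloud resource itself was ready much earlier.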
Clouddriver handles a couple more important functions that aren't described above:
And voilà! We're now equipped to understand potential bottlenecks and troubleshoot issues. We'll cover that in the next post.
Posted by Guest: Spinnaker Summit