Deduce Your Way To DevOps Tooling

Are technology people inherently irrational when faced with making choices?

I have pondered this for many years, with increasing frequency as the lifetime of infrastructure tooling has rapidly decreased. We seem to be in a perpetual state of bringing the Next Big Thing into the DevOps toolset; we are never content. Gone are the days when everything could be done with OS-packaged Bash and Perl.

Within this state of flux, what makes some of us argue so vehemently for and against particular software tools that are seemingly disposable? More pressingly, how can CTOs and platform leads become more effective at driving tooling choices? I witness so many subjective Ansible v Terraform v Puppet arguments that I want to offer a systematic guide to the decision-making process.

Technology (“science of craft”, from Greek τέχνη, techne, “art, skill, cunning of hand”; and -λογία, -logia) is the collection of techniques, skills, methods, and processes used in the production of goods or services.

When engineering in a DevOps environment, the broader and deeper your toolbox, the better placed you are to pick the right tool for the job. So let’s take this approach to automation tooling.

Workload Reality

When I begin working with a client, the first job is to figure out what they really need to achieve: their DevOps user story. Thinking about automation, I split tasks into two categories: infrastructure tasks and operational tasks. Infrastructure tasks create and destroy things: ALBs, instances, VPCs, and so on. Operational tasks are everything else.

I ask the client to fill out this table with all the tasks they currently do, or would like to do. For some clients we get started by listing their existing run-books. We always involve the people who are actually doing the work; they are the ones who really know what’s going on. I then tend to add in all the tasks they didn’t realise they should be doing.

| Infrastructure Tasks  | Operations Tasks                                       |
|-----------------------|--------------------------------------------------------|
| Create VPC            | Modify ECS clusters                                    |
| Create subnet         | Database restore testing                               |
| Destroy RDS instance  | Manage ephemeral testing environments (query, extend)  |
|                       | Flip database cluster nodes for patching               |
|                       | Update ECS task definitions                            |
|                       | Rotate AWS IAM deployment API keys                     |
|                       | Enable sales reps to spin up demo environments         |

The key understanding here is that infrastructure tasks suit declarative tooling (Terraform, CloudFormation, etc), and operational tasks suit imperative tooling (Ansible, SaltStack, etc).

Declarative approach: You ask a colleague to go out and fetch your lunch, and they come back with a coffee and a sandwich. The coffee is lukewarm; your colleague explains that it started going cold while they were queuing for the sandwich.

With infrastructure tasks you don’t care how things are done, just that the resulting state is correct. If you want to destroy a test environment in AWS, you only care that no infrastructure is left standing; it doesn’t matter which EC2 instance is destroyed first. Great for infrastructure, but not the best for lunch orders.

Imperative approach: You go out for your own lunch. You buy the sandwich first and then the coffee, because you know this ensures the coffee will still be hot when you get back to your desk. You deliberately ordered the path to the end state.

On the other hand, operational tasks are those where you generally do care about how things are done. Perhaps there is an operational task to test that a database backup can be restored. This would be done with an imperative tool, through an ordered series of state changes.
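
To make the distinction concrete, here is a minimal Python sketch of the two paradigms. Every function here is an illustrative stand-in, not any real tool’s API.

```python
# A minimal sketch contrasting the two paradigms. All functions here are
# illustrative stand-ins, not any real tool's API.

def create_server():
    print("created server")

def destroy_server():
    print("destroyed server")

# Declarative: state the desired end state; the tool decides the steps.
def reconcile(current_count, desired_count):
    """Converge on desired_count; the order of changes is the tool's concern."""
    diff = desired_count - current_count
    for _ in range(abs(diff)):
        create_server() if diff > 0 else destroy_server()

# Imperative: state the steps in order, because the order matters.
def test_database_restore():
    print("1. fetch latest backup")       # must happen first
    print("2. restore to staging")        # needs the backup
    print("3. run verification queries")  # verify before cleanup
    print("4. destroy staging copy")      # always last

reconcile(current_count=1, desired_count=3)  # two create calls, any order
test_database_restore()                      # exactly this order, every time
```

The declarative half only cares about closing the gap between current and desired state; the imperative half encodes the order itself.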

Deduce The Tooling Paradigm

Tool choice should be guided by reason, and here we make the first deduction. Our goal with DevOps is to reduce the friction in the tasks we do. We can summarise the task list and give a sense of the person-hours required.

The following numbers are based on a fictitious 20-dev web company, using Agile.

|                      | Infrastructure Tasks     | Operational Tasks |
|----------------------|--------------------------|-------------------|
| Task Count           | 4                        | 7                 |
| Manual Task Hours    | 6                        | 8                 |
| Weekly Run Hours     | 1                        | 40                |
| Monthly Run Hours    | 4                        | 160               |
| Task Type Weighting  | 2.4%                     | 97.6%             |
| Suitable paradigm    | declarative, imperative  | imperative        |

(imperative is suitable for infrastructure tasks simply because, in a logical sense, imperative programming can define any order of operations and can therefore replicate declarative behaviour)
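
For transparency, the weighting is simply each category’s share of the total monthly run hours. A quick sketch of the arithmetic:

```python
# Task type weighting, derived from the monthly run hours in the table above.
infra_hours, ops_hours = 4, 160
total = infra_hours + ops_hours

print(f"Infrastructure: {infra_hours / total:.1%}")  # 2.4%
print(f"Operational:    {ops_hours / total:.1%}")    # 97.6%
```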

While there are many ways to slice it, any CTO or team lead will be concerned by operational hours spent on repetitive tasks. The above weighting example (heavily operational) is typical of most DevOps environments.

An interesting side note here: many web startups call for DevOps engineers when they get to about year two or three, having got that far with developers standing up the infrastructure and CI/CD pipelines themselves. At this two or three year mark, a typical startup has built a revenue stream that must be protected (with app reliability, scalable performance, etc). It is often a great challenge for a company to recognise this shift towards a more operationally focused set of needs.

Push Me Pull Me

The next deduction is which delivery method makes most sense. When creating things (EC2 instances, load balancers, etc) there is nothing yet for an agent to run on, so pushing change is the only way. Declarative tooling must therefore use the push method of delivery, which implies it must run from a control node (laptop, instance). Nice and easy.

It is also fairly easy to assign a delivery method to operational tasks. If the list of operational tasks includes creating any infrastructure (eg, testing database restores), then the push method is the reasoned choice.

There is an exception to this: for active resources (instances, etc) an agent process can pull configuration changes. The conundrum is that you must initially push that agent onto the resource, meaning you already have a push-based automation tool.
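
A rough Python sketch of the two delivery methods may help. The hostnames, the config source and the apply logic are all placeholders, not any particular tool’s protocol.

```python
import subprocess
import time

def fetch_config(url):
    """Placeholder: download the latest desired configuration."""
    return {"packages": ["nginx"]}

def apply_locally(config):
    """Placeholder: converge this machine on the given configuration."""
    print(f"applying {config}")

def push(hosts, command):
    """Push: a control node (laptop, CI runner) connects out to each host."""
    for host in hosts:
        subprocess.run(["ssh", host, command], check=True)

def pull_agent(config_url, interval_seconds=1800):
    """Pull: an agent on the resource polls for changes and applies them.
    The conundrum: something had to push this agent onto the host first."""
    while True:
        apply_locally(fetch_config(config_url))
        time.sleep(interval_seconds)
```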

For CTOs/leads the question here is about organisational structure. My initial probing when arriving in a company using multiple automation tools is:

  • Are infrastructure and configuration management distinct teams with little or no shared code?
  • Is there a compelling reason to run two, potentially conflicting, automation tools?

Fewer technologies in play make for a leaner, more efficient DevOps environment. However, a warning sign that too few tools are in use is the number of hacks and workarounds employed to overcome unsuitable tooling. As someone once said, “Make things as simple as possible, but no simpler.”

Here we define the delivery method as a choice of reason:

|                      | Infrastructure Tasks     | Operational Tasks     |
|----------------------|--------------------------|-----------------------|
| Task Type Weighting  | 2.4%                     | 97.6%                 |
| Suitable paradigm    | declarative, imperative  | imperative            |
| Delivery Method      | push                     | push, sometimes pull  |

The push delivery method wins in most cases for the combo of infrastructure and operational tasks. When agents pull changes, you must always add additional infrastructure to monitor and check the agents.
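
That monitoring need is easy to underestimate. A toy sketch of the extra work the pull model creates, with made-up hosts and thresholds:

```python
# A sketch of the extra monitoring the pull model demands: something must
# notice when an agent stops checking in. Hosts, timestamps and the
# threshold are illustrative only.
import time

last_check_in = {                 # host -> epoch seconds of last agent report
    "web-1": time.time() - 600,   # checked in 10 minutes ago
    "web-2": time.time() - 7200,  # silent for 2 hours
}

STALE_AFTER_SECONDS = 3600        # alert if no report within the hour

for host, seen in last_check_in.items():
    if time.time() - seen > STALE_AFTER_SECONDS:
        print(f"ALERT: {host} agent has not pulled config recently")
```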

Decision Time

In my example of a web company maturing to a more operationally focused state, we have refined our options using deduction:

Most recurring tasks will be operational, which suits the imperative paradigm, and we are required to push changes because some of those tasks also create infrastructure.

Let’s list out the current crop of continuous configuration automation (CCA) tools, and cross off any that do not meet the requirements. Our goal (perhaps as CTO or lead) is to arrive at a rational tooling choice, which results in a low-friction DevOps environment.

| Tool           | Released By | Paradigm                 | Delivery   | Language |
|----------------|-------------|--------------------------|------------|----------|
| Ansible        | Red Hat     | imperative               | push       | Python   |
| Chef           | Chef        | declarative, imperative  | push, pull | Ruby     |
| CloudFormation | AWS         | declarative              | push       | -        |
| Puppet         | Puppet      | declarative              | pull       | Ruby     |
| SaltStack      | SaltStack   | declarative, imperative  | push, pull | Python   |
| Terraform      | HashiCorp   | declarative              | push       | Go       |
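
The cross-off step can be made completely mechanical by encoding the table as data and filtering on our two deduced requirements. A small sketch:

```python
# The tool table as data; paradigm and delivery values are taken from the
# table above. Filtering on our requirements does the crossing off.
tools = [
    {"name": "Ansible",        "paradigm": {"imperative"},                "delivery": {"push"}},
    {"name": "Chef",           "paradigm": {"declarative", "imperative"}, "delivery": {"push", "pull"}},
    {"name": "CloudFormation", "paradigm": {"declarative"},               "delivery": {"push"}},
    {"name": "Puppet",         "paradigm": {"declarative"},               "delivery": {"pull"}},
    {"name": "SaltStack",      "paradigm": {"declarative", "imperative"}, "delivery": {"push", "pull"}},
    {"name": "Terraform",      "paradigm": {"declarative"},               "delivery": {"push"}},
]

candidates = [
    t["name"] for t in tools
    if "imperative" in t["paradigm"] and "push" in t["delivery"]
]
print(candidates)  # ['Ansible', 'Chef', 'SaltStack']
```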

At this point we have some remaining options to choose from, all of which meet our fictional company requirements. Refining the choice from here could be down to the skills available in the team or job market, tool maturity and community support.

It is a straightforward choice to go with Ansible in this case. We gain tool stability and maturity, plenty of community support, and the client has a ready supply of Python capable engineers. Ansible suits operational tasks, and supports all the cloud infrastructure building we require.

Requirements for other companies may come out differently, but the process of deduction from the initial list of tasks should be the same. The benefit of a systematic process is that it is transparent and easily documented. It also makes it easy to know when your requirements have changed.

The above table of options would look a little different if we were weighted towards automating infrastructure tasks. We’d cross off Puppet as an option, but all the others would remain on the table.

One Deductive Process, Not One Tool

DevOps tends to be an operations-heavy job, and we focus on automating processes. We focus on process because this is where automation delivers the largest gains, from time savings to increased reliability.

An objective method for choosing DevOps tooling is important. I don’t believe technology people inherently make irrational choices. I think we’re just people, and people tend to be poor at rational decision making. We tend to go with what we know, or are biased by the more recent and frequent events we’re conscious of (eg. StackOverflow search results).

Using a systematic approach to tool choice should put us all in a better place. And if DevOps is about anything it’s about reducing the manual effort at the keyboard and increasing time spent engineering for simplicity.

Written on September 7, 2019