Hatofmonkeys

Thoughts no one would pay me to write

Configuration Management isn’t Stupid, but it Should Be

Devops is about holistic, systems-orientated thinking; it’s been misappropriated to be about configuration management. I’ve noticed a decline in the number of people from a development background at Devops conferences – maybe they’ve lost interest in talking about Puppet vs Chef vs Salt vs Ansible vs CFEngine vs X vs Y vs Z?

Bricks

The role of infrastructure should be to provide reliable, consumable bricks to enable innovation at higher levels. If we create beautiful, unique, novel bricks it becomes impossible to build houses. This is a problem I see regularly with OpenStack deployments; they’re amazing, wonderful, unique, organic. I cannot, however, easily deploy platforms to them.

Configuration Management

This issue manifests itself in deployments orchestrated by configuration management.

  • Complexity – you can do some amazing things in Chef as you have the power and flexibility to run arbitrary Ruby during your deployments. I have seen this abused many times – and done this myself. Too much clever branching logic and too little reliable code
  • Determinism – configuration management tools often provide a thin veneer over non-deterministic operating system commands
  • Reproducibility – server scaling operations often fail due to poor dependency management and non-deterministic actions

Configuration management is too focused on innovation at the server level rather than thinking about the entire system. Devops has become a silo.

A Better Way

There are some tools and patterns emerging to tackle these problems.

  • Immutable infrastructure – remove the drift
  • Docker/Decker – testable, simple, small, disposable containers
  • Nix – declarative, deterministic package management
  • OSV – stupid(brilliant) operating system to enable innovation
  • BOSH – stupid(brilliant) tool to deploy complex distributed systems
  • Mesos – schedule commands(jobs) to run in a distributed environment

‘Infrastructure as Code’ to ‘Infrastructure as Good Code’

We need SOLID for infrastructure. We need to develop standardised, commoditised, loosely coupled, single-responsibility components from which we can build higher-order systems and services. Only then will we be enabling innovation higher up the value chain.

Devops should be about enabling the business to deliver effectively. We’ve got stuck up our own arses, or rather, our own configuration management.

When to Pass on a PaaS

There’s no point pretending I don’t love ‘The PaaS’. I do. I have spent too much of my career fighting battles I shouldn’t be fighting; re-inventing similar-looking systems time and time again. The idea I could just drop an application into a flexible platform and expect it to run, without consuming my entire life for the preceding three months writing Chef cookbooks and wrestling EC2, sounds fantastic.

Cloud Foundry

Having played with Heroku, and being dismayed at not being able to play with my own Heroku, I was overjoyed when VMware released Cloud Foundry. I created a Vagrant box on the Cloud Foundry release day and began distributing it to clients. I worked with one of my clients at the time, OpenCredo, to develop and deploy one of their new services to a Cloud Foundry installation I created. I believe this was the first SLA-led production deployment of Cloud Foundry globally.

I spoke about the rationale behind running/developing your own PaaS at QCon London 2012. I also discussed some of the use cases I’d fulfilled using PaaS, including OpenCredo’s.

QCon 2012 – Lessons Learned Deploying PaaS

OpenShift

I was similarly happy when I heard RedHat had bought Makara, a product I’d briefly experimented with, and were looking at producing their own PaaS. I’ve used RedHat-based systems for many years with great success, have always found YUM/RPM great to use, and was apparently the 7th person in the UK to achieve RedHat Architect status. A RedHat-delivered PaaS would surely be the panacea for all my problems.

I was scoping a large project at this time with availability as a prime concern. It occurred to me that I could use two turnkey PaaSes simultaneously, Cloud Foundry and OpenShift, such that if there was an issue with either I could simply direct all traffic to the other PaaS. I discussed the deployment and progress at DevopsDays.

DevopsDays Rome 2012 – How I Learned to Stop Worrying and Love the PaaS – I start about 32 minutes in.

From Dream to Reality

Unfortunately, the project didn’t quite work out as planned. We had a number of issues with OpenShift which meant we had no choice but to withdraw it from production usage. Scalability was an enormous problem: bringing an application to production scale was a sub-twenty-second operation in Cloud Foundry, but took forty-eight hours or more in OpenShift. We had to write our own deployment and orchestration layer for OpenShift based on Chef and shell, whereas Cloud Foundry has the fantastic BOSH tool enabling deployment, scaling, and upgrades. These reasons, alongside some nasty bugs and outages, meant we were unable to use OpenShift for our deployment.

Beyond this I feel OpenShift, like many ‘PaaSish’ systems, has got the focus wrong. There seems to be a plethora of container-orchestration systems being produced at the moment, which are really just a slight re-focus on the IaaS abstraction layer. OpenShift is in danger of falling into this trap. PaaS needs to remain focussed on the application as the unit of currency, and not the container or virtual machine. It looks entirely possible (and would make an interesting project) to run Cloud Foundry inside OpenShift, illustrating the conceptual difference.

We settled on using distributed Cloud Foundry instances across diverse IaaS providers to deliver the project; it was a great success. I blogged about it for Cloud Foundry’s blog.

CloudFoundry.com blog post – UK Charity Raises Record Donations Powered by Cloud Foundry

Which PaaS?

I’ve remained supportive of RedHat’s efforts to deliver a PaaS solution but fear they’re not quite there yet. I organised a London PaaS User Group meetup to help RedHat to put their side of the case across to the London community. It sounded like they had some exciting developments in the pipeline but even with the new features it’s likely we would have been unable to deliver the enterprise-grade services our project required.

Perhaps RedHat’s customer base are largely systems administrators rather than developers. Perhaps RedHat have more experience at deploying and managing servers than applications. For whatever reason, I think it would be a denigration of PaaS to allow it to be misconstrued as container-based IaaS. Containers can be a by-product of service provision but should not be the focus of PaaS.

Cloud Foundry isn’t perfect but, at the moment, is the only PaaS product I’d recommend to anyone looking to make a long-term investment.

System Build Reproducibility

I’ve been on the receiving end of build reproducibility rants from developers at plenty of conferences. Their bile is usually aimed at Maven’s snapshot functionality. I’ve often questioned how reproducible their systems are; I’m usually met by a blank look.

I’ve always aimed to make system builds reproducible, but with little success. Gem, pear, pecl, rpm, license agreements, configure/make/make install: they all take their toll. This can lead to inconsistent builds between environments – or even in a single tier – due to scaling up/down.

As I’ve tended to use RPM-based systems (misspent youth), I’ve attempted, wherever possible, to get all non-configuration files on a server into RPMs. I’ve been more promiscuous with configuration management, moving from home-grown, to Cfengine, via Puppet, to Chef. I’m currently using chef-solo, with tooling such as Noah and MCollective for orchestration. Don’t even mention the number of deployment/ALM tooling solutions I’ve been through (although Capistrano has never annoyed me to any great extent).

Even with long term usage of RPMs, build reproducibility has been far from simple. RH Satellite/Spacewalk should make this easy, but unfortunately it’s a bloated mess. I’ve usually resorted to simple apache/createrepo, but this poses its own problems. Do you have a repo per environment? How do you track which servers were built against which repo? How do you roll out updates in a manageable fashion?

I’ve created a simple setup called Yumtags! to address some of these issues. The basic idea is that you can drop RPMs into a directory, and then “freeze” the directory at that point in time by creating and storing repository metadata against a tag. This tag can then be used, perhaps in a chef-solo-driven repository definition, to update, build, and reproduce systems in a known state. It currently features simple JSON-driven integration for CI systems, so RPM-based integration pipelines can be easily automated. There are a million and one things missing from it, but now that it handles the basic story I’ve shared it for others to hack on.
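To make the “freeze” idea concrete, here is a minimal sketch in Python. It is not how Yumtags! is implemented: a JSON manifest of file checksums stands in for real repository metadata, and the names `freeze` and `packages_for` are hypothetical. The point it illustrates is that a tag, once written, never changes, so anything built against it sees the same package set.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def freeze(rpm_dir: Path, tag_dir: Path, tag: str) -> Path:
    """Record the current contents of rpm_dir against a tag (hypothetical helper)."""
    manifest_path = tag_dir / f"{tag}.json"
    if manifest_path.exists():
        # A frozen tag is immutable; re-freezing would break reproducibility.
        raise ValueError(f"tag {tag!r} is already frozen")
    packages = {
        rpm.name: hashlib.sha256(rpm.read_bytes()).hexdigest()
        for rpm in sorted(rpm_dir.glob("*.rpm"))
    }
    tag_dir.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps({"tag": tag, "packages": packages}, indent=2))
    return manifest_path


def packages_for(tag_dir: Path, tag: str) -> dict:
    """Return the package name -> checksum mapping stored for a tag."""
    return json.loads((tag_dir / f"{tag}.json").read_text())["packages"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rpms, tags = Path(tmp) / "rpms", Path(tmp) / "tags"
        rpms.mkdir()
        (rpms / "app-1.0.rpm").write_bytes(b"placeholder contents")
        freeze(rpms, tags, "build-1")
        # Packages dropped in later do not affect the frozen tag.
        (rpms / "app-1.1.rpm").write_bytes(b"newer placeholder")
        print(sorted(packages_for(tags, "build-1")))
```

A server built against `build-1` can be rebuilt later with the same package set, which answers the “which servers were built against which repo?” question by recording the tag rather than the repo.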

Monitoring-Driven Operations

RAMBLING BLOG POST ALERT

Monitoring sucks.

Following the “if it hurts, do it more often” mantra that has driven the success of patterns such as Continuous Delivery, there might be some value in jumping head-first into the world of monitoring.

I’ve been evangelising to anyone that will listen, and a lot of people that won’t, about declarative/convergent frameworks for some time now. Sometimes you have to describe how to converge (definitions may not yet exist), so I particularly enjoy working with frameworks such as Chef that enable you to easily move from declarative to imperative as the need arises.

These two trains of thought(monitoring + system convergence) collided a while back to make me think about Monitoring-Driven Operations, i.e. that we can declare the status of the monitoring systems (that servers and services are available during expected hours) and converge the environment(s) on this desired state if there’s a gap between observed and desired states. MDO for operations, TDD for developers.

It turned out I wasn’t alone in thinking about monitoring from this perspective.

Reusing the application behaviours in production is a natural, logical extension of the testing pipeline. Were we able to ensure that the behaviours include non-functional requirements, would it be possible to use these monitored behaviours to converge an environment towards a state that passes all scenarios?

The missing link here is the imperative element. We can declare the desired state for an environment (cucumber-nagios is a good start); however, we need a framework to express how we get there. I think Chef/Puppet can help with a lot of the hard labour here, but I don’t think that either, in its current format, is appropriate to converge a service from a failed monitoring check.

Cloud Foundry(brilliant) uses a health manager, message bus, and cloud controller interacting to accomplish something similar. In this situation the cloud controller knows how to converge the state of the environment when the health manager observes a disparity between observed and desired states.

I’m thinking about developing a system that works in the following way(“The Escalator”). Please get in contact if you’ve got any feedback.

  • Barry McDevops declares an escalation pattern for his environments. These are the levels of escalation for convergence that failing checks can be matched to. As a crude example:
    1. Create account with IaaS provider
    2. Create core networking and support infrastructure
    3. Create tier networking and all VMs
    4. Create an individual VM
    5. Converge VM with Chef
    6. (Re)deploy application
  • Barry creates a series of monitoring checks for the non-functional requirements
    1. IaaS provider is online
    2. (per node) ICMP ping reply
    3. (per node) app and DB services are online
    4. (per tier) app and DB services are online
    5. Smoke check of application
  • Barry maps each monitoring check to an escalation level
    • Check 1 => Level 1
    • Check 2 => Level 4
    • Check 3 => Level 5
    • Check 4 => Level 3
    • Check 5 => Level 6
  • Once engaged, the monitoring system will quiesce for 30 seconds, observe the most drastic escalation level required by any failing check (i.e. the lowest-numbered level), and then ask the escalation system to carry out that level and every subsequent level.

As examples:

  • If there is nothing in place (failed IaaS provider or a new system), an entire new environment is built from scratch (escalation steps 1 through 6).
  • If a node fails, a new node is built, converged, and deployed to (steps 4 through 6).
  • If the application fails the smoke test, the application is redeployed (step 6).
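The mapping from failed checks to escalation steps can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing tool: the names `LEVELS`, `CHECK_TO_LEVEL` and `plan` are made up here, and it assumes that the escalation to take is the lowest-numbered level among the failing checks, followed by every later level, which is what the examples imply.

```python
# Escalation levels from the crude example above, most drastic first.
LEVELS = {
    1: "Create account with IaaS provider",
    2: "Create core networking and support infrastructure",
    3: "Create tier networking and all VMs",
    4: "Create an individual VM",
    5: "Converge VM with Chef",
    6: "(Re)deploy application",
}

# Barry's mapping of monitoring check number to escalation level.
CHECK_TO_LEVEL = {1: 1, 2: 4, 3: 5, 4: 3, 5: 6}


def plan(failed_checks):
    """Return the escalation level numbers to run, in order, for the failed checks."""
    if not failed_checks:
        return []
    # The most drastic required level is the lowest-numbered one.
    start = min(CHECK_TO_LEVEL[check] for check in failed_checks)
    return list(range(start, max(LEVELS) + 1))


if __name__ == "__main__":
    for level in plan({2}):  # a node stops answering ICMP ping
        print(level, LEVELS[level])
```

Quiescence periods, retries and paging an administrator are deliberately left out; they would wrap around `plan` rather than change it.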

Obviously it is up to Barry to ensure his escalation steps are sensible, e.g. use multiple IaaS providers, redeploy previous versions of applications if they persistently fail smoke tests. Each escalation step should declare a quiescence period, during which no further actions will be taken. There’s no point attempting to deploy an application if you have no nodes.

If an escalation process fails, the monitoring system could attempt a re-convergence if necessary or contact an administrator.

This overlaps significantly with a couple of discussions: here and here at the recent Opscode Community Summit, so perhaps someone else is already creating something similar with Chef.

Bridging the Gap Between Functional and Non-Functional

According to the principles behind the Agile Manifesto:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

As I’m from an operations background, I’ve always had great trouble communicating with the customer over requirements that have been relevant to my domain. The usual situation is that the “customer”, often the product owner, views the working (i.e. functionally complete) software on a developer’s machine, but bringing this to an operational service level is an afterthought (until it goes offline, and the real customers start complaining).

Traditionally, capturing these specifications has fallen under the umbrella of “non-functional requirements”. As Tom Sulston pointed out to me, this suggests requirements that aren’t working, which is precisely the opposite of what we’re attempting to express.

I’ve sought to tackle this in a couple of ways.

  1. Spread FUD throughout the non-technical customer base.

    • “Do you want it to explode?”
    • “NO!”
    • “Well you should ask for non-exploding software in your stories then!”.
  2. Get the team in a room together and express “cross-cutting concerns” (I half-inched the idea from AOP) that span the project as a whole.

I haven’t been happy with the results of either approach, so I’d be interested to talk to anyone with a satisfactory solution in this space.

Continuous Deployment - A Vanity Metric?

I’ve recently seen a few articles/presentations carrying the claim “481,000 deployments a day!”, “Deployment every 3 seconds!”, or “We deploy more frequently than we breathe – or we sack the junior ops guy!”. Very exciting.

Having the capability to deploy frequently is important for a variety of reasons: fast feedback, quickly realising value, reducing risk deltas, increasing confidence, and so on. However, I also think that frequent deployment is useless without making use of that feedback.

Continuous Deployment is an enabler of fast feedback, but it’s not the end goal. If the feedback isn’t utilised by product owners to inform their decisions, there’s little point in creating it. The practice becomes a local optimisation.

I’ve deliberately chosen to differentiate Continuous Delivery from Continuous Deployment here, as I believe Continuous Delivery implies that value is being delivered, whereas Continuous Deployment suggests focusing on deploying frequently.

We need to optimise cycle time for the whole business, not just the dev/ops/devops/<current silo label> team.

The Testing Pipeline

I was fortunate in being able to attend Citcon this weekend and met some wonderful, talented people. I proposed a couple of openspaces, one loosely based on automating value and the other around test reuse.

I’ve been thinking for a while about how I see people (including myself) rewrite the same business logic in our behavioural specifications, for our integration tests, for our performance tests, for our security tests, and for our behaviour-driven monitoring. This seems counter-intuitive to someone as lazy as I am.

I think we should reconsider the deployment pipeline as a testing pipeline. The purpose of the pipeline is to increase confidence in the release candidate as the stages are passed. These stages can be considered as hoops for the release candidates to jump through. As new functionality is added, so the hoops will need to be tailored to ensure the new functionality is fit for purpose.

Here’s an idea of a delivery pipeline I’ve seen used many times before. It’s in no way ideal, nor suitable for every use case, but it provides an example for discussion:

Behaviours(product owner) –> unit(dev) –> code(dev) –> VCS –> unit tests(CI) –> behavioural tests(UAT) –> load tests(staging) –> security tests(staging) –> monitoring(production)

The issue I’m attempting to highlight is that as desired functionality is added by the product owner (typically as a feature in the backlog) and committed to a release candidate by the developer, so the hoops need to ensure that the new functionality is delivered as specified, and no regression bugs have been introduced. This typically involves either the new functionality going untested through load/security/monitoring, or a developer/operations person having to rewrite the new business logic repeatedly in the DSLs of whichever tools are used to create the hoops.

I’m currently using cucumber-nagios to reuse our functional behaviours for monitoring and I’ve also had some success using PushToTest with Selenium from Cucumber. I’ve yet to look at how to tackle security requirements. Please get in contact with me if you’ve had any success reusing your business logic assertions throughout your pipeline; I’d be very interested to hear about your experiences.

My (current) vision would be that, either as a part of or along with a release candidate, an artefact is created that outlines the business value assertions made about the candidate. As the candidate then moves through the pipeline, so the hoops are updated with the relevant assertions they need to test about the candidate. These could be a certain number of virtual users through the user journeys in the expected ratios, lack of injection holes in the application, the application’s ability to degrade gracefully through various failure modes, and anything else that’s required to establish confidence in the release candidate.
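As a rough sketch of what such an artefact might look like, the Python below declares the user journeys once and lets each stage derive what it needs from them. Everything here is hypothetical: the artefact layout, the journey names, the ratios, and the helpers `journeys_for` and `load_profile` are illustrative assumptions, not an existing format.

```python
# Hypothetical artefact shipped alongside a release candidate.
# Each journey is declared once; stages pick out what they need.
ARTEFACT = {
    "journeys": [
        {"name": "browse", "ratio": 7, "stages": ["UAT", "staging", "production"]},
        {"name": "search", "ratio": 2, "stages": ["UAT", "staging"]},
        {"name": "checkout", "ratio": 1, "stages": ["UAT", "staging", "production"]},
    ]
}


def journeys_for(artefact, stage):
    """Return the names of the journeys a given pipeline stage should exercise."""
    return [j["name"] for j in artefact["journeys"] if stage in j["stages"]]


def load_profile(artefact, virtual_users):
    """Split virtual users across journeys in the declared ratios (rounding down)."""
    journeys = artefact["journeys"]
    total = sum(j["ratio"] for j in journeys)
    return {j["name"]: virtual_users * j["ratio"] // total for j in journeys}


if __name__ == "__main__":
    print(journeys_for(ARTEFACT, "production"))
    print(load_profile(ARTEFACT, 100))
```

The useful property is that adding a journey to the artefact updates the behavioural, load and monitoring hoops together, instead of requiring the same business logic to be rewritten in each tool’s DSL.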

You should be able to make an API call to cloud-based testing providers(such as Soasta) so they can use your artefact to automatically test the behaviour of your application, such as external load and security tests. The feedback from these providers can be returned to your CI server for tracking over time, and form a basis for diagnosis of any assertion failures.

Without a fully-automated testing pipeline, I see one of two outcomes regularly at the moment:

  • New functionality is not tested before release. This implies risk.
  • Releases are delayed while scripts are updated, and manual load/security tests booked with vendors and executed. This inhibits business agility and delays feedback cycles.

I’m attempting to write an abstraction layer for Cucumber scenarios to be reused in load tests. If you’re interested in helping, please get in touch. I’ll post it on the Githubs once I’ve got something small working.

Automating Value

I’ve always(well, for longer than my attention span) been a massive fan of outside-in development via BDD, in particular Cucumber. Despite this affection I’ve thought it a little unusual that BDD frameworks ignore/discard the business value proposition stated at the beginning of a feature. It’s always felt like we’re saying “that’s too difficult to do anything useful with, we’ll leave that to the business idiots”. With that in mind, and being a business idiot, I’m going to attempt to construct a model for value and automate the testing of business-based assertions.

What I’m really trying to achieve is to extend the automated feedback cycle back into the business. The sooner the decision makers gain meaningful feedback about the impact of their decisions, the sooner they can start making useful decisions (and fewer useless ones).

I think this is where the devops camp needs to join forces with the lean startup people (*dev*ops*). I’m very interested in exploring if there’s any useful mileage in attempting to automate the Deming Cycle, and if there isn’t, to examine the reasons why.

Humans are great for the PLAN (hypothesise) phase, although I’ve heard and read some interesting thoughts around AI and entropy being used to automate this (Skynet++).

Devops and the associated tooling are great for the DO phase, and the Continuous Delivery bible has had a massive impact here.

CHECK is often currently left either to chance or abstracted via some BI monstrosity. This is where I feel we may benefit from exploring automating assertions on business value. Most companies I work with have some capabilities around business metrics, but they’re infrequently linked back to the planning/strategic capabilities.

In theory an automated system can use some form of immunity metrics to roll back if the business assertions are not met, thus providing an element of automation around the ACT phase. I’ve seen a variety of immune systems employed around business metrics, however they often seem to be used as a tactical weapon rather than strategically to inform business decisions.
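A very small sketch of such an immune system, in Python, might compare a business metric before and after a release and decide whether to roll back. The function name `act`, the 5% threshold and the idea of a single scalar metric are all assumptions made for illustration; a real system would need far more care over noise and sample sizes.

```python
def act(baseline, observed, max_drop=0.05):
    """Decide the ACT step: roll back if the metric fell by more than max_drop.

    baseline and observed are values of one business metric (for example
    conversions per hour) before and after the release. max_drop is the
    tolerated relative fall, 5% by default (an arbitrary illustrative figure).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    drop = (baseline - observed) / baseline
    return "rollback" if drop > max_drop else "keep"


if __name__ == "__main__":
    print(act(baseline=200, observed=150))  # a 25% fall
    print(act(baseline=200, observed=198))  # a 1% fall
```

Used tactically this only protects the release; the strategic use is feeding the same comparison back to whoever made the original value hypothesis.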

There are clear issues with attempting any kind of valid implementation, the obvious ones being around gaining useful feedback on a value proposition. From what I’ve looked at so far, automating value raises far more questions than it answers, but if we don’t push the boundaries on automating the feedback cycle then I feel we’re just locally optimising. I’ve had some really useful input from the folks at OpenCredo, who are thinking along similar (but better formulated) lines; now I’m trying to actually make this concrete.

At the moment I’m working on extending cucumber, specifically on “Lines 2-4 are unparsed text, which is expected to describe the business value of this feature”. I’d like to change this, making the business value executable (and testable) so we can assert the business value is being delivered as promised.

I’d like to change this

  Feature: Addition
    In order to avoid silly mistakes
    As a math idiot
    I want to be told the sum of two numbers

To this

  Feature: Addition
    In order to make 0 silly mistakes
    As a math idiot
    I want to be told the sum of two numbers

Or this

  Feature: Addition
    In order to make fewer silly mistakes
    As a math idiot
    I want to be told the sum of two numbers

And I want to assert that delivering this feature ensures the relative/absolute quantity of silly mistakes.
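To show what “executable business value” could mean, here is a minimal Python sketch that turns the value line into an assertion. It is not the cucumber-value implementation; the pattern, the `parse_value` and `holds` helpers, and the restriction to “make N / fewer / more” phrasing are assumptions chosen to match the examples above.

```python
import re

# Matches value lines such as "In order to make 0 silly mistakes"
# or "In order to make fewer silly mistakes" (assumed phrasing).
VALUE_LINE = re.compile(r"In order to make (?P<target>\d+|fewer|more) (?P<metric>.+)")


def parse_value(line):
    """Turn a value line into an assertion dict, or None if it is not testable."""
    match = VALUE_LINE.fullmatch(line.strip())
    if match is None:
        return None
    target, metric = match["target"], match["metric"]
    if target.isdigit():
        return {"metric": metric, "kind": "absolute", "target": int(target)}
    return {"metric": metric, "kind": "relative", "direction": target}


def holds(assertion, before, after):
    """Check the assertion against the metric measured before and after delivery."""
    if assertion["kind"] == "absolute":
        return after == assertion["target"]
    if assertion["direction"] == "fewer":
        return after < before
    return after > before


if __name__ == "__main__":
    assertion = parse_value("    In order to make fewer silly mistakes")
    print(assertion, holds(assertion, before=5, after=3))
```

Note that the original wording, “In order to avoid silly mistakes”, parses to `None`: it states an intent but gives nothing to measure, which is exactly the gap being described.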

Attempting to write the code for this has made me realise how much time I’ve wasted in meetings of late rather than making myself useful, so if you’d like to lend a hand on the implementation of these ideas, please get in contact. I’ll post on this blog when I get the proof of concept up on the Githubs.

UPDATE Cucumber-value has now been dragged kicking and screaming on to Rubygems and been inserted forcibly up those Githubs.

Devops Days

I had the good fortune to attend Devopsdays Goteborg a few weeks back, and met a whole bunch of wonderful people. Many thanks to the legendary, words-cannot-do-this-man-justice Patrick Debois and the other organisers.

Whilst there seemed to be a lot of fantastic, enlightening conversation around the devops space, there also seemed to be a lot of complaints about lack of sponsorship for devops from those people in “the business”. I took my usual subtle, sensitive approach to a perceived problem and proposed an exploratory openspace entitled “F**k devops, noone cares, where’s my money?”. Whilst I undermined myself somewhat by turning up late for my own openspace, my ideas didn’t seem to get a huge amount of traction.

“The business case for devops” (very faint on the board)

This was probably because there was a tools-orientated openspace happening in another room.

Devops’s Achilles heel is the potential it has for being introspective (ignore the ITIL/process arguments; absolute nonsense) and purely tooling-focused. Changing the THEM and US attitude from dev vs ops to devops vs THE MANAGEMENT just shifts the issue. Patrick Debois brilliantly summarises that we should, instead, be thinking about *dev*ops*; unfortunately some of the devops people I’ve met seem to be more interested in teaming up against others than including them.

I’m taking a look at how I think we can address this problem, by perhaps changing the focus from “delivering” to “delivering value”. Coming from a devops background, I’m obviously going to create a tool to help me. That should teach those management types a lesson.