BOSH deploys and manages all the things. It’s the best distributed-system-system I’ve ever used.
BOSH should be the deployer of Diego and other container schedulers.
I currently favour option 2 but reserve the right to change my mind as I learn over time. Previous attempts to build multi-purpose stateful/stateless PaaSs haven’t gone amazingly well. If you’re building a stateful service right now, build it for BOSH, not Diego.
We’ll be exploring the cool new future of Diego in a LoPUG hack day and looking at ideas like a Diego CPI for BOSH, persistent disks, and TCP networking in PaaS. Join us to find out more.
Today, January 5th, is the last day of the twelve days of Christmas, so this is the last blog in The Twelve Blogs of Christmas. I survived. I may not have saved the best for last, but I’ve certainly saved the most important. This post is all about people, and how to get people delivering value. Great people make great things. Great things make great profit, or so I’m hoping.
This post is about growing the culture within my company, CloudCredo, and the processes we use. I’ve spoken to many senior people in technology who believe you shouldn’t tell developers how to develop, and should just let them get on with it. I firmly disagree. I think a company should be clear about how it wants to deliver value; it will then attract people who ‘fit’ its culture. If you don’t choose a way of working you are effectively saying “We let the senior staff bully the junior staff, code like cowboys, and do whatever they like.” But that’s just my opinion.
I presented most of this content at Extreme Programmers London and had a great response. I thoroughly recommend attending their meetups.
I have been massively influenced by “The Pivotal Way”, in fact our working practices emulate it in most aspects. Imitation is the sincerest form of flattery. I had the tremendous good fortune to be invited to spend a few months in Pivotal’s Cloud Foundry Dojo (I believe I may have been the first person through it) during CloudCredo’s early days, which was an amazing experience. I’ve taken those learnings into CloudCredo and iterated on them.
We’re using the processes to deploy and develop agile platforms and infrastructure – pair-programming test-driven development with our clients. We use the Tracker-based planning process to empower authoritative product managers to prioritise bugs and features on multi-functional teams. You could call them ‘devops’ teams but I think that label has had all possible value eroded; they’re teams focused on delivering business value. You write it, you run it.
I’m obsessed with fast feedback: I view the process as a set of shrinking feedback loops, making their way inwards towards the continuous feedback of pair programming, and expanding back out again. These loops are a good starting point until you find something that might work better. Have the courage to experiment. This content is severely abridged for brevity.
0. Pre-inception – qualify readiness to start the process
1. Inception – reach team consensus on a meaningful step forward, typically about three months
2. Iteration Planning Meeting – plan the next iteration and point the stories
3. Standup – today’s work and pairs
4. TDD (for design, refactoring, confidence, and numerous other reasons)
4. Acceptance – by an empowered product manager, the ‘CEO of the product’
3. Standup – helps/interestings/blockers from yesterday
2. Retrospective – can we improve how we’re delivering? The first derivative of delivery.
1. Review – can we improve how we’re improving on delivery? The second derivative of delivery.
Over the course of my career I’ve delivered many projects using a traditional “up-front architecture” approach, heavy on Gantt charts and light on prototypes. This was necessary when procuring servers (and the data centre space for them to live in) took months if not years. The cost of change for the infrastructure was massive; similar to changing the foundations beneath a skyscraper. The architecture needed to be right, and IT processes evolved around these assumptions.
The big change in our industry has been the increased agility of infrastructure. A six-month server procurement has become a thirty-millisecond container-starting API call. Infrastructure is now defined as code. The incredibly low cost of change means we can now use a process that embraces, rather than resists, change – and that process is Extreme Programming (XP).
Six Sigma is a fantastic methodology for eliminating defects and minimising deviation. Six Sigma is easy to implement with software: cp -ra. It’s less relevant when developing and running new services, where learning is more important.
Scrum and XP are both agile methodologies although I believe XP has a greater focus on learning. XP embraces change during sprints/iterations, Scrum favours commitments. XP is a more holistic process, recommending engineering practices such as TDD and pair-programming. As mentioned above, I see value in forming explicit opinions on these practices.
Tracker-based planning is actually a software-development-domain specific implementation of Kanban. Blur your eyes and you can see the swim lanes. Achieving and maintaining stable delivery flow is very important.
As we’re building and deploying new systems and services we need to embrace learning as we make progress. We’re helping our clients to learn – and we’re using a process that facilitates learning and embraces change based on that learning. Between competing organisations the organisation that can learn fastest, and channel its efforts in the right direction, will win. We learn through feedback loops such as PDCA, OODA – both developments of scientific method.
The focus of XP is learning. Learn about what you’re building so you can adjust your plans as you find out what’s good and what’s bad about what you’re building. Learn about how you’re building it; what’s working and what needs adjustment. Use emergent velocity for data-driven planning.
There are a number of patterns we’re using to support our chosen methodology. Here’s a small selection:
This information begs an obvious question: if this is CloudCredo’s secret sauce, why would I blog about it publicly? This again comes back to XP, and two of XP’s core values – courage and communication. I have the courage to lay CloudCredo’s approach bare for all to see. If we’re not working in the right way, or there’s flaws, I hope they will be pointed out so we can improve. We’re continuously trying to improve. Please get in contact if you’ve got any feedback.
If you’ve been thoroughly confused by the UK Garage links at the end of all the Twelve Blogs of Christmas, well, maybe it’s a London thing.
This is likely to be the most nonsensical of my thoroughly nonsensical Twelve Blogs of Christmas. I wanted to capture the recurring themes of the processes I’ve employed when bringing software to life; from inception to massive scale. Everything here should be overridden by domain-specific knowledge/concerns as appropriate. Or ignored altogether :)
If you’ve had an idea for something, it’s probably already been done. Go and have a look. If your idea hasn’t been done it’s probably quite likely you can reskin/resell/repackage/repurpose a current service to deliver the new service you’re hoping people will find valuable.
If your idea hasn’t been developed before, look for two or more currently running services you can combine to create the value. Use SaaS. If you have to write software, host it in a PaaS that does as much heavy lifting as possible. Don’t start with infrastructure and build upwards. You don’t have time, and you’ve got better things to do.
Get meaningful feedback about your service as soon as possible. Your idea is probably stupid, and you should stop (or pivot). That’s the harsh reality of the situation. Continue getting feedback about your decisions as fast as possible and as frequently as possible. Your ability to gain and act on feedback will be the key determinant of the success or failure of whatever you’re building. Read ‘The Lean Startup’. Cucumber-value is a rubbish piece of software, but should make you think about how you can quantify and measure real metrics for success.
Business services are owned by small teams. Teams own one or more (micro?) services. Use Domain-Driven Design as a guide for breaking large problems into smaller services. Don’t overly engineer towards microservices, let them emerge, you can break them down further at a later date. Don’t be afraid to sacrifice your architecture if it needs replacing; you’re learning. Hire people that learn fast; they can gain knowledge from others in the team. Use ‘The Pivotal Way’ – lightweight XP. Initially create a Git repository per service; shared code will emerge into new repositories. Use GitHub or BitBucket – you’re not a Git hosting service (unless that’s your idea, and if it is, it’s already been done).
Deploy two (or more) Cloud Foundry instances per (micro)service. Use run.pivotal.io, Anynines, Fjord IT, AppFog, Bluemix, or any other hosted Cloud Foundry (not your own). Write and deploy your (micro)services quickly, in languages you can develop quickly in: we like Go and Ruby. If you need to develop in a less agile language as requirements emerge you can do that later. Don’t be afraid to use a range of languages; the cost of doing this is mitigated by the PaaS abstraction. Kill your (micro)services and create new ones as necessary.
Begin with a test for ‘Hello World’, an app that outputs ‘Hello World’ in your chosen language/framework, and a CI server that can run your test then deploy the working software to Cloud Foundry. Iterate from there. Use Travis or Circle until you need your own Jenkins/GoCD/Concourse solution.
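As a sketch of that starting point – assuming Go, and treating the greeting function, the route, and the port handling as illustrative choices rather than prescriptions – the walking skeleton might look like:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// greeting is the behaviour under test: the project's first assertion
// is simply that this returns "Hello World".
func greeting() string {
	return "Hello World"
}

func main() {
	fmt.Println(greeting())
	// Cloud Foundry injects the listen port via $PORT; when it's set,
	// serve the greeting over HTTP. When it isn't (e.g. in CI), just
	// print and exit.
	if port := os.Getenv("PORT"); port != "" {
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, greeting())
		})
		http.ListenAndServe(":"+port, nil)
	}
}
```

The point isn’t the code – it’s that the test, the app, and the pipeline to Cloud Foundry all exist on day one, so every subsequent story lands on working rails.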
Log when things go wrong. Measure the latency (time from request to response) when things go right. Use Papertrail for the logs. Use New Relic for the metrics. Replace Papertrail with Logsearch and New Relic with Graphite when you need to.
Use JSON to communicate between your (micro)services. Use JSON templates as lightweight contracts between (micro)services. Use HTTPS with circuit breakers for synchronous communication. Use Redis or RabbitMQ for asynchronous communication. If your code can’t get to Redis/RabbitMQ/another HTTP endpoint – die quickly rather than tying up resources.
Start with JSON in Redis. Move to JSON in MongoDB when you need to query the data with more flexibility. Place your BLOBs in AWS S3. Once you know whether your data – per (micro)service – needs to be consistent or available you can change data store as required. Cassandra is a great available data store. Postgres is a great consistent data store (if you can’t afford Oracle RAC, which you can’t). If you’re generating huge quantities of events throw them into Hadoop.
Generate events – immutable statements of fact – based upon actions. Simple, immutable, repeatable JSON events are a great start. Store the events in an event log and use them to mutate the data stores powering consuming services – separating commands from queries.
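A minimal sketch of that command/query separation in Go – the event names and the balance projection are hypothetical, chosen only to show the log being folded into a read model:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is an immutable statement of fact. The append-only log of
// events is the source of truth; read models are derived from it.
type Event struct {
	Type   string `json:"type"`
	Amount int    `json:"amount"`
}

// replay folds the event log into a current balance: the query side
// is just a projection over the command side's history, and can be
// rebuilt from scratch at any time.
func replay(log []Event) int {
	balance := 0
	for _, e := range log {
		switch e.Type {
		case "credited":
			balance += e.Amount
		case "debited":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	events := []Event{{"credited", 100}, {"debited", 30}}
	b, _ := json.Marshal(events)
	fmt.Println(string(b), "balance:", replay(events))
}
```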
Customising Cloud Foundry is remarkably easy; do it when requirements demand. Adding buildpacks, services, even stacks (I added Docker) is straightforward. Don’t start with Cloud Foundry customisation but don’t be afraid to take the step if you need to. Deploy to your own IaaS or even your own metal if security/regulation/performance concerns necessitate it.
In summary: get fast feedback about the software you’re bringing to life. Fail fast if you’re doing the wrong thing. Don’t over-engineer; do what works, deliver quickly. Realise the opportunity cost of your (and your team’s) time. Could you be building something more valuable instead?
Platform-as-a-Service (PaaS) isn’t perfect. There are always going to be some things it does well and some things it does badly. This post takes a look at some of the things it does badly, and how we can make improvements in the future.
This excellent article highlights some key points related to problems with PaaS. I absolutely agree that the data service journey with current implementations is painful. I’ve written about CloudCredo’s plans for “Service Foundry”; this is work-in-progress and needs to be an area of greater focus. Stateful services need to become first-class citizens in the PaaS landscape.
I also agree that maintaining PaaSs is currently too difficult. I think this is a symptom of two separate issues. Firstly, the current crop of configuration management tools (Chef/Puppet/Salt/Ansible etc.) are not fit for the purpose of deploying and maintaining distributed systems, such as PaaSs. Secondly, BOSH is the right tool for the job but currently has a difficult user journey. CloudCredo have invested time and effort attempting to make BOSH easier to consume – but again it’s another work-in-progress. We need to get better at making PaaSs easier to operate, maintain, and upgrade.
Another good blog post highlights how networking concerns can block PaaS adoption. Since the writing of that post a couple of advancements have been made. There is now an easily consumable BOSH release to enable encrypted network traffic of any BOSH-deployed service – although we should always question how secure our encryption is. There is also user-configurable networking inside Cloud Foundry. I believe these additions go a long way towards mitigating user concerns, but I’d certainly be interested in further feedback related to PaaS networking.
The greatest strength of PaaS is that it’s a black box for running your applications; it allows developers to focus on delivering value rather than operating a platform. The greatest weakness of PaaS is that it’s a black box for running your applications; when things go wrong it can be difficult to work out what’s happening. If your application is performing poorly on Heroku, what do you do next? Spend more money and hope? Cloud Foundry’s new Firehose generates huge volumes of information but can prove difficult to consume for PaaS novices. Buildpack integration with monitoring systems is clearly helpful but we could still make enhancements in this area.
PaaS will only improve if we identify and expose the flaws. We need more users, more critiques, more real-world scenarios. Please get in contact if there’s any burning issues blocking your adoption of PaaS.
As I’ve mentioned a few times, Cloud Foundry has won the Platform-as-a-Service (PaaS) war for stateless 12-Factor applications. Stateless PaaS will now be a Cloud Foundry distro war. We all know mutable state is the root of all evil so I’ve been thinking about a stateful CF-like PaaS since I first interacted with Cloud Foundry. In fact, Cloud Foundry version 1 had an interesting service integration, which served to highlight the gaps in the PaaS journey for stateful application developers. I tweeted a while back about some of the work we’re doing in this area; this post will expand on our plans for ‘Service Foundry’ – a PaaS for stateful deployments.
It’s interesting to consider that you could deploy Cloud Foundry itself in such a system. Does this actually look like a manifesto for BOSH version 2? Or does this look more like OpenShift v3? The Kubernetes overlap certainly suggests something similar to RedHat’s latest IaaS+ effort.
I’ve given talks suggesting that microservices and PaaS are the future of application development. Cloud Foundry makes it incredibly easy to be tall enough for stateless microservices. I want users to have a similarly easy journey for the state powering their microservices.
If you’d like to help with this effort please do get in contact with CloudCredo.
I enjoy attending developer-focused conferences – in particular QCon and GOTO – to talk to the real consumers of PaaS. I also attend infrastructure-orientated conferences as a consumer of IaaS. I was recently on the PaaS panel at the Apache CloudStack Conference when I made what seemed a bizarre statement to many in the audience: “If your problem isn’t mutable state then you’re probably doing something wrong.” I’m taking this opportunity to explain what I meant.
I’ve actually been making statements in this vein for a long time. In my experience, when delivering solutions, you can usually find a correct answer to your problem, given time to research and engineer sufficiently. The majority of optimisation problems have occurred when I’ve been facing issues related to mutable state. These generally involve functions related to latency, consistency, availability and performance. CAP theorem is perhaps the most frequently occurring: choose availability or consistency in the face of network partitions in a distributed system. You cannot have both.
Approaches to mutable state vary across programming paradigms. Object-orientated programming favours encapsulating mutable state in objects, usually controlling concurrent access to the mutable data via mutexes or semaphores. In my experience this can lead to serious performance issues when scaling. Functional programming favours having no mutable state, instead passing immutable values between functions. This causes different issues, as it shifts the burden to the developer to model their data – in what can sometimes be an unnatural way.
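The contrast can be sketched in Go – a hypothetical mutex-guarded counter on one side, and a pure function that returns a new value instead of mutating anything on the other:

```go
package main

import (
	"fmt"
	"sync"
)

// counter encapsulates mutable state behind a mutex – the classic
// object-orientated approach. Under heavy concurrency, contention on
// this lock is exactly where the scaling pain shows up.
type counter struct {
	mu sync.Mutex
	n  int
}

func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// add is the functional alternative: no shared state is mutated,
// the caller just receives a new value.
func add(n, delta int) int { return n + delta }

func main() {
	c := &counter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.inc() }()
	}
	wg.Wait()
	fmt.Println(c.n, add(0, 100))
}
```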
Mutable state in Infrastructure-as-a-Service can also cause issues. VMWare’s various infrastructure offerings favour a consistent view of the infrastructure landscape, often using Oracle or MSSQL as an ACID-compliant database for consumers to rely upon. This makes the API easy to consume but difficult to scale to huge levels for producers. Consumers can rely on their mutations being immediately reflected across the API.
AWS EC2 offers an eventually consistent view of their landscape. This means EC2 can be delivered at a truly unprecedented scale but can cause issues for platform-level consumers. Intelligence must be built into the tooling that consumes the API, as large consumers of EC2, such as Netflix, have done.
At a Platform-as-a-Service(PaaS) level we have tended to deal with the issue of mutable state by abdicating it to external services. Stateless application hosting is a done deal – Cloud Foundry has won that battle. Stateless application developers can choose to locate their data in whichever external service best suits the data; this is the polyglot approach embraced by PaaS. The real issues arise when we attempt to provide a Cloud-Foundry-like journey for developers and operators around stateful services.
An increasing number of PaaS developers appear to have become preoccupied with scheduling. Scheduler debates are so hot right now. No PaaS conference would be complete without an Omega-style optimistic scheduler versus Mesos-style pessimistic scheduler conversation. In my experience scheduling performance is rarely the constraint blocking PaaS adoption: the constraint is usually dealing with the necessary mutable state generated/consumed by the applications running in the PaaS.
RedHat recently stated that dividing stateful and stateless services “is an arbitrary distinction”. I don’t agree with this perspective; I think it’s a very important distinction. There are some key issues here – if an application container appears to have stopped functioning, what action should the PaaS take? If the container is stateless the PaaS can request a new container be started: potentially creating a duplicate. If the container is stateful the choice is more complicated. Should a new container be started, maximising service availability, but risking split-brain scenarios in a stateful environment? Should the container remain offline, reducing availability but ensuring data consistency? Should the container be restarted following a successful fencing operation? What does fencing look like in a distributed, scheduled, containerised PaaS environment? These questions may lead to a separate breed of stateful PaaSs emerging, focused on stateful concerns, even if they share similar-looking APIs.
The “Two PaaS” debate has led to some people perceiving Cloud Foundry as a stateless PaaS and BOSH as a stateful PaaS. I believe this is an incorrect interpretation. I think BOSH is a fantastic system – it has had a greater influence on me than Cloud Foundry itself – but it is not a PaaS. BOSH is the purest embodiment of the principles of Infrastructure as Code: it fires up metal, lays down code, and attaches disks for state. It is not, in its current form, a scheduler in the manner of Mesos/Omega. BOSH is the world’s best deployer of schedulers and distributed systems. A stateful PaaS would look more like Diego/Lattice, modified to address the concerns above. BOSH would be a great way to deploy this new PaaS.
CloudCredo’s clients often ask about how to run Cloud Foundry across multiple sites for performance and resilience. This is a natural question to ask: as an application developer, if I specify that I need to have ten instances of my application available, I expect ten instances to be available – whether the underlying infrastructure is available or not. The Platform-as-a-Service (PaaS) abstraction implies that my application should continue to run as desired, and that the PaaS should be designed to handle failures in the infrastructure layer.
This idea sounds great in theory but can lead to some problems in practice. What should Cloud Foundry do, as a distributed system, in the event of a network partition? Should each partition converge on the desired state of the whole system, leading to twice the number of applications being online, and potential split-brain issues with singleton applications? Should the whole Cloud Foundry shut down to ensure application consistency? The key point to note here, which is not immediately obvious, is that whilst Cloud Foundry hosts stateless applications, it is actually a stateful system itself. This means CAP theorem must be obeyed; we can opt for consistency or availability in our Cloud Foundry deployment.
Cloud Foundry has, at its core, a Cloud Controller database maintaining the desired state of the system. If we maintain a single, consistent state of this database we will be running a CP (consistent, partition-tolerant) Cloud Foundry. If we allow multiple, divergent copies of this database we will be running an AP (available, partition-tolerant) Cloud Foundry.
Cloud Foundry’s current engineering direction seems to favour a consistent view of the desired application state. This is exhibited by the choice of Raft as a consensus algorithm, and Etcd as an implementation, for CF’s next generation Diego components. These choices pose some difficult questions about how the platform should behave during network partitions.
Running a consistent Cloud Foundry makes application deployment and management very easy; there is a single API endpoint for management. At CloudCredo we deploy consistent Cloud Foundry installations across multiple availability zones within a single region. This mitigates the latency concerns and provides a convenient management structure for application deployment.
When I’ve needed to deploy Cloud Foundry across multiple regions it has usually been to provide for a very high availability service level. To provide for this I have deployed multiple, completely separate Cloud Foundry installations, often on heterogeneous infrastructure providers, in diverse regions. An example of this kind of installation would be the donations platform for Comic Relief. The huge benefit of this strategy is that no outages in any individual Cloud Foundry can stop the service – availability is maximised. The two major downsides are that deployment and orchestration of applications becomes significantly more complex, and that the applications and data services need to be developed to handle being deployed in a distributed system of this nature.
I’m currently deploying multi-zone ‘consistent CF’ – and then multi-region ‘available CF’ if requirements demand it. This is a domain specific choice; and it takes a significant amount of work in the application (particularly around state) to bring multi-region availability.
Cloud Foundry has won the PaaS war. Heroku blazed the trail with their PaaS for Twelve-Factor Apps, and VMWare/Pivotal have built an open-source implementation and ecosystem to provide both enterprises and developers with an obvious choice. Buildpacks have become the standard for translation from application code to runnable unit.
The question is rapidly changing from “which PaaS are you using?” to “which distribution of Cloud Foundry are you using?”. Just as Hadoop became synonymous with big data so Cloud Foundry has become PaaS. The emerging battle is between the distribution vendors; Pivotal, IBM, HP and ActiveState already have offerings in the market. I’m sure we’ll see more from other players as the Cloud Foundry Foundation gains momentum.
The other fascinating aspect of the development of the Cloud Foundry ecosystem is ‘Platform as a Service’ versus ‘Platform as a Service as a Product’ (PaaSaaP). Some vendors are offering Cloud Foundry as installable, supported software – where the onus is on the customer to deploy the software to their chosen infrastructure in order to provide a service. Other vendors are deploying and running Cloud Foundry on behalf of their users to provide a true PaaS experience. A few vendors are offering both. Some PaaS purists have denigrated Cloud Foundry for offering this flexibility, but I see this as one of Cloud Foundry’s greatest strengths. Developers can minimise Time-to-Value by deploying quickly to a vendor’s cloud-based solution – and then deploy Cloud Foundry to their own infrastructure when non-functional requirements emerge to make a custom deployment necessary.
We will also see domain-specific Cloud Foundry implementations for particular markets. The ‘core’ Cloud Foundry specification, provided by the CF Foundation, will define a key set of capabilities and an API for developers to work against – but we will see extensions providing for additional requirements and innovation. Time will tell which flavours of Cloud Foundry are successful and which are left behind.
The big paradigm shift for developers using Platform-as-a-Service (PaaS) is understanding “The Twelve-Factor App”; a set of patterns developed by the team at Heroku enabling applications to be orchestrated and distributed at scale. By adopting these patterns developers can take advantage of PaaS, via services such as Heroku and Cloud Foundry, reducing their operational responsibilities so they can focus on delivering value.
I’ve had a long-running debate with various members of the PaaS community about 12-Factor’s relevance to enterprises. I’ve heard many claims that enterprises don’t want to adopt these patterns, and would rather mix their state, config, and application together as a tangled ball of mud. At a board level the enterprises I’ve worked for, and interacted with, have been seeking organisational agility – the kind of fast delivery and iteration PaaS brings to software development. Somehow this message often gets lost in middle management, leading to a resistance to change and clinging to legacy practices like a safety blanket.
We need to stop telling enterprises that modern patterns, such as 12 Factor and Microservices, won’t work for them. We need to help enterprises to lower their time-to-value and increase their operational efficiency. The only winners from keeping enterprises stuck in the dark ages are the incumbent vendors, happy to continue charging extortionate prices for outdated systems and software.
Enterprises can adopt 12 Factor, Microservices, and PaaS. They want to be more agile, not less. Let’s help them.
At a simple level, Infrastructure-as-a-Service deals with virtual machines, Platform-as-a-Service deals with applications, and Software-as-a-Service deals with users. I’ve spoken a few times about Containers-as-a-Service (CaaS) – the idea that containers will become a new meaningful unit of currency in cloud computing. I’ve also written about the ramifications of CaaS for PaaS.
The hottest debate in the container ecosystem seems to be whether to use containers as single processes or multi-process virtual machines. The Docker folks seem to encourage the use of single process containers – a view I’m supportive of, following principles from software development, such as single responsibility and inversion of control. The alternative approach is to use containers as virtual machines with many processes. The virtual machine approach seems to resonate well with the current generation of operating systems and configuration management tooling.
I believe both approaches will continue to thrive for some time. The true value in Docker lies in the packaging and portability of containers, and this holds true whether you’re packaging filesystems for a single process or a number of processes. Containers used as virtual machines will quickly erode the IaaS market – if only for their portability, speed, and density. Consumers and producers can gain benefits when choosing the CaaS abstraction over IaaS.
Over time there will be a move towards per-process containers, away from virtual machine-like containers, as the deployment and orchestration benefits from this approach become clearer. Just as SOLID and TDD took time to gain momentum in software development, so the correct patterns for containers will gain traction over time.
We will also see increasing numbers of customised operating systems, designed to run with the Linux kernel but providing a subset of the functionality, tuned for single-purpose containers. OSv is a great example of this. I also think we’re likely to see an increasing number of container host OSs under development, presenting something that looks like a Linux kernel to hosted containers. The boundaries between hypervisor and namespaced kernel will blur.
The other main areas for innovation will be storage and networking. Providing containers with reliable storage for mutable state is challenging. The ClusterHQ folks have made great progress in this area using ZFS-on-Linux in Flocker. There are also exciting developments in the networking space from SocketPlane and Weave – simplifying multi-host container networking.
CaaS solutions have already emerged from the big cloud players; you can easily host containers on GCE, AWS, and Azure. Rancher looks to enable CaaS on any infrastructure. Pivotal’s Lattice enables operators to deliver their own CaaS, as does its big brother Cloud Foundry. These solutions are currently largely focused towards Docker but I’m sure we’ll see Rocket and alternatives emerge.
Whilst the future may look bright for containers, there are some warning signs. Multi-tenant container isolation and security are currently far from production ready. It also seems that choosing a layered filesystem beneath your containers can be a risky business; I’ve been bitten by a few bugs.
I think CaaS will grow exponentially over the next few years. Tutum are building the first pure CaaS I’ve seen. They attracted $2.65M in seed funding this year. I’m sure they’ll have plenty of competition in the near future.
I enjoy regularly visiting and speaking at tech conferences. They’re a great way to stay abreast of trends and keep a fresh perspective on how the industry is delivering value. Unfortunately I’m usually dismayed at the homogeneous composition of the conference attendees; the overwhelming majority are white males, like myself.
As a company CEO I see this as a tremendous waste. The key role I play in the company is the recruitment and retention of the members of our great team. Enlarging the pool of resource I can draw from will only increase the quality of the team, the diversity of opinions, and create a balanced working environment. I’m not trying to get on my moral high horse; I want a better team so I can make more money.
If you’re interested in PaaS, distributed systems, and Extreme Programming, please do contact CloudCredo – especially if you don’t ‘fit the mould’. We’re looking for fast learners and good communicators. I’d love to think there’s a pool of untapped skills out there I can use to help build our team. I can’t promise to run the perfect company, but I can promise to try.
I wrote this blog back in June but didn’t have time to publish it. No time like the present.
Now the dust has settled following the CF Summit I thought it a good time to note some of the lasting impressions.
CF Summit was a great conference. The buzz and energy around the place was tangible. This year’s host, Andrew Clay Shafer, gave the event a friendly, personable feel, and also has (had) great hair. The big hitters from the Cloud Foundry community were present, along with newer users from a diverse range of organisations.
Personally, I felt the Summit represented the emergence of Cloud Foundry as the leading enterprise PaaS. I backed Cloud Foundry from day one, and I’ve been vocal about the shortcomings of other PaaSes that fall short of CF. Witnessing some of the biggest names in tech queuing up to take the stage to endorse Cloud Foundry was immensely gratifying. The corollary to those endorsements was the quantity of real world Cloud Foundry success stories in the talks; Cloud Foundry has broken through, and proven its worth.
One of the chief reasons for the dramatic rise in enterprise adoption of Cloud Foundry has been the formation of the CF Foundation. The Foundation members were well represented at CF Summit, from small innovators such as CloudCredo, to tech giants such as IBM and Cisco. It’s fantastic to see the Cloud Foundry ecosystem crystallise as a Foundation to move the project forward.
The noticeable trend amongst the Summit attendees was big businesses realising the importance of speed. Canopy are a shining example of this; formed from large enterprises but able to quickly and effectively make use of outstanding new technologies. I fear for their competition; it’s striking when organisations decisively deliver capability at this scale.
We were overjoyed with the Summit from a CloudCredo perspective. The ecosystem’s response to our work with them was fantastic. We are proud our work was mentioned in the following talks:
We’d love to be able to add your organisation’s name to that list. Please get in contact to talk about how we can help you deliver with Cloud Foundry.
I’ve got a backlog of half-formed blog posts (backblog?) that have been hanging around, like bad smells, for a while. I’ve decided to get them all done and out over Christmas. I’m cheating by imposing the following restrictions:
I’ve noticed the vast majority of my traffic is from the USA. As part of my single-handed mission to bring back UK Garage, and further its influence in the world, I will be linking to some of the UKG greats at the end of each blog post.
Let the blogging commence!
As you’re probably aware by now – CoreOS have released a new container runtime called ‘Rocket’. Rocket will inevitably be perceived as a reaction, and competitor, to Docker. The language in CoreOS’s Rocket blog post suggests Docker have deviated from CoreOS’s preferred path, and the tone suggests an accusation of betrayal. The timing of CoreOS’s announcement, the week before DockerCon EU, implies a deliberate attempt to undermine Docker’s marketing.
GigaOM have published a great summary of the community’s reaction here. Given Docker’s amazing ecosystem and adoption levels I’m actually quite surprised more people haven’t come out in strong support of Docker. Solomon Hykes has exactly three things to say (followed by a subsequent thirteen) here.
I spoke at DockerCon 2014 about the Cloud Foundry/Docker integration project I’d been working on. This work has now been subsumed into the Docker on Diego work which is beginning to form the core of Cloud Foundry V3. While working on the initial proof of concept I had regular communication with the people at Docker. I found them friendly and open to the idea of using Docker containers as another deployable unit within Cloud Foundry.
Once Docker raised their $40M investment the tone changed. Docker became a ‘platform’. Docker’s collaboration with Cloud Foundry, to use Docker inside Cloud Foundry, seemed to stall. It appeared Docker were trying to eat their ecosystem. Was investor pressure for a huge return causing Docker to try to capture too many markets rather than focusing on their core?
Cloud Foundry will continue to orchestrate various units of currency: applications via buildpacks, containers via Docker, and potentially containers via Rocket. My company, CloudCredo, is already looking at what a Rocket integration with Cloud Foundry would look like.

cf push https://storage-mirror.example.com/webapp-1.0.0.aci

is on the way.
The role of infrastructure should be to provide reliable, consumable bricks to enable innovation at higher levels. If we create beautiful, unique, novel bricks it becomes impossible to build houses. This is a problem I see regularly with OpenStack deployments; they’re amazing, wonderful, unique, organic. I cannot, however, easily deploy platforms to them.
This issue manifests itself in deployments orchestrated by configuration management.
Configuration management is too focused on innovation at the server level rather than thinking about the entire system. Devops has become a silo.
There are some tools and patterns emerging to tackle these problems.
We need SOLID for infrastructure. We need to develop standardised, commoditised, loosely coupled, single-responsibility components from which we can build higher-order systems and services. Only then will we be enabling innovation higher up the value chain.
Devops should be about enabling the business to deliver effectively. Instead, we’ve got stuck up our own arses with configuration management.
Having played with Heroku, and being dismayed at not being able to play with my own Heroku, I was overjoyed when VMware released Cloud Foundry. I created a Vagrant box on the Cloud Foundry release day and began distributing it to clients. I worked with one of my clients at the time, OpenCredo, to develop and deploy one of their new services to a Cloud Foundry installation I created. I believe this was the first SLA-led production deployment of Cloud Foundry globally.
I spoke about the rationale behind running/developing your own PaaS at QCon London 2012. I also discussed some of the use cases I’d fulfilled using PaaS, including OpenCredo’s.
QCon 2012 – Lessons Learned Deploying PaaS
I was similarly happy when I heard RedHat had bought Makara, a product I’d briefly experimented with, and were looking at producing their own PaaS. I’ve used RedHat-based systems for many years with great success, have always found YUM/RPM great to use, and was apparently the 7th person in the UK to achieve RedHat Architect status. A RedHat-delivered PaaS would surely be the panacea for all my problems.
I was scoping a large project at this time with availability as a prime concern. It occurred to me that I could use two turnkey PaaSes simultaneously, Cloud Foundry and OpenShift, such that if there was an issue with either I could simply direct all traffic to the other PaaS. I discussed the deployment and progress at DevopsDays.
DevopsDays Rome 2012 – How I Learned to Stop Worrying and Love the PaaS – I start about 32 minutes in.
Unfortunately, the project didn’t quite work out as planned. We had a number of issues with OpenShift which meant we had no choice but to withdraw it from production usage. Scalability was an enormous problem: bringing an application to production scale was a sub-twenty-second operation in Cloud Foundry, but took forty-eight hours plus in OpenShift. We had to write our own deployment and orchestration layer for OpenShift based on Chef and shell – Cloud Foundry has the fantastic BOSH tool enabling deployment, scaling, and upgrades. These reasons, alongside some nasty bugs and outages, meant we were unable to use OpenShift for our deployment.
Beyond this I feel OpenShift, like many ‘PaaSish’ systems, has got the focus wrong. There seems to be a plethora of container-orchestration systems being produced at the moment, which are really just a slight re-focus on the IaaS abstraction layer. OpenShift is in danger of falling into this trap. PaaS needs to remain focused on the application as the unit of currency, and not the container or virtual machine. It looks entirely possible (and would make an interesting project) to run Cloud Foundry inside OpenShift, illustrating the conceptual difference.
We settled on using distributed Cloud Foundry instances across diverse IaaS providers to deliver the project; it was a great success. I blogged about it for Cloud Foundry’s blog.
CloudFoundry.com blog post – UK Charity Raises Record Donations Powered by Cloud Foundry
I’ve remained supportive of RedHat’s efforts to deliver a PaaS solution but fear they’re not quite there yet. I organised a London PaaS User Group meetup to help RedHat to put their side of the case across to the London community. It sounded like they had some exciting developments in the pipeline but even with the new features it’s likely we would have been unable to deliver the enterprise-grade services our project required.
Perhaps RedHat’s customer base are largely systems administrators rather than developers. Perhaps RedHat have more experience at deploying and managing servers than applications. For whatever reason, I think it would be a denigration of PaaS to allow it to be misconstrued as container-based IaaS. Containers can be a by-product of service provision but should not be the focus of PaaS.
Cloud Foundry isn’t perfect but, at the moment, is the only PaaS product I’d recommend to anyone looking to make a long-term investment.
I’ve always aimed to make system builds reproducible, but with little success. Gem, pear, pecl, rpm, license agreements, configure/make/make install: they all take their toll. This can lead to inconsistent builds between environments – or even in a single tier – due to scaling up/down.
As I’ve tended to use RPM-based systems (misspent youth), I’ve attempted, wherever possible, to get all non-configuration files on a server into RPMs. I’ve been more promiscuous with configuration management, moving from home-grown tooling to Cfengine, via Puppet, to Chef. I’m currently using chef-solo, with tooling such as Noah and MCollective for orchestration. Don’t even mention the number of deployment/ALM tooling solutions I’ve been through (although Capistrano has never annoyed me to any great extent).
Even with long term usage of RPMs, build reproducibility has been far from simple. RH Satellite/Spacewalk should make this easy, but unfortunately it’s a bloated mess. I’ve usually resorted to simple apache/createrepo, but this poses its own problems. Do you have a repo per environment? How do you track which servers were built against which repo? How do you roll out updates in a manageable fashion?
I’ve created a simple setup called Yumtags! to address some of these issues. The basic idea is that you can drop RPMs into a directory, and then “freeze” the directory at that point in time by creating and storing repository metadata against a tag. This tag can then be used, perhaps in a chef-solo-driven repository definition, to update, build, and reproduce systems in a known state. It currently features simple JSON-driven integration for CI systems, so RPM-based integration pipelines can be easily automated. There are a million and one things missing from it, but now that it does the basic story, I’ve shared it for others to hack on.
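The freeze-a-directory-against-a-tag idea can be sketched in a few lines of Python. This is a hypothetical illustration rather than Yumtags itself: the function name, the JSON layout, and the `tags` directory are my assumptions, and the real tool also generates yum repository metadata (e.g. via createrepo) for the frozen set.

```python
import hashlib
import json
import os
import time


def freeze_repo(repo_dir, tag, tags_dir="tags"):
    """Snapshot the RPMs in repo_dir under a named tag.

    Records each package's filename and checksum, so the exact set
    can be reproduced later regardless of what gets dropped into the
    directory afterwards.
    """
    packages = {}
    for name in sorted(os.listdir(repo_dir)):
        if not name.endswith(".rpm"):
            continue  # only package files belong in the snapshot
        with open(os.path.join(repo_dir, name), "rb") as f:
            packages[name] = hashlib.sha256(f.read()).hexdigest()
    snapshot = {"tag": tag, "created": time.time(), "packages": packages}
    # Persist the snapshot as JSON so CI systems can fetch it.
    os.makedirs(tags_dir, exist_ok=True)
    with open(os.path.join(tags_dir, tag + ".json"), "w") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot
```

A chef-solo-driven repository definition could then resolve a tag name to the exact package set recorded in the JSON, so every node built against, say, `release-1` sees the same RPMs however the directory has changed since.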
Monitoring sucks.
Following the “if it hurts, do it more often” mantra that has driven the success of patterns such as Continuous Delivery, there might be some value in jumping head-first into the world of monitoring.
I’ve been evangelising to anyone that will listen, and a lot of people that won’t, about declarative/convergent frameworks for some time now. Sometimes you have to describe how to converge (definitions may not yet exist), so I particularly enjoy working with frameworks such as Chef that enable you to easily move from declarative to imperative as the need arises.
These two trains of thought (monitoring and system convergence) collided a while back to make me think about Monitoring-Driven Operations: we declare the status of the monitoring systems (that servers and services are available during expected hours) and converge the environment(s) on this desired state whenever there’s a gap between observed and desired states. MDO for operations, TDD for developers.
It turned out I wasn’t alone in thinking about monitoring from this perspective.
Reusing the application behaviours in production is a natural, logical extension of the testing pipeline. Were we able to ensure that the behaviours include non-functional requirements, would it be possible to use these monitored behaviours to converge an environment towards a state that passes all scenarios?
The missing link here is the imperative element. We can declare the desired state for an environment (cucumber-nagios is a good start), however we need a framework to express how we get there. I think Chef/Puppet can help with a lot of the hard labour here, but I don’t think that either, in their current formats, are appropriate to converge a service from a failed monitoring check.
Cloud Foundry (brilliant) uses a health manager, message bus, and cloud controller interacting to accomplish something similar. In this situation the cloud controller knows how to converge the state of the environment when the health manager observes a disparity between observed and desired states.
I’m thinking about developing a system that works in the following way (“The Escalator”). Please get in contact if you’ve got any feedback.
As examples:
– If there is nothing in place (a failed IaaS provider or a new system), an entire new environment is built from scratch (escalation steps 1 through 6).
– If a node fails, a new node is built, converged, and deployed to (steps 4 through 6).
– If the application fails the smoke test, the application is redeployed (step 6).
Obviously it is up to Barry to ensure his escalation steps are sensible, i.e. use multiple IaaS providers, redeploy previous versions of applications if they persistently fail smoke tests. Each escalation step should declare a quiescence period, during which no further actions will be taken. There’s no point attempting to deploy an application if you have no nodes.
If an escalation process fails, the monitoring system could attempt a re-convergence if necessary or contact an administrator.
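The Escalator’s ladder of remediation can be sketched as a simple convergence routine: find the first failed step, then run every step from there to the end, respecting each step’s quiescence period. Everything here is an assumption for illustration – the step names, quiescence values, and `converge` signature are invented, not taken from any existing tool.

```python
import time

# Hypothetical escalation ladder: each entry is (step name, quiescence
# period in seconds). Earlier steps repair larger failures; running a
# step implies running every step after it too.
ESCALATION_STEPS = [
    ("provision_iaas", 300),  # 1. stand up infrastructure
    ("build_network", 120),   # 2. networking
    ("build_storage", 120),   # 3. storage
    ("build_node", 60),       # 4. build a node
    ("converge_node", 60),    # 5. converge node configuration
    ("deploy_app", 30),       # 6. deploy the application
]


def converge(first_failed_step, actions, sleep=time.sleep):
    """Run every escalation step from the first failure onwards.

    `actions` maps step names to callables that perform the step.
    After each step we observe its quiescence period, during which
    no further remediation is attempted.
    """
    ran = []
    started = False
    for name, quiescence in ESCALATION_STEPS:
        if name == first_failed_step:
            started = True
        if started:
            actions[name]()
            ran.append(name)
            sleep(quiescence)
    return ran
```

With this shape, a failed smoke test maps to `converge("deploy_app", actions)`, while a dead node maps to `converge("build_node", actions)` – which also re-runs the converge and deploy steps, matching the escalation examples above.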
This overlaps significantly with a couple of discussions: here and here at the recent Opscode Community Summit, so perhaps someone else is already creating something similar with Chef.
Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
As I’m from an operations background, I’ve always had great trouble communicating with the customer over requirements relevant to my domain. The usual situation is that the “customer”, often the product owner, views the working (i.e. functionally complete) software on a developer’s machine; bringing this to an operational service level is an afterthought (until it goes offline and the real customers start complaining).
Traditionally, capturing these specifications has fallen under the umbrella of “non-functional requirements”. As Tom Sulston pointed out to me, this suggests requirements that aren’t working, which is precisely the opposite of what we’re attempting to express.
I’ve sought to tackle this in a couple of ways.
Spread FUD throughout the non-technical customer base.
Get the team in a room together and express “cross-cutting concerns” (I half-inched the idea from AOP) that span the project as a whole.
I haven’t been happy with the results of either approach, so I’d be interested to talk to anyone with a satisfactory solution in this space.
Having the capability to deploy frequently is important for a variety of reasons; fast feedback, quickly realising value, reducing risk deltas, increasing confidence, and so on. However, I also think that frequent deployment is useless without making use of that feedback.
Continuous Deployment is an enabler of fast feedback, but it’s not the end goal. If the feedback isn’t utilised by product owners to inform their decisions, there’s little point in creating it. The practice becomes a local optimisation.
I’ve deliberately chosen to differentiate Continuous Delivery from Continuous Deployment here, as I believe Continuous Delivery implies that value is being delivered, whereas Continuous Deployment suggests focusing on deploying frequently.
We need to optimise cycle time for the whole business, not just the dev/ops/devops/<current silo label> team.