The run book (or system operation manual) is traditionally written by the IT operations (Ops) team after software development is considered complete. However, this typically means that operability problems with the software are discovered late, because operational concerns were ignored, forgotten, or not fully addressed by the development (Dev) team.
If the software development team writes a draft run book or draft operation manual, many of the operational problems typically found during pre-live system readiness testing can be caught and corrected much earlier. Because the development team needs to collaborate with the operations team in order to define and complete the various draft run book details, the operations team also gains early insight into the new software. Channels of communication, trust, and collaboration are established between the traditionally siloed Dev and Ops teams, which can help to establish and strengthen a DevOps approach to building and running software systems.
Note: I actually agree with much of what Jeff Goldschrafe says on run books; if we rely on run books to help us actually operate a system in 2013 or later, we have likely not automated or monitored enough. The key point about run book collaboration is to get Dev and Ops talking to each other about operational features during development, rather than leaving those conversations to a failure-laden 'Production-ization' phase.
Address Operational Concerns from the Start of the Project or Programme
Many software projects or programmes tend to leave consideration of operational concerns until close to launch or 'go live'. This is often due to a software development team (or budget holder) driven primarily by end-user features; the operability of the software is considered to be a 'problem for the Ops team' to be addressed during a so-called 'Production-ization' phase (a terrible term). Even where software has undergone some level of capacity or load testing before being handed to the Operations team, crucial differences between test environments and the Production environment are often assumed to be irrelevant. Almost invariably this leads to a last-minute rush of bug fixes, hacks, and workarounds by both Dev and Ops, along with much gnashing of teeth and complaints about 'stupid developers' from Ops and 'I just want it to work' from Dev.
In extreme cases, the software may have to be substantially or entirely re-written to address operability problems, suggesting that a buffer is needed between the first operability test and the go-live date. At one software consultancy/outsourcing company I worked at, we used to insist that substantial technical testing should be conducted half-way through the project, so that we still had half the project timeline to radically change the performance characteristics of our software if needed; this strategy helped us on more than one occasion when the load characteristics of our software were unexpected.
By starting technical testing earlier in the cycle, we expose some operational problems sooner, so rather than a major problem occurring just before the launch date, we typically see several smaller, more tractable problems arising. However, the "50% timeline" approach still suffers from not treating operational concerns as first-class features alongside end-user features, and can really only give us a single chance to correct our 'best guesses' about operability. A much better approach to making our software truly operable is to consider operational features from the very start of the programme of work (or project); operational features (typically seen as 'non-functional requirements') should be included in the product backlog alongside end-user features (typically seen as 'functional requirements'). In this way, the Dev team has a better chance of understanding and addressing these crucial and often project-critical aspects before they derail or delay a software launch.
As operational concerns are identified and addressed throughout the duration of the development phase, both the Ops team and the Dev team become more confident that the software will work well in Production on the go-live date, gaining each other's trust. Dev teams have not traditionally included people with much operational experience, although this is changing as DevOps approaches demonstrate the value of greater cross-functional working methods. Whether our Dev team has embedded Ops people or not, many Dev folk have had little exposure to operational issues, and so it can be difficult for them to anticipate the software changes needed to make the software operable. This is where the draft run book can play a vital role.
The Draft Run Book as a Collaboration Tool for Dev and Ops
The 'run book' (sometimes called the 'system operation manual', or just 'operation manual') is a collection of procedures and steps for operations teams to follow (either manually, or through run book automation) in order to enable the software to run effectively in Production. A run book includes details of how the operations team should deal with things like daylight saving time changes, data cleardown, recovery from failover, server patching, troubleshooting, and so on. Historically, it was the Ops team that wrote the run book, based on chance conversations with the Dev team, sketchy documentation, and much trial and error. However, by turning around this situation and giving responsibility for the first few drafts of the run book to the Dev team, substantial improvements to the operability of the software system can result.
I have found that many developers (I include myself here) are often surprised at the 'stuff' which is needed to make software work in Production (content switch rulesets, SSL offloading, a separate management NIC, data cleardown, etc.) and appreciate the opportunity to make their code better. Also, if operational features are written as familiar agile stories, it becomes as 'natural' for the Dev team to address operational features as it is to address end-user features, identifying the Ops team as a set of users with real needs and requirements.
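As a rough illustration of the idea above, the run book topics mentioned in the text can be treated as a checklist that Dev and Ops fill in together, with any unanswered topic turned into an agile story for the backlog. This is only a minimal sketch: the `DraftRunBook` class, the `to_story` helper, and the story wording are hypothetical names invented for this example, and the topic list contains only the examples given above rather than a complete run book template.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Topics taken from the examples in the text; a real run book covers more.
RUN_BOOK_TOPICS: List[str] = [
    "daylight saving time changes",
    "data cleardown",
    "recovery from failover",
    "server patching",
    "troubleshooting",
]


@dataclass
class DraftRunBook:
    """Hypothetical model of a draft run book: topic -> notes agreed by Dev and Ops."""

    system: str
    entries: Dict[str, str] = field(default_factory=dict)

    def record(self, topic: str, notes: str) -> None:
        """Store the agreed notes for a known topic."""
        if topic not in RUN_BOOK_TOPICS:
            raise ValueError(f"unknown run book topic: {topic}")
        self.entries[topic] = notes.strip()

    def open_topics(self) -> List[str]:
        """Topics with no notes yet, in checklist order."""
        return [t for t in RUN_BOOK_TOPICS if not self.entries.get(t)]


def to_story(topic: str) -> str:
    """Phrase an unanswered topic as an agile story with Ops as the user (assumed wording)."""
    return (
        f"As an Ops engineer, I need a documented approach to {topic} "
        "so that the system can be run in Production."
    )


if __name__ == "__main__":
    book = DraftRunBook(system="example-system")  # hypothetical system name
    book.record("data cleardown", "Illustrative note agreed between Dev and Ops")
    for topic in book.open_topics():
        print(to_story(topic))
```

Each story printed this way is a prompt for a Dev and Ops conversation; the point is the discussion it triggers, not the list itself.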
Pre-Requisites for Success with the Draft Run Book
It is important to recognise two related but distinct benefits of run book collaboration:
- The software becomes more operable and so works better in Production.
- The Dev and Ops teams have collaborated on the design and execution of the software.
In fact, I recommend to organisations that they throw away the draft run book in order to emphasise that the purpose of the run book collaboration is to increase communication and trust between Dev and Ops, not to produce a giant document which might replace or prevent automation and monitoring.
There are a few other pre-requisites for success with run book collaboration:
- Talk about operational features, not Non-Functional Requirements
- Make the Dev team (or better, the Product Owner) partly responsible for the operational effectiveness of the software system
- Encourage and persuade Ops teams to engage with Dev teams during the development phase. This means Ops folk using agile practices like pairing on logging implementations, and attending stand-ups, planning meetings, and retrospectives
- Have technical management sufficiently savvy that they see value in the act of collaboration even if the artefact of collaboration (the draft run book) is discarded at the end of the process