Operability is all about customer experience and service viability, not short-term feature delivery. Address operability early on!
Continuous delivery needs good engineering practices, fast feedback from deployment practices, re-aligned architecture & team ownership of software and services.
Use 5 operability techniques:
Modern event-based logging - focus on the intent, not the tools
Use Run Book dialogue sheets
Lightweight User Personas for Ops
Multi-dimensional engineering assessment: Team Health, Deployment, Continuous Delivery, Flow, Operability, Testing.
Operational aspects are also features
Make space for learning and sharing - promote good work and help to develop skills in speaking.
Involve teams in improvements - co-create engineering standards
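To make the event-based logging point above more concrete, here is a minimal sketch in Python. It assumes the idea is to log named, machine-readable events rather than free-form strings; the event names, IDs and fields are hypothetical and not taken from the talk.

```python
import json
import logging
from enum import Enum


class Event(Enum):
    # Hypothetical event identifiers; a real service would define its own.
    BASKET_ITEM_ADDED = 1001
    PAYMENT_FAILED = 2001


def format_event(event: Event, **fields) -> str:
    """Render one event as a single JSON line that any log tool can parse."""
    record = {"event": event.name, "event_id": event.value, **fields}
    return json.dumps(record, sort_keys=True)


def log_event(logger: logging.Logger, level: int, event: Event, **fields) -> None:
    """Log the event through the standard library logger."""
    logger.log(level, format_event(event, **fields))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("shop")
    log_event(log, logging.INFO, Event.BASKET_ITEM_ADDED, basket_id="b-42", sku="ABC")
    log_event(log, logging.ERROR, Event.PAYMENT_FAILED, basket_id="b-42", reason="card_declined")
```

Because each line carries a stable event name and ID, the intent stays searchable no matter which logging tool ends up storing it.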
Angie Jones (Applitools) - “The build that cried broken: building trust in your continuous integration tests”
You can learn a lot from fables!
The shepherd boy and the wolf - regarding testing logs: when you keep getting messages saying that the system is broken while it works fine, you stop caring and do not notice when it breaks for real - make sure you are getting only relevant error messages.
The fox and the goat - look before you leap: if your end goal is fast feedback, do not make the mistake of caring just about automating everything whether it makes sense to automate it or not - make sure you do not forget the end goal.
The lioness and the vixen - quality over quantity: make sure you have relevant tests; it is not the number of different tests that matters but their quality.
Manage flaky tests - the ones that sometimes fail and sometimes pass:
source of truth, shortcuts
Magician: don’t value new code over the code you already have
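As a rough illustration of managing flaky tests, here is a small Python sketch that flags them from CI history. It assumes a test counts as flaky when it has both passed and failed on the same code revision; the tuple format and the test names are made up for the example.

```python
from collections import defaultdict


def find_flaky(results):
    """Return sorted names of tests that both passed and failed on one revision.

    `results` is an iterable of (test_name, revision, passed) tuples,
    a hypothetical export format from a CI system.
    """
    outcomes = defaultdict(set)
    for name, revision, passed in results:
        outcomes[(name, revision)].add(bool(passed))
    flaky = {name for (name, _rev), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


if __name__ == "__main__":
    history = [
        ("test_login", "abc123", True),
        ("test_login", "abc123", False),    # same revision, different outcome
        ("test_checkout", "abc123", False),
        ("test_checkout", "def456", True),  # changed outcome after a code change
    ]
    print(find_flaky(history))
```

A test that fails on one revision and passes on the next is not flagged, because the code change itself may explain the difference.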
Hannah Foxwell (Pivotal) - “Reliability Engineering for Humans”
#HumanOps - the wellbeing of human operators impacts the reliability of the systems.
Failure is normal - we need to change our focus from a preventative approach to being prepared when failure happens - it will happen.
Blameless incident review: we don’t call them postmortems anymore - nobody died, failure is normal.
SLIs, SLOs and Error Budget - acceptable amount of failure. You do not need to be the size of Google to use those practices.
100% reliability is not your target. So what is? 99%? Set your service level objectives and set them realistically.
Enforce your error budgets, make sure everyone understands the error budget and how it works.
Psychological safety is important: it is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes. Teams were measured on psychological safety, error rates and team performance. Higher psychological safety correlated with higher error rates; however, higher error rates correlated with higher team performance.
Toil - anything that has no enduring value. Toxic toil is the kind that wakes you up at night, ruins evenings and weekends, interrupts your work and distracts you.
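The SLO and error budget points above come down to simple arithmetic, sketched here in Python. This assumes a request-based SLI (the fraction of successful requests); the function names and the rule of pausing risky releases once the budget is spent are illustrative assumptions, not something prescribed in the talk.

```python
from fractions import Fraction


def allowed_failures(slo: str, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates in the measurement window.

    `slo` is passed as a decimal string such as "0.99" so the arithmetic
    stays exact instead of picking up floating-point rounding.
    """
    budget = (1 - Fraction(slo)) * total_requests
    return int(budget)  # round down to whole requests


def budget_remaining(slo: str, total_requests: int, failed_requests: int) -> int:
    """Error budget left; a negative value means the budget is overspent."""
    return allowed_failures(slo, total_requests) - failed_requests


def release_allowed(slo: str, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical enforcement rule: risky releases only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    print(allowed_failures("0.99", 1_000_000))
    print(budget_remaining("0.999", 1_000_000, 400))
    print(release_allowed("0.999", 1_000_000, 1_200))
```

For example, a 99% objective over one million requests tolerates 10,000 failed requests, which is the "acceptable amount of failure" everyone on the team needs to understand.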
Nigel Kersten (Puppet) - “WHY ARE WE ALL SUCH HYPOCRITES WHEN IT COMES TO DEVOPS?”
We talk about empathy between Dev and Ops, but what about the rest of the business? Executives, marketing, finance, sales?
Different perceptions on different sides of the business: the C-suite, management and teams usually have different views of the business; the higher up, the more optimistic it gets.
Impoverished communication leads to generalizations and inaccuracies:
Up and down the communications ladder - communication upwards isn’t accurate
We generally try to report good things, so if anything is not going well it often does not get reported to upper management, and they think that everything is going well until it is not.
Authenticity - the more you act like yourself, the more likely employees are to trust and understand you.
Optimism bias - we end up looking at the wrong attributes. Biologically, our brains are not made for office work; the brain takes shortcuts that are no longer useful in our way of life.
Interpersonal distance increases optimism bias
Dial down management cynicism - we are all responsible. Managers are not the only hypocrites - we often delude ourselves that our software is better than it is.
Overall, the main theme for both days seems to be the importance of psychological safety and good communication skills - Maslow's pyramid of needs was mentioned in most of the talks I managed to see, as well as blameless culture and clear and precise communication up and down the chain of command. Also, AWS, operability and the importance of good logging practices were some other common themes.