That’s a wrap! Gremlin hosted Failover Conf 2: Fail Smarter on April 27, 2021. In attendance were over 500 SREs, developers, sales engineers, product managers, DevOps experts, C-level execs, and other reliability pros from around the globe! This year’s conference included discussions around the future of DevOps, strategies for building reliable teams, analyzing human error to create better systems, and more. FireHydrant was excited to be one of the sponsors of Failover Conf again this year. Here are some of the ideas from Failover Conf that we’re still thinking about!
Emily Freeman, author of “DevOps for Dummies,” talked about “What’s Next for DevOps.” She pointed out that “one of the unintended consequences of DevOps has been this pressure, or belief, that everyone has to do everything,” but in fact, “DevOps is not a methodology that encourages everyone to do everything. We all have specialties and we should lean into our strengths… while leaning on our colleagues in our team to buttress our individual weaknesses.” She added that practicing accountability during an incident means respecting the work of your peers: an awareness and respect that allows everyone to come to the table as equals.
Jeff Smith (Director of Production Operations at Centro) and Matt Stratton (Host of the Arrested DevOps podcast) led a fascinating fireside chat. Matt pointed out that “things like security and reliability aren’t features that customers are necessarily going to request--but they’re absolutely things that your customers expect and need from the product.” Jeff replied that when we think about security as an actual feature, it becomes clearer that security work needs to be prioritized. They also discussed aligning goals across teams so that all work is measured against the total product, pointing out that DevOps teams can proactively engage with other teams and offer their expertise to prevent problems before they occur. Matt compared Ops teams to corporate legal departments, saying “nobody knows about all the times legal keeps your company from being sued. It’s the same for Operations teams in that keeping things running smoothly is a non-event. So making Operations work more visible is important, and makes Ops teams part of the product lifecycle.”
Laura Santamaria, Developer Advocate at LogDNA, shared insights about human error, one of the most common causes of incidents around the world. What do we blame when we talk about human error? How do we say that accidents happen, and that mistakes are made? Laura pointed out that it’s OK to make a mistake. Whenever you see evidence of human error during an incident, there was probably a failed process behind that error, and there may be a mitigation process you can build to avoid the same problems in the future. She advocates using guidelines and guardrails: “guidelines” are best practices, the steps to follow that make the most sense, while “guardrails” are the checks and balances that keep you on the same path even when something goes wrong.
Our own CEO and Co-Founder, Robert Ross, delivered a talk about the very valuable practice of learning from failures in incident management. He shared stories from his experience as an SRE and described some best practices he’s learned along the way.
Check out the video of Bobby’s talk below!