About the Author:
Stephen M. Dick – VP Cloud Engineering (DevOps | SRE | FinOps). Stephen leads and matures modern Cloud Engineering teams that accelerate software releases (DevOps) and ensure highly reliable systems (SRE) with cost-optimized cloud infrastructure (FinOps).
At 8:45pm on a warm summer evening at a Data Center just outside of Washington, D.C., crickets sang through the night. At 8:46pm a gentle breeze swept through the trees. At 8:47pm, the Data Center suffered a loss of power and multiple redundant power systems failed. This caused emergency personnel across the country and 10,000 other Salesforce customers to lose access to their online business applications hosted at the Data Center. The disruption of service (known as an ‘incident’) lasted almost an entire day. Some customers lost data, permanently. Marc Benioff issued apologies on Twitter and, over the next few weeks, executives called 472 customers to explain what happened. Social media produced dozens of internet memes as business users struggled with the sudden loss of one of their most critical applications. Analysts estimate Salesforce lost $20M of revenue.
Shortly after, I stepped into a new role at Salesforce with the directive of radically improving the reliability of Salesforce products, maturing how the company responded to large-scale incidents and integrating acquired companies into a unified reliability practice. We sought insights from industry titans and discovered these 3 secrets of modern incident management:
1. Measure How Incidents Impact Revenue and Productivity
In the digital world, customers pay for uptime, and businesses pay for downtime. When technology fails, the stakes can be high. Downtime can harm a company’s reputation, erode customer trust, and cause significant financial impact. But there are hidden costs too. Productivity is hampered when multiple teams spend weeks in damage control, arranging customer calls to rebuild trust after incidents. Customer Trust, like a bank account, can be drained by one large withdrawal of trust or many smaller withdrawals through multiple, smaller incidents.
State of DevOps research has indicated for years that time to restore service is a critical predictor of business performance. Yet Mean Time to Resolve (MTTR) alone fails to capture the ‘productivity drag’ that incidents, at any volume, can impose on a business. For this reason, the DORA research team indicated in the 2022 State of DevOps report that “reliability excellence” is an equally critical measure of software delivery performance.
Reliability excellence includes the proactive side of SRE, like self-healing infrastructure, Poka Yoke concepts and well-defined distributed systems with redundancy and failover. It also includes the reactive elements of SRE, like Incident Management, which requires ongoing training, development and a commitment to excellence.
These are some broadly tracked metrics used at Salesforce to ensure a deeper understanding of how reliability impacts the business:
- Revenue Impacted per Incident. Why it matters: This metric provides an indication of how much revenue incidents ‘touch’ over time and provides a gauge of the overall trust deficit that could impact the bottom line.
- SLA Claims per Quarter. Why it matters: This is a hard dollar measure of how incidents impact the bottom line through discounts, concessions or SLA rebates a company makes per quarter.
- Number of Post-Incident Customer Calls. Why it matters: Large incidents can result in a high number of customer calls intended to rebuild trust and provide insight into what happened. Measuring this provides insight into the productivity drag incidents can have on your teams.
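As a rough illustration of how these three metrics could be rolled up, here is a minimal Python sketch. It is not how Salesforce computes them: the `Incident` record, its field names and the sample figures are all hypothetical, and how revenue ‘touched’ by an incident gets estimated is left open.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    """One customer-impacting incident. All fields are hypothetical."""
    incident_id: str
    opened: date
    revenue_impacted: float  # revenue 'touched' by the incident, in dollars
    sla_claims: float        # discounts, concessions or rebates paid, in dollars
    customer_calls: int      # post-incident calls made to rebuild trust


def quarter_of(day: date) -> str:
    """Return a label such as '2024-Q1' for the calendar quarter of a date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def summarize(incidents: list[Incident]) -> dict[str, dict[str, float]]:
    """Roll the three metrics up by calendar quarter."""
    buckets: dict[str, list[Incident]] = defaultdict(list)
    for incident in incidents:
        buckets[quarter_of(incident.opened)].append(incident)

    summary: dict[str, dict[str, float]] = {}
    for quarter, items in sorted(buckets.items()):
        summary[quarter] = {
            # Average revenue touched per incident in the quarter.
            "revenue_impacted_per_incident":
                sum(i.revenue_impacted for i in items) / len(items),
            # Hard-dollar SLA claims paid in the quarter.
            "sla_claims": sum(i.sla_claims for i in items),
            # Total trust-rebuilding calls in the quarter.
            "customer_calls": sum(i.customer_calls for i in items),
        }
    return summary


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = [
        Incident("INC-1", date(2024, 1, 15), 100_000.0, 5_000.0, 12),
        Incident("INC-2", date(2024, 2, 20), 300_000.0, 0.0, 3),
        Incident("INC-3", date(2024, 5, 2), 50_000.0, 1_000.0, 1),
    ]
    for quarter, metrics in summarize(sample).items():
        print(quarter, metrics)
```

Tracking these per quarter, next to MTTR, makes it easier to see whether many small incidents are draining trust as much as one large one.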
2. During an Incident, One Person Needs to Lead
During “peace time,” or business as usual, decisions are typically made by a hierarchy of managers. During an incident, decision-making needs to pivot to a “war time” mentality: clear, well-tested under pressure and deeply connected to the details of the incident. The org chart needs to evolve into a ‘War Time’ scenario where everyone in the business can be enlisted to support incident resolution efforts.
The Incident Management System (IMS) provides a structured framework for managing critical incidents and has evolved into a national standard across the United States for managing various public safety emergencies. Software companies use IMS to manage customer-impacting events, ensuring a swift response when services are degraded or unavailable. The system calls for the use of an ‘Incident Commander’ role: one person at the company, well trained and armed with techniques used by FEMA and other emergency responders, who drives the collective resolution efforts at a time when the business is at its most vulnerable. In this ‘War Time org chart’, everyone in the company has a dotted line to the Incident Commander.
The Incident Commander role, and the related change in how a company behaves during an incident, is nuanced and can be a difficult transition for some. It requires context, facilitated discussion, ongoing training, reinforcement and buy-in from all levels.
3. Seconds Matter
During an incident, the ‘Game Clock’ is ticking. It starts when an incident is declared and stops when it’s resolved. Every second matters – to your customers waiting on your product and to your business, which may be hemorrhaging revenue and reputation with each passing minute. Because of this, incident responders need to be trained to communicate in a way that compresses minutes into mere seconds.
One secret to doing this: the venerable ‘CAN’ update.
‘CAN’ stands for:
- Conditions: What is currently happening in the incident
- Actions: What the active swimlanes are and who is on deck to drive them
- Needs: What open needs must be met to ensure a successful resolution
Providing updates in this format gives teams a consistent, streamlined way to resolve incidents more effectively, for several reasons:
- Clarity and focus: CAN updates provide a concise, structured snapshot of the current state of the incident. By clearly outlining the conditions, actions, and needs, these updates help maintain focus on the most critical aspects of the incident, ensuring that all team members are on the same page.
- Efficient decision-making: With a shared understanding of the current conditions, actions taken, and outstanding needs, decision-makers can quickly assess the situation and make informed decisions regarding resource allocation and next steps.
- Ongoing progress tracking: Regular CAN updates allow the Incident Commander (IC) and other stakeholders to monitor the progress of the incident response. By staying informed about the current state of the incident, the IC can better manage the overall response, adjust plans as needed, and ensure that the team stays on track to resolve the incident quickly.
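To make the CAN format concrete, here is a minimal Python sketch of a CAN update stamped with time on the ‘Game Clock’. The `CanUpdate` class, the `elapsed` and `render` helpers, the output layout and the sample incident details are all assumptions made for illustration; they are not a standard or any particular tool's format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CanUpdate:
    """A single CAN update (hypothetical structure)."""
    conditions: str                                 # what is currently happening
    actions: list[str]                              # active swimlanes and their owners
    needs: list[str] = field(default_factory=list)  # open needs, possibly none


def elapsed(declared_at: datetime, now: datetime) -> str:
    """Format time on the 'Game Clock' as HH:MM:SS since the incident was declared."""
    total = max(0, int((now - declared_at).total_seconds()))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def render(update: CanUpdate, declared_at: datetime, now: datetime) -> str:
    """Render the update as a short, fixed-layout text block."""
    lines = [
        f"CAN update at T+{elapsed(declared_at, now)}",
        f"Conditions: {update.conditions}",
        "Actions:",
    ]
    lines += [f"  - {action}" for action in update.actions]
    lines.append("Needs:")
    lines += [f"  - {need}" for need in update.needs] or ["  - none"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up incident details, for illustration only.
    declared = datetime(2024, 1, 1, 20, 47, 0)
    now = declared + timedelta(minutes=25)
    update = CanUpdate(
        conditions="Logins failing for customers on one instance",
        actions=["Database team: fail over to standby (owner: DB on-call)"],
        needs=["Network SME to join the bridge"],
    )
    print(render(update, declared, now))
```

Keeping the layout fixed means responders always know where to look for conditions, actions and needs, which is where the time savings come from.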
Like all changes to communication patterns within a business, training, repetition and the setting of clear standards with ongoing routine assessments are required to ensure the behavior sticks.
On May 9th, 2016, business users, 911 operators and sales agents at a regional train operator learned that even Salesforce, the premier Cloud company, can have disruptions to its service. A culture of reliability emerged at Salesforce in the aftermath, including large-scale investments in incident training, ongoing assessments and education. Everyone, including the C-suite, would come to expect excellence during incident execution.
These lessons continue to be learned in other industries, other companies and other circumstances, and now you know some of the hard-earned secrets. Companies that carefully measure the impact of incidents see the need for training and development, tabletop exercises and game day drills that hone the skills of all personnel involved in incidents. These are the businesses that have had the most success navigating publicly scrutinized events when trust is most on the line.
About the Blackrock 3 Partners:
Collectively, the Blackrock 3 team has over 100 years of experience in the Fire Service, Law Enforcement, Counter-Terrorism, Anti-Proliferation and Critical Infrastructure, operating around the globe. Trusted by clients that rank in the top 10% of the Fortune 500, they have trained, evaluated and exercised thousands of Incident Commanders (IC) and Subject Matter Experts (SME) working in Global Command Centers, Emergency Operations Centers, Regional Operations Centers and War Rooms. Those incident responders staff functional teams including Site Reliability, Computer Security Incident Response Teams, Mission Critical Support, Unified Command, Operations and Engineering/Technology (Network, Database, SAN/Storage, Server, Automation, Applications).