HealthKey Technologies CITO Saeed Elnaj writes that classic disaster recovery strategy must now evolve beyond IT into an "all hazards" planning process.
We live in a world full of risks, threats, incidents, and disasters. Severe weather events are increasing in frequency, and we face more frequent and disruptive cyberattacks that are impacting gas pipelines and supply chains. We are confronting earthquakes, mass shootings, and a pandemic that is receding but not ending. And now the war in Ukraine is putting pressure on food and energy supply chains. Adding to the mix is economic uncertainty driven by inflation at levels not seen in over 40 years. So, how should technology leadership change to become part of a process for confronting such risks, threats, and disasters?
Traditional Disaster Recovery Planning
Disaster recovery (DR) has traditionally been seen as a key function of IT. IT leaders focused on preventing disasters by addressing the threats, vulnerabilities, and risks that could impact IT operations and mission-critical systems. Threats and hazards that did not seem to directly impact IT operations were left to other risk management teams.
IT practitioners generally approach DR from an IT perspective, such as how to ensure that IT systems have low or zero downtime. We look at the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for each system. We create risk mitigation heatmaps that list the identified risks along with the likelihood of each occurring and the impact if it materializes. Then we calculate possible costs and develop methods for mitigating the costs of such risks. Most of what we do as IT leaders has been within our IT department and domain, possibly with some interaction with other organizational units that are responsible for managing a wider set of threats and risks.
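As a rough illustration of the heatmap and RPO/RTO checks described above, here is a minimal Python sketch. The 1-to-5 scales, the band thresholds, the sample risks, and all names (Risk, heat_band, rank_risks, meets_objectives) are hypothetical choices made for this example, not part of any standard or of a specific organization's practice.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # A common heatmap convention: score = likelihood x impact.
        return self.likelihood * self.impact


def heat_band(score: int) -> str:
    """Map a score to a heatmap band (thresholds are illustrative)."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


def meets_objectives(data_loss_min, downtime_min, rpo_min, rto_min):
    """True if a DR test stayed within the system's RPO and RTO (minutes)."""
    return data_loss_min <= rpo_min and downtime_min <= rto_min


# Hypothetical risk register used only to demonstrate the scoring.
register = [
    Risk("Single disk failure", likelihood=3, impact=1),
    Risk("Ransomware on file servers", likelihood=4, impact=5),
    Risk("Regional power outage", likelihood=2, impact=4),
]

for risk in rank_risks(register):
    print(f"{risk.name}: score={risk.score}, band={heat_band(risk.score)}")
```

A real register would normally carry more than two dimensions (owner, mitigation cost, residual risk), but even this simple ranking shows how the heatmap turns a list of concerns into an ordered set of priorities.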
We apply methodologies for measuring and managing information risk, such as the Factor Analysis of Information Risk (FAIR), which is covered in the book Measuring and Managing Information Risk by Freund and Jones. Their methodology provides a framework for understanding, measuring, and analyzing information risks. It provides techniques for quantitative risk analysis and addresses risk theory, risk calculation, modeling, and how to communicate risk within the organization.
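To make the idea of quantitative risk analysis concrete, the following Python sketch estimates annual loss as loss event frequency multiplied by loss magnitude, the two top-level factors in FAIR. It is a simplified illustration, not the FAIR methodology itself: it assumes triangular distributions for both factors, and the function name simulate_annual_loss and every input value are hypothetical.

```python
import random
import statistics


def simulate_annual_loss(freq_min, freq_mode, freq_max,
                         loss_min, loss_mode, loss_max,
                         trials=10_000, seed=42):
    """Monte Carlo estimate of annual loss = frequency x magnitude.

    Frequency is in loss events per year and magnitude in currency per
    event. Both are drawn from triangular distributions, an assumption
    made here for simplicity.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    losses = []
    for _ in range(trials):
        # random.triangular takes (low, high, mode) in that order.
        frequency = rng.triangular(freq_min, freq_max, freq_mode)
        magnitude = rng.triangular(loss_min, loss_max, loss_mode)
        losses.append(frequency * magnitude)
    losses.sort()
    return {
        "mean": statistics.fmean(losses),
        "p90": losses[int(0.9 * (trials - 1))],  # approximate 90th percentile
        "max": losses[-1],
    }


# Hypothetical estimates: 0.1 to 2 events per year, 50k to 1M per event.
summary = simulate_annual_loss(0.1, 0.5, 2.0, 50_000, 200_000, 1_000_000)
print(summary)
```

Reporting a range (mean and 90th percentile) rather than a single number is one way to communicate the uncertainty in the estimates to the rest of the organization.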
But the pandemic and more recent man-made and natural disasters are forcing organizations to rethink their approach to risk, threat, and disaster management.
Time for a New Approach
One major lesson we have learned from the COVID-19 pandemic is that a more holistic enterprise-wide risk and disaster management approach is needed, an “all-hazards planning” approach that goes beyond IT and traditional risks and vulnerabilities. Organizations need an approach that does not focus on one specific hazard or risk, but on everything and anything, including a methodology for handling risks when they actually become incidents, crises, or disasters.
One of the methodologies and tools widely used for enterprise risk and disaster management that can be applied to “all-hazards” planning is the five stages of crisis management. This methodology includes two stages that need to be developed prior to incidents or disasters: Protection and Prevention. Post-incident, you need to have a process in place to develop the details for the response, recovery, and resiliency stages. These are the Survive, Revive, and Thrive phases.
Figure 1: Five stages of crisis management
Under the Prevent and Protect stages, we apply the heatmap risk assessment tool to identify, assess, and mitigate risks and vulnerabilities. But this is where things get complex. Which risks and threats should we account for? Which specific disasters should we plan and budget for? This is where art, science, past experiences, methodologies, literature, and events converge, which can make for difficult assessments and decisions. At the end of the day, it boils down to how much money an organization should invest in the prevention and protection phases, and how much it should set aside for managing incidents or disasters when they strike. None of these questions has an easy answer, but luckily there is plenty of guidance available in seminal books on this topic.
Build Robustness and Resiliency
In his influential book The Black Swan: The Impact of the Highly Improbable, Nassim Taleb argues that black swan events, such as 9/11, the 2008 financial crisis, and now the pandemic, are so rare that even the possibility that they might occur is unknown. But they tend to have catastrophic impact when they occur. More importantly, classical probability rules and prediction techniques, such as FAIR, do not apply in these cases. Applying statistical models based on the past is likely to make organizations more vulnerable to black swan events. Taleb says black swan events teach us that the past is not a good guide for managing future risks. Black swan events cannot be predicted, and therefore organizations should build robustness or resiliency so that they can survive stressful incidents.
Another equally important book on organizational risk management is Michele Wucker’s The Gray Rhino, in which she examines how organizations fail to recognize and act on obvious risks. Wucker argues that gray rhino events are more important than black swans since gray rhinos are highly probable, with high impact, but organizations often fail to act on them. She argues that the identification and fear of black swan events should not diminish our ability to manage the obvious threats in front of us. She advocates focusing on high-probability, high-impact events that are always imminent but never recognized.
Minimize Disaster Consequences
Additional insights can be gained from Juliette Kayyem, a senior lecturer at Harvard's Kennedy School of Government. In her book The Devil Never Sleeps, she makes a number of important observations. First, no matter how much we try to prepare for and prevent disasters, we must accept that they will eventually strike. Second, we need to have a well-established and well-tested Incident Command System (ICS) in place. Third, the right measure of success for disaster management is not whether we can avoid disasters, but how much less tragic or impactful they can be. Kayyem writes, “We must now view success through the lens of...consequence minimization.” Fourth, planning for catastrophic events requires thinking beyond the notion of a last line of defense and creating a “layered line of defenses.” As IT practitioners and engineers, we tend to create failsafe systems with traditional DR components and reliance on a few devices or instruments. The proper approach is to create “a series of procedures and training based on the irrefutable assumption that something will fail, and that our goal is to help them fail more safely.”
- Disasters are happening at a higher frequency with higher negative impact on society and organizations. There is no escaping them, and technology leaders need to be proactive in addressing such risks and disasters. The classical DR and risk assessment techniques are no longer sufficient for IT leaders.
- To address these more devastating disasters, technology leaders need to be members of an organizational risk and disaster management team that is proactive and that considers risks beyond traditional IT. The idea is to be involved and active in the all-hazards planning process.
- The evaluation of black swan and gray rhino events, as well as traditional IT risks, should be included in a risk and disaster management effort that uses FAIR and the five stages of crisis management. IT leaders should be part of this exercise even when the risks appear to be non-IT enterprise risks.