Leadership Development

Two Proposed KPIs to Measure IT Resiliency

By Saeed Elnaj

Jul 27, 2016

Imagine that your data center was in the path of Hurricane Sandy in 2012. You had a great backup solution and excellent high availability for your mission-critical systems, such as ERP, CRM, email, and revenue-generating systems like charging, billing, and e-commerce. You also had a disaster recovery (DR) solution in place, but not all systems were in full DR mode.

Or imagine that your organization was hit by a ransomware attack (like the 2016 attack on Hollywood Presbyterian Medical Center), and you did not fully realize that some systems lacked the “right protections.”

Suddenly, your organization is exposed to all kinds of losses and to degradation of its basic operations. If your mission-critical systems, or even your back-office systems, fail, your organization might not be able to generate revenue or meet its financial and legal obligations, resulting in major losses and possibly harming its future competitiveness.

Immediately after such a snafu, we start asking questions such as, “What could I have done to prevent or to lessen the impact of such an incident?”

The first precaution in preparing for such incidents is to answer two fundamental questions:

How resilient are our IT systems and services?
Are they resilient enough?

In fact, these two questions should be asked by the CEO and the board. If they are not asking them, they are not in full control of managing or enhancing shareholder value.

A great starting point for discussing operational resiliency is Carnegie Mellon's CERT Resilience Management Model (CERT-RMM). It defines operational resiliency and provides a methodology and a model for establishing operational resiliency management processes. However, CERT-RMM was not designed to answer the question of how to measure and gauge resiliency. To answer the two resiliency questions above, I propose two new Key Performance Indicators (KPIs) that attempt to measure the operational resiliency of IT systems and services.

#1. The IT Resiliency KPI

While the process of calculating the resiliency KPI (rKPI) may initially seem complicated and highly mathematical, it is actually rather simple. Here are the steps for constructing the rKPI:

Create an inventory of high-value IT assets.
Additional assets, with medium to lower value, can be included at a later stage following the same approach. Your portfolio management team will be perfectly capable of preparing this spreadsheet.
Rank the assets between 1 and 10 in terms of impact on revenue.
Let’s call this parameter the Asset Ranking (AR). It is possible, and recommended, to rank the IT assets more rigorously by calculating the dollar loss per day, week, or month that the organization would incur if the asset were not available. The losses are then summed across all assets, and each asset is ranked by dividing its unavailability cost by that total.
Identify the full replacement value for each asset.
Let’s call this parameter the Asset Replacement Value (ARV). This is the cost to replace or recreate the system if it is completely destroyed. It should include the software, hardware, operating system, monitoring and security tools, and the cost of the facility (data center) required to operate the system end to end. Many of these costs can be obtained from the annual IT OPEX budget.
Identify the systems that have high-availability, DR, and security protection measures and the financial cost to make them resilient.
Let’s call this parameter the Asset Resiliency Cost (ARC). If cloud hosting services are used, they can play a role in determining the ARC, but you need to factor the SLAs and DR parameters from your cloud hosting agreement into the calculation.
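The more rigorous ranking described in the second step can be sketched in a few lines of Python. This is only an illustration: the function name `asset_rankings`, the sample loss figures, and in particular the mapping from each asset's share of total loss onto the 1 to 10 scale are assumptions, since the article does not prescribe how the share should be converted to a rank.

```python
def asset_rankings(loss_per_day):
    """Derive a 1-10 Asset Ranking (AR) from estimated daily unavailability cost.

    loss_per_day maps an asset name to its estimated dollar loss per day.
    """
    total = sum(loss_per_day.values())
    if total <= 0:
        raise ValueError("total unavailability cost must be positive")

    # Share of the total unavailability cost carried by each asset.
    shares = {name: loss / total for name, loss in loss_per_day.items()}

    # Assumption: scale so the costliest asset ranks 10, with a floor of 1.
    top = max(shares.values())
    return {name: max(1, round(10 * share / top)) for name, share in shares.items()}


# Hypothetical daily loss estimates, in dollars.
losses = {"ERP": 50_000, "Billing": 100_000, "Email": 10_000}
rankings = asset_rankings(losses)
print(rankings)  # {'ERP': 5, 'Billing': 10, 'Email': 1}
```

Any other monotonic mapping would serve as well, as long as the same scale is applied to every asset in the inventory.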

Finally, to calculate the rKPI, multiply the Asset Resiliency Cost (ARC) of each resilient IT asset by its Asset Ranking (AR) and add up the results for all resilient assets. Then divide this sum by the sum of the Asset Replacement Value (ARV) multiplied by AR for all assets (resilient and non-resilient). The result is a percentage, as shown by the formula below:

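$$\text{rKPI} = \frac{\sum_{i \,\in\, \text{resilient assets}} \text{ARC}_i \times \text{AR}_i}{\sum_{j \,\in\, \text{all assets}} \text{ARV}_j \times \text{AR}_j} \times 100\%$$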

The higher the rKPI percentage, the more resilient the IT organization is.
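As a rough illustration, the rKPI calculation described above could be run over an asset inventory like this. The `Asset` class, the convention that an ARC of zero marks a non-resilient asset, and all sample figures are assumptions made for the sketch, not part of the proposed method.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    ar: int           # Asset Ranking, 1-10
    arv: float        # Asset Replacement Value, in dollars
    arc: float = 0.0  # Asset Resiliency Cost; 0 marks a non-resilient asset (assumption)


def rkpi(assets):
    """Return the resiliency KPI as a percentage."""
    # Numerator: rank-weighted resiliency spend, resilient assets only.
    protected = sum(a.arc * a.ar for a in assets if a.arc > 0)
    # Denominator: rank-weighted replacement value, all assets.
    replaceable = sum(a.arv * a.ar for a in assets)
    if replaceable == 0:
        raise ValueError("inventory has no rank-weighted replacement value")
    return 100.0 * protected / replaceable


# Hypothetical inventory.
inventory = [
    Asset("ERP", ar=8, arv=1_000_000, arc=200_000),
    Asset("Billing", ar=10, arv=500_000, arc=150_000),
    Asset("Email", ar=3, arv=200_000),  # no resiliency measures in place
]
print(f"rKPI = {rkpi(inventory):.1f}%")  # rKPI = 22.8%
```

In this made-up inventory, adding resiliency measures to the email system would raise the numerator while leaving the denominator unchanged, which is how new investments move the rKPI upward.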

#2. The IT Resiliency Exposure Value

The resiliency KPI by itself is a useful but insufficient measure of IT resiliency. To better understand the rKPI value and to answer the second question, “Are they resilient enough?”, the rKPI needs to be paired with a financial indicator that helps interpret it.

The CIO should know what it means to be 30%, 40%, or even 90% resilient, and whether that is enough. More information is needed to understand the impact of this KPI on the enterprise in specific dollar terms. This is done by calculating what can be called the Resiliency Exposure Value (REV). REV can be calculated as follows:

Identify the full Asset Replacement Value (ARV) for each non-resilient IT asset. This was already done during the calculation of the rKPI.
Calculate the Loss of Service Cost (LoSC) per day for each partially resilient or non-resilient asset and multiply it by the Replacement Time (RT), i.e. the number of days required to fully replace the asset. RT should include the estimated time it takes to order, install, and configure the hardware, software, data recovery, and security tools needed to fully operate the asset.

The Resiliency Exposure Value is then calculated by adding, for each partially resilient or non-resilient asset, its ARV to its LoSC multiplied by its RT, and summing the results across those assets. The final formula looks as follows:

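$$\text{REV} = \sum_{i \,\in\, \text{partially or non-resilient assets}} \left( \text{ARV}_i + \text{LoSC}_i \times \text{RT}_i \right)$$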

Now that the exposure value is calculated, the rKPI value can be better understood and better managed.
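The REV calculation can be sketched the same way: for each partially resilient or non-resilient asset, add its replacement value to its daily loss of service cost multiplied by its replacement time, then sum across those assets. The `ExposedAsset` class and the sample numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExposedAsset:
    name: str
    arv: float           # Asset Replacement Value, in dollars
    losc_per_day: float  # Loss of Service Cost per day, in dollars
    rt_days: float       # Replacement Time, in days


def resiliency_exposure_value(assets):
    """Sum ARV + LoSC * RT over partially resilient or non-resilient assets."""
    return sum(a.arv + a.losc_per_day * a.rt_days for a in assets)


# Hypothetical assets without full resiliency.
exposed = [
    ExposedAsset("Email", arv=200_000, losc_per_day=10_000, rt_days=14),
    ExposedAsset("Reporting", arv=100_000, losc_per_day=2_000, rt_days=30),
]
rev = resiliency_exposure_value(exposed)
print(f"REV = ${rev:,.0f}")  # REV = $500,000
```

With these made-up figures, lost service accounts for $200,000 of the $500,000 exposure, so shortening the replacement time reduces REV even when the replacement value stays the same.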

REV can also be divided by the organization’s average revenue over the same period of time to determine the percentage loss, which puts the financial loss in better context. If the exposure value or percentage is not acceptable to the executive team or to the board, then a new rKPI target can be established, and plans and investments can be put forward to reduce the exposure value and improve the rKPI.
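A minimal sketch of that last check follows. It assumes the acceptable exposure is expressed as a maximum percentage of revenue; the threshold, the function names, and the figures are all illustrative rather than taken from the article.

```python
def exposure_percentage(rev, revenue):
    """Express the Resiliency Exposure Value as a percentage of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return 100.0 * rev / revenue


def exceeds_tolerance(rev, revenue, max_percent):
    """True when the exposure is above what leadership will accept (assumed threshold)."""
    return exposure_percentage(rev, revenue) > max_percent


# Hypothetical figures: $500,000 exposure against $20M average revenue.
pct = exposure_percentage(500_000, 20_000_000)
needs_new_target = exceeds_tolerance(500_000, 20_000_000, max_percent=1.0)
print(f"Exposure is {pct:.1f}% of revenue; new rKPI target needed: {needs_new_target}")
```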

While the formulas above look highly mathematical, in reality the input values are far from exact. As in fields such as economics, where exact math does not capture the full reality, the assumptions and approximations used here yield measurements that are indicative rather than exact, yet far better than “hunches and gut feelings.”

As inaccurate as they may be, the Resiliency KPI and the Resiliency Exposure Value are good tools for answering the questions “How resilient are our IT systems?” and “Are they resilient enough?”, especially at a time when IT assets are exposed to increasingly higher risks. As Hunter and Westerman advise, “don’t let the search for the perfect number keep you from using a useful, if imperfect, metric.”

Written by Saeed Elnaj

Saeed Elnaj is the CIO for RELI Group. Earlier, he was Chief Information and Technology Officer (CITO) at HealthKey Technologies, and Vice President of IT and CIO at the National Council on Aging. He has over 25 years of IT experience with industry leading organizations that also include Oracle, Ericsson, and Project Concern International.