Recovery: The Long and Short of It

Warning: The following is a somewhat geeky dive into the recovery resilience attribute. It is the first of a semi-periodic series of more technical topics investigating the subtleties of defining, quantifying, and understanding resilience. Your mileage may vary.

Recovery is the ability of a space system to regain some or all of its initial capability, after some period of time, following a loss due to a threat. The time that elapses before the capability is recovered is known as the recovery time and is illustrated in Figure 1. Recovery can be quantified using two parameters: the recovery time and the fraction of lost capability that is recovered. Normalized over a chosen time frame, the resulting value ranges from zero to one. Though the calculation is very straightforward, the selection of the time frame over which it is made is not, and there’s the rub.

Figure 1: Recovery follows a loss of system capability due to a threat. The recovery time is the time that elapses before steady-state recovery occurs.

In truth, there are two parts to the valuation of a recovery metric. The first is associated with the near-term impact to the system and its users. The second is the long-term, steady-state impact. The dividing line between the two on the time scale must be chosen with care. Often this boundary represents the endpoint of the “operationally relevant timeline,” where the immediate utility to the user has decreased to zero. That is, any additional recovery of capability does not materially benefit the user because the critical time period has passed. Beyond this time, though, there is the long-term benefit to users. Over this much longer period (perhaps longer by orders of magnitude) the duration of the recovery time becomes inconsequential and only the final recovered capability value is of interest. Viewed thus, it becomes clear that the choice of the boundary time is important in both valuing and quantifying the recovery metric. As the boundary point moves to the right towards infinite time, the overall impact of the short outage converges to zero.

Figure 2 demonstrates this concept. Two time periods are represented, partitioned by a “boundary time” at t = T. To the left is the operationally relevant time period from t0 to T. During this period there is user urgency, and value in having the capability restored as quickly as possible by minimizing recovery time. After time T there is little further value because the need no longer exists. But to perhaps a broader class of users the second period is of greater interest: the long-term steady-state period to the right of time T. In this case the relatively short service interruption starting at t0 is not important, but the long-term steady-state recovered capability value is. The difference between 50% and 100% recovery could be highly significant to future operations.

Figure 2: Valuing both recovery time and percentage recovered in the near term vs. simply percentage recovered in the long term

A simple mathematical example shows how much the choice of boundary time affects the result. Referring back to Figure 2, set the boundary time to T = 1, which normalizes the calculation to a maximum of 1, and let t0 = 0. Suppose that 80% of the lost capability is recovered at t = 0.5.

If the recovery metric RRV is defined as the product of the fraction of capability recovered and the fraction of the period up to time T during which that capability is available, then for this example RRV = (0.8)(1 - 0.5) = 0.4 for the near-term period up to t = T. If T is extended by a factor of 100 and the normalization is repeated, then RRV = (0.8)(1 - 0.005) = 0.796, and as T goes to infinity RRV simply converges to 0.8. So is the recovery metric 0.4 or 0.796? It clearly depends on the inherent “value” to the user(s). If some users truly see no additional value after time T, then the value is 0.4. But if the users are mainly concerned with the long-term system capability, then RRV is essentially 0.8. In truth, some users may be in the first category and others in the second, in which case both are true. This is why it can be difficult and sometimes inaccurate to try to boil resilience values down to a single number. Many systems support multiple missions or services and provide them to many different users simultaneously, so the impact of a threat may vary considerably depending upon the user.
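For readers who want to try other boundary times, the arithmetic above can be sketched in a few lines of Python. This is only an illustration under the simple step-recovery assumption used in the example (capability returns all at once at the recovery time and then stays constant); the function name recovery_value and its parameters are hypothetical, chosen here for clarity.

```python
def recovery_value(fraction_recovered, recovery_time, boundary_time, start_time=0.0):
    """Illustrative RRV: fraction recovered times the normalized share of the
    window [start_time, boundary_time] during which it is available.

    Assumes a single step recovery at recovery_time (a simplification).
    """
    window = boundary_time - start_time
    if window <= 0:
        raise ValueError("boundary_time must be later than start_time")
    # Capability restored after the boundary adds nothing within this window.
    available = max(0.0, boundary_time - recovery_time) / window
    return fraction_recovered * available


# Near-term view: boundary at T = 1, 80% recovered at t = 0.5.
near_term = recovery_value(0.8, 0.5, 1.0)

# Long-term view: same recovery, boundary extended by a factor of 100.
long_term = recovery_value(0.8, 0.5, 100.0)

print(round(near_term, 3), round(long_term, 3))
```

Sweeping boundary_time upward shows the value climbing from 0.4 toward the 0.8 limit, which is one way to see how strongly the chosen window drives the number reported for a given user.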

To sum up, recovery values calculated as part of a resilience assessment must be considered in the context of user and mission requirements in order to articulate the impacts of threats and threat mitigations fully and clearly. While the choice of a “boundary time” may appear arbitrary, careful consideration of these requirements should guide the designer and analyst toward meaningful results.
