The Resilience Value of Path Diversity
One of the revolutions in the New Space Era is the investment in highly proliferated or distributed space systems, particularly with regard to the space segment. A number of drivers have led to this change in both the government and commercial sectors. Most of the proposed and deployed proliferated space layers are in low Earth orbit (LEO), hence the general moniker P-LEO (Proliferated LEO). LEO systems provide considerable communications link improvements, including lower signal loss to terrestrial users and reduced latency. The greater number of satellites also insures against the loss of any single satellite in the constellation, which allows lower per-satellite costs through reduced on-board redundancy while still yielding greater system-level reliability. The inherent value of each individual satellite is reduced to the point where it can be lost with little to no impact on system performance. The system-level resilience to external threats increases for the same reason. In resilience terms, the proliferation and distribution of capability result in increased architectural robustness at the system or space segment level.
Space systems are largely information systems and are increasingly implemented as extensions of terrestrial data networks to space. As more satellites are added to the constellation this becomes literal: if many of these satellites can be interconnected via space-to-space links (intersatellite links), then the number of active nodes and paths in the network increases to the point that a true mesh is created. The number of available data paths in particular can grow much faster than the number of satellites, in some cases approaching exponential growth. This provides added reliability and resilience because data can be routed around failed or disabled nodes (satellites) in the constellation. This approach is usually referred to as “path diversity” and has been used for decades to enhance the reliability of terrestrial telecommunications systems.
While there is no argument that path diversity is a means of increasing resilience, there is still the question of how much diversity is required to provide a specific resilience level. The answer ripples into the requirements for the number of satellites (beyond the minimum needed to meet coverage and capacity requirements) and the number of interconnections among them. Depending upon the answers desired, this calculation can become very complex as the system size grows. Given a homogeneous space topology of N satellites, each with M interconnects, mathematical analysis can provide an answer, but it starts with the right question. Most of the proposed P-LEO systems are designed to support a wide range of users and missions simultaneously, so analyzing the impact to a single user-to-user communications link or path is much different from identifying the impact of losing one or more nodes on a community of users. These are important questions to pose and answer, lest the system be either over- or under-designed.
A simple example shows how this problem can quickly become complicated. Figure 1 shows a very simple LEO satellite system topology in the shape of a dodecahedron, a 12-sided solid. Twenty satellites occupy the vertices, and each of the 30 edges represents an intersatellite link, three per node. User access points are at A and B. In principle, data may be transmitted from A to B via any available path from source to destination (and vice versa). A path of length five is shown in the diagram, which is also the minimum path length for these two antipodal nodes. For a LEO constellation this is a snapshot in time, since the satellites orbit and users are handed off from satellite to satellite. It is assumed that each node contains a routing element to direct data to its destination, as in conventional networks.
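The topology just described can be reproduced in a few lines. The following is a minimal sketch, not the construction behind Figure 1: it assumes the dodecahedral graph is built as the generalized Petersen graph GP(10, 2), and the node numbering (0 to 9 on an outer ring, 10 to 19 on an inner ring), the choice of node 0 as A, and the function names are all illustrative assumptions.

```python
from collections import deque


def dodecahedron():
    """Adjacency sets of the dodecahedral graph.

    Built as the generalized Petersen graph GP(10, 2). The numbering
    (0-9 outer ring, 10-19 inner ring) is an assumption of this sketch.
    """
    adj = {n: set() for n in range(20)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(10):
        link(i, (i + 1) % 10)            # outer ring
        link(i, 10 + i)                  # spoke to the inner ring
        link(10 + i, 10 + (i + 2) % 10)  # inner ring, step of 2
    return adj


def hop_counts(adj, source):
    """Breadth-first search: minimum number of hops from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


adj = dodecahedron()
A = 0                                   # hypothetical label for access point A
dist = hop_counts(adj, A)
B = max(dist, key=dist.get)             # the antipodal node is the farthest one
num_links = sum(len(nbrs) for nbrs in adj.values()) // 2

print(len(adj), num_links, {len(nbrs) for nbrs in adj.values()}, B, dist[B])
# prints: 20 30 {3} 5 5
```

The printed values are the 20 nodes, 30 links, three links per node, and the minimum A-to-B path length of 5 quoted in the text; in this numbering the antipode of node 0 is node 5.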
If each link can support a continuous bidirectional data rate of 1 Gbps, and each node can continuously route a total of 6 Gbps among its three ports (3 Gbps in and 3 Gbps out), then the total system communications capacity at any given time is C = 30 * 2 * 1 = 60 Gbps. Ideally this traffic is more or less uniformly distributed across the constellation, although in practice this is unlikely to be true since users are not evenly distributed across the globe. While the minimum path length from A to B is 5, it can be shown that many longer paths exist, and for this configuration the total is 594, through path length 15. These are restricted to “simple paths”, meaning the data encounters each node no more than once. Given the symmetry of the constellation, this is true for any pair of antipodal nodes such as A and B. Given these simple assumptions, the impact of lost nodes in the system due to an external threat can be studied and the decrease in resilience estimated.
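Path counts of this kind can be cross-checked by brute force. The sketch below enumerates every simple path between A and B with a depth-first search and tallies them by length. The graph construction (generalized Petersen graph GP(10, 2)), the node numbers 0 and 5 for A and B, and the function names are assumptions made for illustration, not part of the original analysis.

```python
from collections import Counter


def dodecahedron():
    """Dodecahedral graph as GP(10, 2); node numbering is an assumption."""
    adj = {n: set() for n in range(20)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(10):
        link(i, (i + 1) % 10)            # outer ring
        link(i, 10 + i)                  # spoke to the inner ring
        link(10 + i, 10 + (i + 2) % 10)  # inner ring, step of 2
    return adj


def simple_path_lengths(adj, source, target):
    """Histogram {hops: count} over all simple paths from source to target."""
    histogram = Counter()
    visited = {source}

    def extend(node, hops):
        if node == target:
            histogram[hops] += 1
            return
        for nxt in adj[node]:
            if nxt not in visited:       # "simple": never revisit a node
                visited.add(nxt)
                extend(nxt, hops + 1)
                visited.remove(nxt)

    extend(source, 0)
    return histogram


adj = dodecahedron()
A, B = 0, 5                              # antipodal pair in this numbering
histogram = simple_path_lengths(adj, A, B)

for hops in sorted(histogram):
    print(hops, histogram[hops])
print("shortest:", min(histogram), "paths at that length:", histogram[min(histogram)])
print("all simple paths:", sum(histogram.values()))
```

For a 20-node graph the exhaustive search finishes almost instantly; for a large constellation the number of simple paths grows too quickly for enumeration, and a bounded path length or a sampling approach would be needed.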
Consider first the impact of the loss of a single node, excluding A and B. At the system level the three links adjacent to the lost node are also now unusable, meaning a loss of their data-carrying capacity of 3 * 2 = 6 Gbps. The system capacity has then been reduced by 6/60 = 10%, for a resilience value (robustness) of 0.9. This will be true for the loss of any random node in the system, assuming all other links are at maximum capacity and cannot absorb the loss. The second impact, from a routing perspective, is that the paths that the disabled node supported are also rendered unusable. In the worst case, consider the shortest path length of 5, for which only 6 simple paths between A and B exist. If any one of the nodes adjacent to A or B is disabled, fully 2 of the 6, or 1/3 of these shortest paths, become unavailable. It can be argued that this encompasses only 6 of the remaining 18 nodes for a given A and B, but if every node is an access point, and thus is A or B for some group of users, then someone will likely be affected in this manner by any nodal outage. And while this scenario may not completely disconnect any pair of active nodes, it may result in other impacts such as increased latency due to the need to route through a longer path.
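Both effects of a single node loss can be checked numerically. The following sketch computes the remaining capacity and the number of surviving shortest paths when one neighbor of B is disabled. It assumes the GP(10, 2) construction of the dodecahedron, A and B at nodes 0 and 5, and the 1 Gbps per-direction link rate from the text; the function names and the constant `LINK_RATE_GBPS` are hypothetical.

```python
from collections import deque

LINK_RATE_GBPS = 1.0  # per direction, as assumed in the text


def dodecahedron():
    """Dodecahedral graph as GP(10, 2); node numbering is an assumption."""
    adj = {n: set() for n in range(20)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(10):
        link(i, (i + 1) % 10)
        link(i, 10 + i)
        link(10 + i, 10 + (i + 2) % 10)
    return adj


def capacity_gbps(adj, lost=frozenset()):
    """Total capacity of the links whose two end nodes are both still active."""
    links = {frozenset((a, b)) for a in adj for b in adj[a]
             if a not in lost and b not in lost}
    return len(links) * 2 * LINK_RATE_GBPS


def count_shortest_paths(adj, source, target, lost=frozenset()):
    """Return (hops, number of shortest paths), skipping lost nodes."""
    dist = {source: 0}
    ways = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt in lost:
                continue
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                ways[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                ways[nxt] += ways[node]  # every shortest route to node extends to nxt
    return dist.get(target), ways.get(target, 0)


adj = dodecahedron()
A, B = 0, 5
lost = {6}                               # one of the three neighbors of B

baseline = capacity_gbps(adj)
degraded = capacity_gbps(adj, lost)
print(baseline, degraded, degraded / baseline)   # prints: 60.0 54.0 0.9
print(count_shortest_paths(adj, A, B))           # prints: (5, 6)
print(count_shortest_paths(adj, A, B, lost))     # prints: (5, 4)
```

The shortest path length stays at 5 after the loss, which is the point made above: the user pair keeps its minimum latency, and only the margin of alternative routes shrinks.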
It should be noted that in this simple example all nodes are active in the baseline scenario. In practice more nodes/links/paths may be implemented to provide margin such that some number of nodes can be lost without significant impact to the system users. This example simply illustrates the impacts to resilience of this loss. If the required resilience in the previous example is 0.75 and the node loss results in a resilience of 0.9 then there is no real consequence. But it is important to understand how much margin, if any, is required and how to affordably implement it.
When attempting to estimate resilience, the system or mission capabilities of interest must first be identified. The resilience of each to the threat range is then analyzed. These capabilities may include system capacity and/or bandwidth, latency, supported user data rates, and connectivity. Each may be impacted by various types of threat effects: loss of a node, loss of an intersatellite link, or loss of a path between nodes (multi-link). An example of each can be shown for the constellation in Figure 1:
1) Loss of nodes: Previously it was shown how node loss can result in proportionally reduced system capacity. Here the resilience can be expressed simply in terms of architectural robustness, such that R = (Number of active nodes / Number of total nodes). So, losing one node out of the total of 20 yields a resilience of R = 19/20 = 0.95. By itself this provides some measure of impact, but does not capture all of the effects on the user. This approach does not capture the node’s more global impact on data transfer and so overestimates the resilience: 0.95, compared with the 0.9 obtained earlier from the capacity lost with the node’s three links.
2) Loss of links: Similarly, the loss of links implies loss of simple and/or complex routing paths in addition to system capacity. Strictly in terms of link loss, the estimated system resilience could be expressed as R = (Number of active links / Number of total links). For example, losing one node causes the loss of three links out of a total of 30, yielding a resilience of R = 27/30 = 0.9. Alternatively, the compromise of a single link between two adjacent satellites would result in a higher resilience number: R = 29/30 = 0.967. Again, this is a simplistic approach that does not necessarily provide a high-fidelity answer. If the intention is to measure the impact on resilience based on path diversity, then the link loss must first be translated into path loss. That is a much more complex calculation, and omitting that step results in an over-simplified metric which does not address connectivity directly. Link loss could be used to derive two metrics, one for capacity and one for connectivity, each resulting in a separate resilience value, but only if path loss is derived from link loss to faithfully represent the degradation to the network.
3) Loss of paths: As the fundamental purpose of the transport layer is to connect users, the impact on the routing paths throughout the constellation is perhaps a better estimate of system resilience, given that the advantage of this system topology is increased resilience through path diversity. It has already been established that for this example the chosen topology provides ample path diversity between any two nodes. The real question is how many links and/or nodes must be disabled before connectivity between any two nodes is disrupted. Clearly in this example the loss of a single node or link is insufficient to cause this to happen. It was shown previously that the loss of a single node adjacent to A or B represents the worst case for connectivity and results in a loss of 1/3 of the six shortest paths, leaving four. This does not result in an outage. Nor would the loss of two of the three adjacent nodes. Only the loss of all three (adjacent to either A or B) would isolate A from B. So, the resilience to a loss of either one or two adjacent nodes is R = 1 (assuming negligible time to detect the loss and route around it).
Likewise, the loss of another non-adjacent node or link can affect only a small number of paths, and once again the resilience is 1. The resilience remains 1 until the threat severity, in terms of the number of disabled nodes/links, reaches a threshold at which node-to-node connectivity is disrupted. Beyond that threshold there is a real reduction in the number of simultaneous users that the network can support. In addition, for this scenario the impact upon the number of available paths is not the same for all nodes. Some nodes will suffer a greater impact based upon their adjacency to other nodes in the system. For this reason, simply calculating the number of paths disrupted by the loss of nodes or links is not indicative of the actual impact to system connectivity and user utility. Take the example of losing a single node adjacent to the antipodal node, consider only the simple paths between the two nodes, and define R = (Number of available paths / Total number of paths). For this scenario, with four of the six shortest paths remaining, R = 2/3 = 0.667. In fact, for that specific case it can be shown that 1/3 of the paths will be lost regardless of path length, so for the cumulative paths up to length 10, R = 60/90 = 0.667. Though this is a “worst-case scenario” for the location of the disabled node, communications between A and B are unaffected since four paths of the same minimum length remain available. Resilience is actually 1 by a measure of utility, rather than by simply accounting for the percentage of available paths. In effect, the margin has been reduced but is still sufficient to sustain normal operations.
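The node-count and link-count ratios from items 1 and 2, and the connectivity threshold from item 3, can be computed side by side. This is a sketch under stated assumptions: the dodecahedron is built as the generalized Petersen graph GP(10, 2), A and B are nodes 0 and 5, and the helper names are hypothetical. The threshold is found by exhaustively testing every set of lost nodes (excluding A and B) in order of increasing size.

```python
from itertools import combinations


def dodecahedron():
    """Dodecahedral graph as GP(10, 2); node numbering is an assumption."""
    adj = {n: set() for n in range(20)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(10):
        link(i, (i + 1) % 10)
        link(i, 10 + i)
        link(10 + i, 10 + (i + 2) % 10)
    return adj


def surviving_links(adj, lost_nodes=(), lost_links=()):
    """Links that are not cut and whose two end nodes are both active."""
    cut = {frozenset(pair) for pair in lost_links}
    return {frozenset((a, b)) for a in adj for b in adj[a]
            if a not in lost_nodes and b not in lost_nodes
            and frozenset((a, b)) not in cut}


def connected(adj, source, target, lost):
    """True if target can still be reached from source around the lost nodes."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj[node]:
            if nxt not in lost and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def smallest_isolating_set(adj, a, b):
    """Smallest set of other nodes whose loss separates a from b (brute force)."""
    others = [n for n in adj if n not in (a, b)]
    for size in range(1, len(others) + 1):
        for lost in combinations(others, size):
            if not connected(adj, a, b, set(lost)):
                return set(lost)


adj = dodecahedron()
A, B = 0, 5
total_links = len(surviving_links(adj))

node_ratio = (len(adj) - 1) / len(adj)
link_ratio_node_loss = len(surviving_links(adj, lost_nodes={6})) / total_links
link_ratio_link_loss = len(surviving_links(adj, lost_links=[(0, 1)])) / total_links
threshold = len(smallest_isolating_set(adj, A, B))

print(round(node_ratio, 3), round(link_ratio_node_loss, 3),
      round(link_ratio_link_loss, 3), threshold)
# prints: 0.95 0.9 0.967 3
```

The ratio metrics move with every loss, while A-to-B connectivity is unchanged until three nodes are disabled, which illustrates why the ratios alone say little about user utility.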
Instead, consider a connectivity metric that measures the number of antipodal pairs of nodes that remain connected in the presence of a sophisticated threat vector. For any single lost node, the resilience is 1. For the worst-case threat of all three nodes adjacent to A or B being lost (Figure 2), the resilience could be expressed as R = (Number of active antipodal pairs connected / Total number of antipodal pairs). Using this metric, R = 9/10 = 0.9 for the loss of the three nodes adjacent to A or B. This approach could be extended to any pair of nodes regardless of level of adjacency (path length) to create a more comprehensive result. Note that the loss of 3/20 = 15% of the nodes results in only a 10% loss in resilience based upon the actual connectivity impact to the constellation. This is a worst-case scenario for a single user pair at a specific window of time, with maximum distance between source and destination nodes. As the constellation orbits, the impact of the lost nodes changes based on the rerouting of user data from A to the next node in view, for example. But given that sometimes the most achievable result is a bound on minimum resilience, this may suffice.
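The antipodal-pair metric can be sketched as follows. One counting convention has to be assumed, and it is an inference from the quoted values of 1 and 9/10 rather than something stated in the text: a pair is counted as disrupted only when both of its end nodes are still operating but no route remains between them. The GP(10, 2) construction, the node numbering, and the function names are likewise assumptions of this sketch.

```python
from collections import deque


def dodecahedron():
    """Dodecahedral graph as GP(10, 2); node numbering is an assumption."""
    adj = {n: set() for n in range(20)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(10):
        link(i, (i + 1) % 10)
        link(i, 10 + i)
        link(10 + i, 10 + (i + 2) % 10)
    return adj


def hop_counts(adj, source, lost=frozenset()):
    """Minimum hops from source to every node reachable around the lost nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in lost and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def antipodal_pairs(adj):
    """Pair each node with the node farthest from it in the intact graph."""
    pairs = set()
    for node in adj:
        dist = hop_counts(adj, node)
        pairs.add(frozenset((node, max(dist, key=dist.get))))
    return pairs


def pair_connectivity(adj, lost):
    """R = (pairs not disrupted) / (all antipodal pairs).

    Assumed convention: pairs containing a lost node are not counted as
    disrupted; only surviving pairs with no remaining route are.
    """
    pairs = antipodal_pairs(adj)
    disrupted = 0
    for pair in pairs:
        a, b = tuple(pair)
        if a in lost or b in lost:
            continue
        if b not in hop_counts(adj, a, lost):
            disrupted += 1
    return (len(pairs) - disrupted) / len(pairs)


adj = dodecahedron()
B = 5
print(len(antipodal_pairs(adj)))                    # prints: 10
print(pair_connectivity(adj, {6}))                  # prints: 1.0
print(pair_connectivity(adj, set(adj[B])))          # prints: 0.9
```

Under a stricter convention, in which any pair that includes a lost node is also counted as lost, the same function would return lower values, so the convention should be fixed before comparing results across threat scenarios.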
This example illustrates how resilience can be based upon a key network capability such as path diversity. And in aggregate the impact to the system can be more severe, affecting multiple pairs of users simultaneously as the number of disabled nodes or links increases. Clearly a more comprehensive and mathematically rigorous analysis is possible using high fidelity system models, actual threat characteristics, and a fair amount of computing power, particularly for large constellations. Based upon the nature of the suggested proliferated LEO architectures providing a mesh networking capability, the most promising approach to estimating resilience to threats to the space layer appears to be a method that considers both the loss of paths as well as the capacity associated with them, perhaps over a worst-case time frame that represents the system behavior to an individual user.
This is not in any sense an intractable problem, but it does require some study and nuance to achieve acceptable and useful estimates of resilience and to relate the impact of path diversity to user satisfaction in a threat environment. Back-of-the-envelope calculations may help bound the problem, but may still result in over- or under-design if the implications are not well understood.