The Internet has seen tremendous growth in the past decade and has now become the critical information infrastructure for both personal and business applications. It is expected to be survivable/resilient to any kind of failure as it is essential to our daily commercial, social and cultural activities. Service disruption for even a short duration could be catastrophic in the world of e-commerce, causing economic damage as well as tarnishing the reputation of a network service provider. In addition, many emerging services such as Voice over IP and VPNs for finance and other real-time business applications require stringent service availability and reliability.
Unfortunately, failures occur daily and for a variety of reasons, ranging from the most common short-term transient router interface faults, to occasional medium-term router crashes and reboots, to long-term catastrophic fiber cuts. Apart from hardware problems, software bugs and human errors also contribute substantially to these failures and degrade the quality of service (QoS) delivered to customers.
Resilience mechanisms are available at multiple network layers, for example, the Optical Transport Network (OTN), Synchronous Digital Hierarchy (SDH)/Synchronous Optical Network (SONET), IP and MPLS. Moreover, these mechanisms may operate in multiple layers at the same time. While recovery at the lower layers is generally faster, recovery at the higher layers (IP/MPLS) allows better resource efficiency, recovery granularity, and QoS granularity.
Aspects of this research field include:
- Resilience at multiple layers (e.g. SONET-SDH/Optical/IP/MPLS)
- Enhancements to IP routing protocol restoration mechanisms
- Improvements to MPLS survivability schemes through traffic engineering
- Algorithms for backup route provisioning
- Approaches to maintaining QoS during failures
- Consideration of single and multiple failures
- Survivability provisioning between Autonomous Systems (ASes)
More details about resilience and related schemes are described below.
There are two main schemes for providing network resilience: protection and restoration. In a protection scheme, backup resources (e.g. wavelengths, routes) are computed and reserved in advance, before any failure occurs. In a restoration scheme, backup resources have to be discovered dynamically once a failure occurs.
Protection and restoration schemes can be provided at different layers in an IP backbone. The choice of scheme depends on three criteria:
- Cause of failures: Lower layer resilience mechanisms cannot detect failures occurring at higher layers. For example, an optical protection mechanism cannot protect against the failure of an IP router or of its forwarding software. On the other hand, higher layer entities may be able to protect against (or recover from) lower layer failures as long as there is an alternative path between the communicating entities.
- Required recovery speed (restoration time): Dynamic mesh and ring restoration capabilities at the optical layer can recover from failures in less than 100 milliseconds. On the other hand, the traditional restoration mechanism at the IP layer is re-routing, which may take seconds.
- Provisioning cost: IP backbone operators need to invest in improving the reliability of their equipment, and also in provisioning spare capacity for use in the event of failures.
In the following we briefly describe resilience at the IP and MPLS layers.
IP routing is designed for robust operation and can re-establish connectivity after almost any failure of network elements. However, the reaction to a failure is not guaranteed to be sufficiently fast or resource-efficient. In fact, current restoration schemes cannot fulfill the newer and more stringent requirements posed by emerging services such as voice over IP. There are two main approaches for providing IP-based network resilience:
- Reactive: each router periodically probes adjacent routers and broadcasts a link status message on detecting a failure. Routers receiving the failure message re-compute their shortest paths. Typically, the network converges in a few tens of seconds. The drawbacks of this approach are possible routing loops, and consequently packet losses, during route convergence. Also, connectivity may be restored through congested paths, even when other links in the network are lightly loaded.
- Proactive: each router pre-computes a backup next hop for each destination, so that when a failure occurs, only the routers adjacent to it locally reroute packets to the pre-computed backup next hop. This approach does not need to wait for failure-message flooding and convergence, so the local reaction shortens the restoration time. Its drawbacks are the extra routing-table state, its restriction to single router or link failures, and the difficulty of extending it to all possible failures, since a pre-computation is needed for every failure scenario. Moreover, current backup next hop computations do not take congestion into account when traffic is re-routed.
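The two approaches can be contrasted with a minimal Python sketch. The four-router topology, the link weights and the function names are assumptions made for illustration only. The proactive part uses the loop-free alternate condition (a neighbor N of router S is a safe backup towards D if dist(N, D) < dist(N, S) + dist(S, D)), which is one possible way to pre-compute backup next hops and not necessarily the one used by any particular proposal.

```python
import heapq

def dijkstra(graph, source):
    """Return (distance, first hop) maps for shortest paths from source."""
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                first_hop[nbr] = nbr if node == source else hop
                heapq.heappush(heap, (nd, nbr, first_hop[nbr]))
    return dist, first_hop

def without_link(graph, a, b):
    """Copy of the topology with the bidirectional link a-b removed."""
    return {u: {v: w for v, w in nbrs.items() if {u, v} != {a, b}}
            for u, nbrs in graph.items()}

def reactive_next_hops(graph, router, failed_link):
    """Reactive: after the failure is flooded, re-run shortest paths."""
    return dijkstra(without_link(graph, *failed_link), router)[1]

def proactive_backup_next_hops(graph, router):
    """Proactive: before any failure, pick a loop-free alternate per destination."""
    inf = float("inf")
    dist = {n: dijkstra(graph, n)[0] for n in graph}
    primary = dijkstra(graph, router)[1]
    backups = {}
    for dest, hop in primary.items():
        candidates = [
            n for n in graph[router]
            if n != hop
            and dist[n].get(dest, inf) < dist[n][router] + dist[router][dest]
        ]
        # None means no loop-free alternate exists for this destination.
        backups[dest] = min(
            candidates,
            key=lambda n: graph[router][n] + dist[n][dest],
            default=None,
        )
    return backups

# Hypothetical topology: A reaches D via B (cost 2) or via C (cost 4).
graph = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}
reactive = reactive_next_hops(graph, "A", ("A", "B"))
backups = proactive_backup_next_hops(graph, "A")
```

In this toy topology router A has a pre-computed alternate (C) only for destination D. For destinations B and C no neighbor satisfies the loop-free condition, which illustrates why pre-computed backups do not necessarily cover every failure.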
MPLS is a technology that encapsulates packets at ingress Label Switch Routers (LSRs) by labeling them and forwards them along a Label Switched Path (LSP). MPLS enables fast and QoS-guaranteed restoration at the IP layer, and many survivability methods have been proposed for it. A fundamental and widely investigated consideration in the design of a survivable MPLS network is the creation of backup paths that protect the primary paths from failure while preserving the required QoS. The current proposals differ from each other in the speed of recovery, the amount of resources allocated to backup paths, and the complexity of configuration and signaling. They follow one of two schemes:
- Protection: in this scheme, once the primary path is routed between the source and the destination, the backup path is also provisioned so that it can forward the traffic if the primary path fails.
- Restoration: in this scheme, a primary path is first set up between the source and the destination, and once a failure occurs, a backup path is discovered dynamically to restore the traffic.
Protection and Restoration Comparison in MPLS-based networks:
Protection schemes are inherently more expensive since resources must be committed without a priori knowledge of the next failure. The cost of protection increases with the scale of failure an ISP seeks protection against. In this respect, restoration is more cost effective since additional resources are allocated only after a failure occurs. However, restoration mechanisms offer slower recovery because of the delay involved in finding these resources after the failure has been detected. Also, the discovered backup path may not be able to guarantee the QoS requirements, and there is no guarantee that sufficient resources exist in the network at all.
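The trade-off can be made concrete with a small Python sketch. The two-path topology, the capacities, the 6-unit demand and all function names are assumptions chosen for illustration, and a fewest-hop search stands in for whatever constraint-based routing a real network would use.

```python
from collections import deque

def link(u, v):
    """Undirected link identifier."""
    return frozenset((u, v))

def links_of(path):
    return [link(u, v) for u, v in zip(path, path[1:])]

def find_path(spare, src, dst, demand, banned=()):
    """Fewest-hop path whose links all have at least `demand` spare capacity."""
    banned = set(banned)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for l, capacity in spare.items():
            if node in l and l not in banned and capacity >= demand:
                (nbr,) = l - {node}
                if nbr not in parent:
                    parent[nbr] = node
                    queue.append(nbr)
    return None  # no feasible path

def reserve(spare, path, demand):
    for l in links_of(path):
        spare[l] -= demand

def new_network():
    # Hypothetical topology: two parallel two-hop paths from S to T.
    return {link("S", "A"): 10, link("A", "T"): 10,
            link("S", "B"): 10, link("B", "T"): 10}

# Protection: the backup is computed and reserved when the LSP is set up.
spare = new_network()
primary = find_path(spare, "S", "T", 6)
reserve(spare, primary, 6)
protection_backup = find_path(spare, "S", "T", 6, banned=links_of(primary))
reserve(spare, protection_backup, 6)
# The committed backup capacity now blocks a second 6-unit LSP.
second_lsp_under_protection = find_path(spare, "S", "T", 6)

# Restoration: only the primary is reserved, so a second LSP still fits ...
spare = new_network()
primary = find_path(spare, "S", "T", 6)
reserve(spare, primary, 6)
second_lsp = find_path(spare, "S", "T", 6)
reserve(spare, second_lsp, 6)
# ... but when link A-T fails, the search for a backup finds no capacity left.
restored = find_path(spare, "S", "T", 6, banned=[link("A", "T")])
```

With protection the backup path is known before the failure, at the price of rejecting the second LSP. With restoration both LSPs are admitted, but the first one cannot be recovered after the failure because the remaining capacity is insufficient.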
Recovery scopes in MPLS-based networks:
The type of backup path depends on which router along the primary path takes the rerouting decision; this is called the recovery scope. Here, we explain only two recovery scopes: global and local.
- Global Scope: In this model, the ingress node takes responsibility for fault recovery when a Fault Indication Signal (FIS) arrives (any message sent to indicate that a failure has occurred, or that causes a recovery action, is called a FIS). This method requires the establishment of an alternate disjoint backup path for each primary path. In global scope, recovery is always activated at the ingress node, irrespective of where a failure occurs along the primary path. The advantages of this scope are, firstly, that the backup path can be selected from links anywhere in the network, so the spare network resources are used efficiently, and secondly, that only one backup path needs to be set up per primary path. However, since a FIS has to be propagated all the way back to the ingress node, this method has a long recovery time and high packet loss.
- Local Scope: In this model, the LSR at the head of the failed link switches the traffic from the broken link to the backup path. Each backup path therefore protects only a part of the primary path. Since a FIS is not needed, this scope has a faster recovery time and reduced packet loss in comparison to Global scope. On the other hand, creation and maintenance of multiple backup segments are required, resulting in inefficient utilization of resources and increased complexity.
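The difference between the two scopes can be sketched in Python as follows. The ladder topology, the node names and the helper functions are assumptions for illustration. The local detours are computed here from the node upstream of each link to the egress, which is one of several possible designs, and a fewest-hop search is used as the path computation.

```python
from collections import deque

def shortest_path(adj, src, dst, banned_links=()):
    """Fewest-hop path from src to dst that avoids the banned links."""
    banned = {frozenset(l) for l in banned_links}
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in parent and frozenset((node, nbr)) not in banned:
                parent[nbr] = node
                queue.append(nbr)
    return None

def global_backup(adj, primary):
    """Global scope: one link-disjoint path from ingress to egress."""
    primary_links = list(zip(primary, primary[1:]))
    return shortest_path(adj, primary[0], primary[-1], primary_links)

def local_backups(adj, primary):
    """Local scope: one detour per primary link, starting at its upstream LSR."""
    egress = primary[-1]
    return {(u, v): shortest_path(adj, u, egress, [(u, v)])
            for u, v in zip(primary, primary[1:])}

def fis_hops_to_ingress(primary, failed_link):
    """Hops a FIS travels from the LSR upstream of the failure to the ingress."""
    return primary.index(failed_link[0])

# Hypothetical ladder topology: I-R1-R2-E on top, I-X1-X2-E underneath.
adj = {
    "I": ["R1", "X1"],
    "R1": ["I", "R2", "X1"],
    "R2": ["R1", "E", "X2"],
    "E": ["R2", "X2"],
    "X1": ["I", "R1", "X2"],
    "X2": ["X1", "R2", "E"],
}
primary = ["I", "R1", "R2", "E"]
backup = global_backup(adj, primary)
detours = local_backups(adj, primary)
hops = fis_hops_to_ingress(primary, ("R2", "E"))
```

For this primary path, global scope needs a single backup path but a failure of link R2-E must be signalled two hops back to the ingress, whereas local scope maintains three detours and the LSR at R2 can switch traffic immediately.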
Currently, several proposals attempt to improve the existing IP-based and MPLS-based resilience mechanisms.
- G. Iannaccone et al., "Feasibility of IP Restoration in a Tier-1 Backbone," IEEE Network Magazine, vol. 18, no. 2, 2004, pp. 13-19.
- J. Zhang et al., "A Review of Fault Management in WDM Mesh Networks: Basic Concepts and Research Challenges," IEEE Network Magazine, vol. 18, no. 2, 2004, pp. 41-48.
- C. Huang et al., "Building Reliable MPLS Networks Using a Path Protection Mechanism," IEEE Communications Magazine, vol. 40, no. 3, 2002, pp. 156-162.
- M. Kodialam et al., "Dynamic Routing of Restorable Bandwidth-Guaranteed Tunnels Using Aggregated Network Resource Usage Information," IEEE/ACM Transactions on Networking, vol. 11, no. 3, 2003, pp. 399-410.
- E. Calle et al., "Protection Performance Components in MPLS Networks," Computer Communications, 2004.
- S. Rai et al., "IP Resilience within an Autonomous System: Current Approaches, Challenges, and Future Directions," IEEE Communications Magazine, vol. 43, no. 10, 2005.
- A. Autenrieth et al., "Engineering End-to-End IP Resilience Using Resilience-Differentiated QoS," IEEE Communications Magazine, vol. 40, no. 1, 2005, pp. 50-57.
- S. Lee et al., "Proactive vs Reactive Approaches to Failure Resilient Routing," IEEE International Conference on Computer Communications (INFOCOM), 2004.
- S. Norden et al., "Routing Bandwidth-Guaranteed Paths with Restoration in Label-Switched Networks," Computer Networks, vol. 46, no. 2, 2004, pp. 197-218.
- S. Pasqualini et al., "MPLS Protection Switching Versus OSPF Rerouting," IEEE International Workshop on Quality of Service (IWQoS), 2005.
- Y. Bejerano et al., "Algorithms for Computing QoS Paths with Restoration," IEEE/ACM Transactions on Networking, vol. 13, no. 3, 2005.
- J.L. Marzo, "QoS Online Routing and MPLS Multilevel Protection: A Survey," IEEE Communications Magazine, vol. 41, no. 10, 2003, pp. 126-132.
- A. Markopoulou et al., "Characterization of Failures in an IP Backbone Network," IEEE International Conference on Computer Communications (INFOCOM), 2004.
- A. Nucci et al., "IGP Link Weight Assignment for Transient Link Failures," Elsevier ITC 18, 2003.
- B. Fortz, "Optimizing OSPF/IS-IS Weights in a Changing World," IEEE Journal on Selected Areas in Communications (JSAC), vol. 20, no. 4, 2002, pp. 756-767.
- A. Sidharan et al., "Making IGP Routing Robust to Link Failures," IFIP Networking, 2005.
- W. Cui et al., "Backup Path Allocation on a Correlated Link Failure Probability Model in Overlay Networks," IEEE International Conference on Network Protocols (ICNP), 2002.
- M. Kodialam et al., "Restorable Dynamic QoS Routing," IEEE Communications Magazine, vol. 40, no. 6, 2002, pp. 72-81.
- European Network of Excellence for the Management of Internet Technologies and Complex Services (EMANICS)
Below is a brief list of journals, conferences and technical societies related to survivability/resilience. For additions/updates please contact the webmaster.