Introduction

The Domain Name System (DNS) is a hierarchical, distributed naming system for computers, services, and resources connected to the Internet. The hierarchical naming scheme of DNS starts with so-called top-level domains (TLDs), e.g., country-code top-level domains (ccTLDs) and generic top-level domains (gTLDs). Requirements for operating a TLD include high availability and low latency, amongst others.

High availability can be achieved by distributing services. For DNS, this implies that the set of name servers for the TLD is distributed, both geographically and topologically (with respect to the Internet). By distributing and strategically positioning DNS name servers for a TLD, one can also reduce the average latency for clients resolving a name in the TLD zone, which contributes to a fast "Internet experience" for end users.

Distributing DNS name servers can be achieved by different methods, but for TLDs one specific method has many advantages: DNS anycast addressing and routing. With anycast routing, one can take advantage of the robustness of the BGP routing infrastructure, where the same server IP address exists in multiple locations, possibly on different continents, to provide a decentralized service. While conceptually simple, namely the simultaneous announcement of an IP address (range) from different networks on the Internet, it is not trivial to implement, in particular if latency, robustness, and resilience are taken into account when selecting the locations of anycast nodes.

An excellent document describing operational considerations for DNS anycast is "Operation of Anycast Services", by J. Abley and K. Lindqvist [1]. From this document it becomes clear that many (dynamically changing) parameters have to be reviewed when deciding where and how to place anycast nodes on the Internet. For a dynamic system like the Internet, with varying traffic and DNS query load, it is difficult to find the optimal distribution of name server nodes. Over-provisioning is standard practice, but in case of incidents, whether accidental or with malicious intent, DNS traffic should be redirected to adapt to the new situation. The BGP routing infrastructure may maintain connectivity as the network changes, but for optimality of service, other parameters must also be taken into account.

This project proposal focuses on solutions for dynamic DNS anycast services that deal with changes in Internet connectivity, DNS query traffic, and other factors influencing the service in terms of availability, performance, and possibly security. While optimizing for these quality-of-service aspects, the operational costs must also be considered. To achieve these operational performance and cost goals, we believe an automated management system potentially offers the best course of action. We call this concept Self-Managing Anycast Networks for the DNS (SAND).

The main objectives of the project are twofold. The first objective is the development of a graph-theoretic approach for optimal placement of DNS anycast nodes, given the Internet topology and a number of operational performance and cost parameters. The result is a SAND "node placement graph", which is a description of the anycast network for one specific snapshot of the Internet's state. The second objective of the project is to design, develop, and evaluate the SAND system, which adds self-management capabilities to existing DNS anycast services. The three responsibilities of the SAND system are: (i) monitor the pivotal performance parameters of the DNS anycast services, (ii) continuously and dynamically recalculate node placement graphs, and (iii) dynamically instantiate new anycast nodes in the form of virtual machines, using the node placement graphs and the capabilities of the parties that are able to host nodes. We expect that the SAND system will come with a tool that runs on a longer time scale and that uses (i) and (ii) to determine the locations in the network that require a physical anycast node rather than a virtual one.

The resulting SAND-based DNS anycast infrastructure provides self-management capabilities by optimizing operational performance and costs, and improves on security and denial-of-service resilience.

Research Problem

BGP anycast is used to achieve high availability of global DNS services. By the nature of BGP anycast, the query load is distributed over the name servers according to the "catchment" determined by the BGP routing protocol [1]. The catchment of a DNS name server is the topological region of a network within which packets directed to an anycast address are routed to one particular node. By strategically placing anycast nodes in the Internet topology, the average network latency can be decreased and load-sharing of DNS queries over the anycast nodes can be achieved. The robustness of the BGP protocol in case of network problems ensures that packets are automatically rerouted to the best alternative anycast node. This best alternative is determined by BGP routing metrics, which are not necessarily in line with DNS performance metrics.
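The notion of catchment can be illustrated with a toy model. The sketch below assumes, as a deliberate simplification, that BGP always prefers the path with the fewest AS hops; real BGP decisions also depend on local preference, routing policies, and further tie-breakers. The topology, the AS names, and the functions `hop_counts`, `catchments`, and `mismatched` are hypothetical and serve only this illustration.

```python
from collections import deque

# Hypothetical AS-level topology as an undirected adjacency list.
TOPOLOGY = {
    "AS1": ["AS2", "AS3"],
    "AS2": ["AS1", "AS4"],
    "AS3": ["AS1", "AS4", "AS5"],
    "AS4": ["AS2", "AS3"],
    "AS5": ["AS3"],
}


def hop_counts(topology, source):
    """Breadth-first search: AS hop count from source to every reachable AS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in topology[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def catchments(topology, anycast_nodes):
    """Map each anycast node to the set of ASes whose shortest AS path ends there.

    Ties are broken by node name, a stand-in for BGP's real tie-breakers.
    """
    dist = {node: hop_counts(topology, node) for node in anycast_nodes}
    result = {node: set() for node in anycast_nodes}
    for asn in topology:
        reachable = [n for n in anycast_nodes if asn in dist[n]]
        if reachable:
            best = min(reachable, key=lambda n: (dist[n][asn], n))
            result[best].add(asn)
    return result


def mismatched(catchment, latency_ms):
    """ASes whose routing-selected node is not their lowest-latency node."""
    out = []
    for node, members in catchment.items():
        for asn in sorted(members):
            if asn in latency_ms:
                fastest = min(latency_ms[asn], key=latency_ms[asn].get)
                if fastest != node:
                    out.append(asn)
    return out


if __name__ == "__main__":
    regions = catchments(TOPOLOGY, ["AS2", "AS5"])
    print(regions)
    # Assumed measurements: AS4 is routed to AS2 but would be served faster by AS5.
    print(mismatched(regions, {"AS4": {"AS2": 40.0, "AS5": 15.0}}))
```

In this model, a client AS can end up in the catchment of a node that is close in AS hops but slow in measured latency, which is the kind of mismatch between routing metrics and DNS performance metrics described above.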

Operating a consistent DNS anycast infrastructure is a complex task. Because the infrastructure is distributed, availability and service may vary according to the location of the client (observer). The infrastructure should be closely monitored to verify that a consistent service is provided; typical service parameters include availability, response times (a function of average latency, server load, etc.), and data synchronization, but might also include trust in correct operation (e.g., abuse of high-profile names in a zone, service compromise, or service hijacking). To place an anycast node in such a way that it provides an optimal catchment for a certain topological region, one needs a node with low-latency, uncongested paths. The node must be able to handle the full, global load of client requests, which means it must be well connected and internally resistant to failures. The load-sharing is a coarse and not necessarily balanced distribution of load across anycast nodes, but it allows the infrastructure to scale to an increased number of queries and to accommodate transient query peaks. In controlling operational costs, a trade-off in deploying an anycast node is the decision to run the node on its own dedicated server in a network (possibly paid for by the DNS anycast operator), or on a virtual server in a network, for example making use of Infrastructure as a Service (IaaS) offerings from Akamai, Amazon, or Cloudflare.

The document "Operation of Anycast Services" [1] lists a number of other considerations in the design and deployment of an anycast service, including signaling service availability if more than one service is running on addresses that are covered by the anycast prefix, and assessing the routing policies of other people's networks. Node autonomy and self-sufficiency determine the degree to which nodes can survive failures elsewhere, and hence preclude cascading failures. All this adaptivity to a changing environment, and the robustness to deal with unexpected events, comes with added complexity in infrastructure management, which this project aims to automate as much as possible.

Overall Concept

In the past decade, various data distribution schemes have been proposed and studied in academia, and a number of them have been deployed in the real world. One of the most popular schemes is peer-to-peer data distribution, which is scalable, robust to failures, and enables load-sharing and reduced average packet latency. Although a peer-to-peer approach can be an interesting methodology, the results of the project need to be practically deployable and operational. This requires a solution that fits with current standard practices and thus extends the practice of anycast routing. Although the requirement of practical operational deployment might limit the exploration space, it is characteristic of an applied research project such as this one.

The main challenge in providing a DNS anycast infrastructure is finding optimal topological locations for the placement of anycast nodes. An important problem to solve is how to map DNS performance metrics to BGP routing information, or vice versa, depending on the starting point of the optimization cycle. To be effective, one has to take into account the dynamic behavior of the network (e.g., BGP routing, congestion, ...) and DNS server load and security (e.g., flash crowds, DDoS, ...). This translates into dynamically enabling and disabling anycast nodes at diverse topological locations to offload traffic and to maintain a consistent DNS service.
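To make the placement problem concrete, the following minimal sketch treats it as a small facility-location instance and solves it with a greedy heuristic. This is a stand-in technique chosen for illustration, not the graph-theoretic approach the project sets out to develop. It also assumes that every client region reaches its lowest-latency site, which BGP routing does not guarantee. The region names, site names, latency values, and query loads are invented.

```python
# Hypothetical measurements: latency in ms from each client region to each candidate site.
LATENCY_MS = {
    "eu": {"ams": 10, "nyc": 90, "sin": 180},
    "us": {"ams": 90, "nyc": 15, "sin": 200},
    "asia": {"ams": 170, "nyc": 210, "sin": 20},
}

# Hypothetical query load (queries per second) per client region.
QUERY_LOAD = {"eu": 500, "us": 300, "asia": 200}


def total_cost(latency_ms, query_load, sites):
    """Load-weighted latency when each region uses its lowest-latency site."""
    return sum(
        load * min(latency_ms[region][site] for site in sites)
        for region, load in query_load.items()
    )


def greedy_placement(latency_ms, query_load, candidates, k):
    """Select up to k sites, each time adding the site that lowers the cost most."""
    chosen = []
    for _ in range(min(k, len(candidates))):
        remaining = [c for c in candidates if c not in chosen]
        # Ties on cost are broken by site name to keep the result deterministic.
        best = min(
            remaining,
            key=lambda c: (total_cost(latency_ms, query_load, chosen + [c]), c),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    sites = greedy_placement(LATENCY_MS, QUERY_LOAD, ["ams", "nyc", "sin"], k=2)
    print(sites, total_cost(LATENCY_MS, QUERY_LOAD, sites))
```

A greedy heuristic is not guaranteed to find the optimal set of sites. It is used here only because it is short and shows how performance measurements and a node budget combine into a placement decision that can be recomputed whenever the measurements change.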

Our aim is to extend existing DNS anycast networks with the SAND system, which provides self-management capabilities under dynamically changing conditions. Such a SAND-enabled service must not degrade the robustness of DNS anycast, and hence should itself be distributed and autonomous in order to provide the same or higher levels of scalability and robustness. Distribution and autonomy in management are two central concepts in autonomic computing and networking [2]. In the autonomic computing model, there are a number of self-managing tasks, which are typically classified as:
  • self-configuration: flexibility, adaptability, ...
  • self-optimization: load-sharing, reduce latency, ...
  • self-healing: recover from failures, ...
  • self-protection: security, integrity, trust, ...
The central idea in self-management is the monitor, analyze, plan, and execute loop (MAPE loop). This feedback loop constantly monitors the managed system. Depending on the objectives, it is important that the appropriate operational parameters are monitored with sufficient detail and semantics for use in the subsequent analysis phase. In the analysis phase, the readings of the monitoring module are interpreted to determine whether the values are within the specified operational modes. If some values are out of bounds, root cause analysis is required to determine the underlying reason for the malfunction. Given the root cause, the plan phase derives a series of actions to repair the malfunction and steer the system back into its proper mode of operation. The execution phase puts the plan into action and orchestrates the individual steps in the plan across the distributed components of the self-managing infrastructure. To close the feedback loop, the monitoring module observes the system to evaluate the effect of the changes made and, if necessary, acts upon this to converge to a stable, desired mode of operation. In each step of the feedback loop, knowledge about the infrastructure is used to analyze, plan, and execute actions. Therefore, the feedback loop is also called MAPE-K, as knowledge is central to its operation.
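The feedback loop described above can be sketched in a few lines. The example is a toy model under stated assumptions: the thresholds, the symptom and action names, and the effect of each action on the node state are all invented, and root cause analysis is reduced to a single rule that treats overload as the cause of high latency.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledge: operational bounds plus a history of readings."""
    max_latency_ms: float = 50.0
    max_load: float = 0.8
    history: list = field(default_factory=list)


def monitor(node_state, knowledge):
    """Take a reading of the managed node and record it in the knowledge base."""
    reading = {"latency_ms": node_state["latency_ms"], "load": node_state["load"]}
    knowledge.history.append(reading)
    return reading


def analyze(reading, knowledge):
    """Compare the reading with the specified bounds and list the symptoms."""
    symptoms = []
    if reading["latency_ms"] > knowledge.max_latency_ms:
        symptoms.append("high_latency")
    if reading["load"] > knowledge.max_load:
        symptoms.append("overload")
    return symptoms


def plan(symptoms):
    """Derive actions; overload is assumed to be the root cause of high latency."""
    if "overload" in symptoms:
        return ["add_node"]
    if "high_latency" in symptoms:
        return ["reroute"]
    return []


def execute(actions, node_state):
    """Apply the planned actions; the effects below are invented toy effects."""
    for action in actions:
        if action == "add_node":
            node_state["load"] /= 2
        elif action == "reroute":
            node_state["latency_ms"] *= 0.5


def mape_k(node_state, knowledge, max_iterations=10):
    """Run the loop until no symptoms remain; return the number of repair rounds."""
    for iteration in range(max_iterations):
        symptoms = analyze(monitor(node_state, knowledge), knowledge)
        if not symptoms:
            return iteration
        execute(plan(symptoms), node_state)
    return max_iterations


if __name__ == "__main__":
    state = {"latency_ms": 120.0, "load": 0.9}
    print(mape_k(state, Knowledge()), state)
```

Each pass through `monitor` after an `execute` step closes the loop: the effect of the previous action is observed, and the loop stops only once all values are back within the specified bounds or the iteration limit is reached.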

The MAPE-K component of the system can be centralized or completely distributed. Figure 1 gives an example of a two-layer hierarchical self-management system applied to a SAND-enabled DNS anycast network. Local MAPE-K modules monitor, analyze, plan, and execute for a single anycast node. If a malfunction can be solved locally, planning and execution act only upon that single anycast node. If a more coordinated action is necessary to solve a problem, information (raw, analyzed, interpreted, ...) is sent to the global MAPE-K module, which makes strategic and coordinated plans. Thus, the global MAPE-K module does not deal with details that can be solved locally, which is necessary to run a scalable infrastructure. This also allows us to break down the complexity into local, detailed knowledge at an anycast node and global, strategic knowledge at the global MAPE-K module.
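The division of labor between local and global modules can be sketched as follows. This is an illustrative model only: the class names `LocalManager` and `GlobalManager`, the notion of spare workers as a local remedy, and the action name `add_anycast_node_near` are assumptions made for this example and are not part of the SAND design.

```python
class LocalManager:
    """Local MAPE-K module for one anycast node."""

    def __init__(self, node_id, spare_workers, max_load=0.8):
        self.node_id = node_id
        self.spare_workers = spare_workers
        self.max_load = max_load
        self.local_actions = []

    def step(self, load):
        """Run one local cycle; return a report for the global module, or None."""
        if load <= self.max_load:
            return None  # within bounds, nothing to do
        if self.spare_workers > 0:
            # The problem can be solved locally, so the global module is not involved.
            self.spare_workers -= 1
            self.local_actions.append("start_spare_worker")
            return None
        # No local remedy left: escalate an interpreted summary, not raw data.
        return {"node": self.node_id, "symptom": "overload", "load": load}


class GlobalManager:
    """Global MAPE-K module: plans coordinated actions from escalated reports."""

    def coordinate(self, reports):
        # The most heavily loaded nodes are handled first.
        ordered = sorted(reports, key=lambda report: -report["load"])
        return [("add_anycast_node_near", report["node"]) for report in ordered]


def run_cycle(local_managers, loads, global_manager):
    """One management cycle: local modules act first, the remainder is escalated."""
    reports = []
    for manager in local_managers:
        report = manager.step(loads[manager.node_id])
        if report is not None:
            reports.append(report)
    return global_manager.coordinate(reports)


if __name__ == "__main__":
    managers = [LocalManager("ams", 1), LocalManager("nyc", 0), LocalManager("sin", 0)]
    loads = {"ams": 0.95, "nyc": 0.9, "sin": 0.3}
    print(run_cycle(managers, loads, GlobalManager()))
```

In this sketch the global module only sees the node that has exhausted its local options, which reflects the scalability argument: local detail stays local, and only problems that require coordination reach the global level.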


Figure 1. SAND-enabled DNS anycast network using MAPE-K: global (left) and local (right).

With a flexible and adaptive anycast infrastructure that can proactively plan for or reactively act on changes in the system, a sufficient degree of freedom is required to find solutions to problems. That is, if one can only place anycast nodes at a limited number of topological locations, either the solution space is restricted or the number of locations needs to be extended (e.g., through additional contracts with organizations able to host DNS anycast nodes). Thus, the deployment of SAND-enabled anycast infrastructures requires a "rich" set of available topological locations. The locations could consist of dedicated servers owned by the organization running the anycast infrastructure, virtual servers offered as part of an IaaS by a commercial party, or a combination thereof. With the proliferation of cloud computing, we expect the availability of such IaaS facilities to grow in the near future. Thus, for practical deployment, there should be an interplay between the dynamic determination of optimal (strategic) anycast node locations and the mapping of these locations to infrastructure providers (either hardware servers or virtualized IaaS).

Impact

The SAND system reduces the complexity of managing a DNS anycast infrastructure, and provides the flexibility and adaptability to act upon changes in the network and in DNS client behavior (flash crowds, DDoS, etc.). In its operation, the system can also reduce operational costs for a DNS anycast operator: it not only adds or moves anycast nodes, but can also shut down nodes when usage patterns indicate that they are not needed, which reduces costs while performance metrics remain within specified bounds.

For the end user, an adaptive DNS anycast infrastructure potentially improves the responsiveness for DNS queries, but more importantly, it will improve availability of the DNS service under security and DDoS attacks. This improves the "Internet experience" of end-users, which stimulates the use of the Internet and increases its value.

Another positive impact will be on the trustworthiness of the DNS anycast service. DNS anycast nodes can reside in various countries, each with its own national regulations. By strictly monitoring the anycast nodes and their catchment, the SAND infrastructure can ensure that anycast nodes do not serve clients outside a country where national regulations apply.

For service providers, a SAND-based DNS anycast infrastructure provides additional availability across different locations. The benefits of this are similar to those for end users of the service, i.e., improved availability of the service under DNS security and DDoS attacks.