The idea behind “SPoF,” or “Single Point of Failure,” is that if one part of a system fails, then the entire system fails. It’s not desirable. In IT and security circles, if a system or application can be disrupted or degraded severely by the failure of just one component or subcomponent, then we usually deem the design to have a flaw.
This brings us to the SPoF that is DNS (Domain Name System). DNS is the digital phonebook of the Internet: it maps human-readable domain names to IP addresses. For example, at the time of writing, www.facebook.com resolves to an IP address of 126.96.36.199. To serve a website, computers and routers need to reach an IP address, but humans can’t (and shouldn’t have to) remember a long series of numbers and dots every time we want to do anything online. Instead, we mortals type in a domain name consisting of words, like facebook.com, and DNS servers convert it to the IP address. While DNS is a fundamental and critical element of how the Internet works, it also sits at the root of many incident investigations, where it exposes failures of design, insufficient testing, or inadequate documentation.
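That name-to-address lookup can be sketched in a few lines of Python using only the standard library; the hostnames here are just examples, and any name will do:

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})

# "localhost" resolves via the local hosts file even with no network.
# A name like "www.facebook.com" instead goes out to DNS servers --
# exactly the step that breaks for every user when DNS itself is down.
print(resolve("localhost"))  # ['127.0.0.1'] on most systems
```

When DNS is unreachable, a call like this raises an error for every public hostname, even though the servers behind those names may be perfectly healthy.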
To illustrate my point that DNS has been and continues to be a SPoF, I refer to a memorable incident that occurred on October 4th, 2021. On that Monday (of course it was a Monday), a sizable percentage of the world’s estimated 4.9 billion Internet users were impacted by a single change that went badly for Facebook engineers as they introduced a configuration change to their platform’s infrastructure. Ironically, the change was probably intended to bring an additional degree of resilience to their DNS infrastructure and social media platforms.
Here’s what happened: a single mistake was introduced into Facebook’s BGP routing rules and tables. (BGP, or Border Gateway Protocol, is the protocol that the large networks making up the Internet use to tell one another which address ranges they can reach, so that data can find a path from your laptop or workstation to other laptops, workstations, and servers.) As a result, all of Facebook went poof out of existence in the blink of an eye: once its routes were withdrawn, the rest of the Internet no longer knew how to reach Facebook’s DNS servers. The misconfiguration took WhatsApp and Instagram with it too, as those services and applications also depended on the same core Facebook DNS infrastructure.
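To make the mechanics concrete, here is a toy sketch of longest-prefix routing in Python; the prefixes and next-hop names are hypothetical, purely for illustration. Once a prefix is withdrawn, traffic for those addresses falls back to whatever covering route remains, or nowhere at all:

```python
import ipaddress

# A toy forwarding table: prefix -> next hop.
# All prefixes and hop names here are made up for illustration.
routes = {
    ipaddress.ip_network("157.240.0.0/16"): "peer-A",  # a "Facebook-like" prefix
    ipaddress.ip_network("0.0.0.0/0"): "upstream",     # default route
}

def next_hop(ip: str) -> str:
    """Pick the most specific matching prefix, as a router does."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in routes if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(next_hop("157.240.1.35"))  # -> 'peer-A'

# A BGP withdrawal removes the announcement; traffic for those
# addresses now falls through to the default route (whose owner will
# simply drop it), or is unreachable if no covering route exists.
del routes[ipaddress.ip_network("157.240.0.0/16")]
print(next_hop("157.240.1.35"))  # -> 'upstream'
```

The point of the sketch: when the specific route disappears, packets don’t “find another way” to the right destination. They go wherever the remaining routes say, which in the Facebook case was nowhere useful, including for the DNS servers themselves.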
What was especially striking about this outage was its duration. Normally, change-control documentation includes a rollback plan in the event that the change does not go as expected. However, complications emerged due to well-intended (but, in hindsight, flawed) design and security considerations. For starters, all of Facebook’s network management tools and applications were also suddenly unavailable and unreachable, so the first responders on the on-call team had no clue what was working and what wasn’t; it appeared that nothing was working. Not a good day to be on call, I should imagine. Even if you had memorized the IP addresses of the systems that needed to be reached in order to reverse the configuration change, no packets could reach those systems, due to the nature of the change itself.

In an almost comical touch, it’s been reported that someone had to drive to a Home Depot near one of the data centers to buy an angle grinder to cut open a data center cage door. Why? Because the desire to harden and secure the systems behind that door had led the company away from physical keys, and, as you might now be able to guess, the badge reader for opening the door with a keycard depended on DNS. Because not all of the engineers near the data center were knowledgeable about BGP configurations or had access to the servers, the outage dragged on. So that day, social media users, advertisers, and influencers were forced to take a timeout from promoting their various wares on Facebook, WhatsApp, and Instagram for about six hours.
This was not the first time that DNS going down caused an outage, and it will surely not be the last. Even the most cautious and diligent network architects and engineers miss things sometimes, but they should take heed and learn from these and other DNS failure examples. Your organization may have created a robust and fault-tolerant DNS design with multiple servers running on discrete networks in geographically dispersed locations. But if you have not accounted for BGP as a point of failure, then you are still at risk of an outage (or of a BGP hijacking attack).
So what can you do to protect your business from DNS failures, both spectacular and mundane? I suggest taking the following steps:
Address the “easy stuff” with regard to proper DNS configurations for SPF, DMARC, and DKIM records. There are literally millions of exploitable domains and DNS servers showing up in SecurityScorecard’s ratings platform. (We scan all of IPv4 every day.) The observed misconfigurations are easy to fix, and you can download our issues report for your company’s digital footprint for free.
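For reference, here is roughly what well-formed records of each type look like in a zone file. The domain, addresses, policies, and the “s1” DKIM selector below are all placeholders, not recommendations for your environment:

```text
; Illustrative email-authentication TXT records (placeholder values).
example.com.               IN TXT "v=spf1 ip4:203.0.113.0/24 include:_spf.example.net -all"
_dmarc.example.com.        IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
s1._domainkey.example.com. IN TXT "v=DKIM1; k=rsa; p=<base64-encoded public key>"
```

The common, easily fixed misconfigurations are along these lines: an SPF record ending in a permissive `+all`, a missing `_dmarc` record (or one with `p=none` and no reporting address), or DKIM selectors that were never published.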
Make a point of inspecting the DNS health and security posture of your core service providers and third-party vendors. Their lack of attention to the SPoF that is DNS can disrupt your business availability too.
Look into introducing DNSSEC, which strengthens the authentication of DNS using digital signatures based on public key cryptography. That will make it harder for the bad guys to hijack your traffic and impersonate your services, as was the case with a recent incident involving cryptocurrency theft.
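At the record level, DNSSEC adds a small set of record types that form a chain of trust from the parent zone down to your answers. The fragment below is illustrative only, with placeholder key material; algorithm 13 here is ECDSA P-256 with SHA-256:

```text
; Illustrative DNSSEC records (key data is a placeholder, not a real key).
example.com.  IN DNSKEY 257 3 13 <base64 public key>   ; key-signing key (KSK)
example.com.  IN DNSKEY 256 3 13 <base64 public key>   ; zone-signing key (ZSK)
example.com.  IN DS     12345 13 2 <SHA-256 digest of the KSK>  ; published in the .com zone
example.com.  IN RRSIG  A 13 2 3600 ( ... )            ; signature over the A record set
```

A validating resolver checks the DS record in the parent against the child’s DNSKEY, then verifies the RRSIG signatures; you can spot-check a domain with `dig +dnssec <name>` and look for the AD (Authenticated Data) flag in the response from a validating resolver.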
Make sure that you have at least two different DNS providers that are served from different Autonomous System Numbers (ASNs). You can look up the ASN of any IP address using this page from the great folks at Team Cymru: https://asn.cymru.com/
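Team Cymru also exposes this IP-to-ASN mapping over DNS itself: you reverse the IPv4 octets and query a TXT record under origin.asn.cymru.com. Here is a minimal helper to build that query name (pure string handling, no network access; the sample IP is arbitrary):

```python
def cymru_origin_qname(ipv4: str) -> str:
    """Build the TXT query name for Team Cymru's DNS-based
    IP-to-ASN service: reverse the octets, then append the
    origin.asn.cymru.com suffix."""
    octets = ipv4.split(".")
    assert len(octets) == 4, "expected an IPv4 dotted quad"
    return ".".join(reversed(octets)) + ".origin.asn.cymru.com"

# Query the resulting name as a TXT record (for example with
# `dig +short TXT <name>`) to see the announcing ASN.
print(cymru_origin_qname("157.240.1.35"))  # -> 35.1.240.157.origin.asn.cymru.com
```

Running that lookup against the nameservers of both of your DNS providers is a quick way to confirm they really do sit in different ASNs, rather than reselling the same underlying infrastructure.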
There are many examples and stories to be told in this same vein where the culprit is DNS or DNS security. “It’s always DNS” is a bit of a mantra for those of us who have built and managed Internet services and networks for many years.
But I hope you’ll take the above into consideration, and it won’t be DNS.