
Facebook blames worldwide outage on bad router change

Facebook, Instagram, WhatsApp offline for hours Monday, billions worldwide left without access

Facebook, Instagram and WhatsApp stopped working Monday, leaving billions of users worldwide unable to communicate. The disruption began shortly before noon Eastern Time and lasted more than six hours. Users attempting to access the services’ websites saw errors, and mobile apps were unable to communicate with the services for the duration of the outage. The source of the problem, according to Facebook, was a bad configuration change to its own internal network routers.

After services were restored, Facebook Infrastructure VP Santosh Janardhan explained the outage in a blog post:

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt,” said Janardhan.

Janardhan emphasized that user data was not compromised during the outage.

“People and businesses around the world rely on us everyday to stay connected. We understand the impact outages like these have on people’s lives, and our responsibility to keep people informed about disruptions to our services. We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient,” he added. 

Network analysis within and outside of Facebook identified the source of the problem as a Border Gateway Protocol (BGP) update to Facebook data center routers. BGP is the protocol networks use to tell one another which IP address ranges they can reach; updates are routine. The BGP routing error also disrupted Facebook’s own internal networking functionality, which complicated employees’ physical access to data centers as they worked to restore service.

Despite its absence from the public cloud market, Facebook seeks to influence carriers where it can. Facebook Connectivity Vice President Dan Rabinovitsj told RCR Wireless in an interview that his business unit spends resources on “things that we believe will inflect the market.”

Those efforts include integration of its open source packet core software with Amazon Web Services’ edge compute offering, and ongoing work with partners on Open RAN reference designs to help accelerate adoption of disaggregated radio systems at scale.

Facebook’s outage is a cautionary tale for public cloud service providers and a stress test for the Internet itself. As Facebook went dark, other social media services like Twitter saw a dramatic spike in user traffic as people around the globe tried to understand what was happening.

Content Delivery Network (CDN) service provider Cloudflare provided a thorough account of the incident on its blog:

“At 15:58 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com,” said Cloudflare.
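The chain Cloudflare describes, in which withdrawn routes make name servers unreachable and lookups then fail, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a reconstruction of Facebook's or Cloudflare's systems: the prefix and name-server address come from a reserved documentation range, and `RoutingTable` and `resolve` are hypothetical names invented for the illustration.

```python
import ipaddress

# Hypothetical prefix and name-server address taken from a reserved
# documentation range; these are NOT Facebook's real values.
DNS_PREFIX = ipaddress.ip_network("192.0.2.0/24")
NAME_SERVER = ipaddress.ip_address("192.0.2.53")


class RoutingTable:
    """Toy view of the routes a network has learned from its BGP peers."""

    def __init__(self):
        self.routes = {}  # prefix -> next-hop label

    def announce(self, prefix, next_hop):
        self.routes[prefix] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(prefix, None)

    def lookup(self, address):
        """Return the next hop for the longest matching prefix, or None."""
        matches = [p for p in self.routes if address in p]
        if not matches:
            return None
        best = max(matches, key=lambda p: p.prefixlen)
        return self.routes[best]


def resolve(table, name):
    """Simulate a resolver that must first reach the authoritative server."""
    if table.lookup(NAME_SERVER) is None:
        return "SERVFAIL"  # no route to the name server, so no answer
    return f"answer for {name}"


table = RoutingTable()
table.announce(DNS_PREFIX, "peer-A")
before = resolve(table, "example.test")  # route present: lookup succeeds

table.withdraw(DNS_PREFIX)
after = resolve(table, "example.test")   # route gone: lookup fails

print(before)
print(after)
```

In the sketch the name server itself never goes down. It simply becomes unreachable once its prefix is withdrawn, which is the situation Cloudflare says its 1.1.1.1 resolver observed.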

While the problem created Internet traffic disruption for services such as Cloudflare, the failure itself was confined to Facebook and its related services; the rest of the Internet kept working.

“Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. That trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide,” said Cloudflare.
