Two of the most popular services on the web struggled through outages on Friday.
Blogging portal Tumblr blamed a severed fibre connection for an outage which cut its services for several hours, while Google struggled to repair a failure which took down its App Engine platform and slowed some of its services.
Tumblr founder and chief executive David Karp said that the company’s woes began when its ISP suffered a cut to one of its fibre cable lines. Though the company had backup measures in place, the failover systems failed to deploy properly and customers were left without access to their blogs for several hours during the day.
“The major failing here was an oversight in how we had been performing ongoing testing of these backup links,” Karp said.
“Our checks missed an obscure but critical issue in these connections that prevented production traffic from being served properly.”
Google, meanwhile, said that its App Engine service had experienced a crash of its own. The company said that the failure, which began early Friday morning local time, forced engineers to completely restart the App Engine routing system and took nearly eight hours to fully return traffic to normal levels.
Following an investigation, the issue was traced back to an increase in traffic which led to a cascading failure of the entire App Engine routing platform. While service was disrupted, the company said that no application data was lost in the incident.
“In response to this incident, we have increased our traffic routing capacity and adjusted our configuration to reduce the possibility of another cascading failure,” Google App Engine director of engineering Peter Magnusson said in a blog post.
“Multiple projects have been in progress to allow us to further scale our traffic routers, reducing the likelihood of cascading failures in the future.”