CloudFlare DDoS Mitigation Led To Entire Service Outage After Router Failure

Web security service CloudFlare was offline for about an hour this afternoon after their efforts to minimise the effects of a DDoS attack against one of their customers led to a router failure.


The service, which adds a security layer between some 785,000 websites and their users, is now back up and working normally, but during that hour every single one of those websites was down.

Fortunately for affected customers, CloudFlare did a good job of communicating with them via social media.

The cause was later identified as –

“The cause of the outage was a system-wide failure of our edge routers.”

All of CloudFlare’s edge routers are from Juniper –

“We are largely a Juniper shop at CloudFlare and all the edge routers that were affected were from Juniper. One of the reasons we like Juniper is their support of a protocol called Flowspec. Flowspec allows you to propagate router rules to a large number of routers efficiently.”

The DDoS attack mentioned earlier targeted the DNS servers of one of their customers with packets of between 99,971 and 99,985 bytes in length. As this is far bigger than normal packet sizes, a rule was written to drop all packets whose length fell between those two values –

“Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.”
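To illustrate why the rule should have been harmless, here is a minimal Python sketch of a length-based drop rule. It is not Flowspec syntax and not how Juniper's routers evaluate rules; the names `LengthDropRule` and `filter_packets` and the sample packet lengths are made up for illustration. The only outside fact it relies on is that the IPv4 total-length field is 16 bits wide, so no IPv4 packet can be longer than 65,535 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthDropRule:
    """Hypothetical stand-in for a 'drop packets of this length' rule."""
    min_len: int
    max_len: int

    def matches(self, packet_len: int) -> bool:
        # A packet matches when its length lies inside the inclusive range.
        return self.min_len <= packet_len <= self.max_len


def filter_packets(rule, packet_lengths):
    """Return the packet lengths that survive, i.e. do not match the rule."""
    return [n for n in packet_lengths if not rule.matches(n)]


# The range quoted in the article.
RULE = LengthDropRule(min_len=99_971, max_len=99_985)

# Largest value the 16-bit IPv4 total-length field can hold.
IPV4_MAX_TOTAL_LENGTH = 2**16 - 1

# Illustrative lengths only, up to the largest possible IPv4 packet.
sample = [64, 576, 1500, 9000, IPV4_MAX_TOTAL_LENGTH]
kept = filter_packets(RULE, sample)

print(kept == sample)  # nothing in the sample is dropped
```

Because the lower bound of the rule is already above the largest length an IPv4 packet can have, a correct matcher would simply never fire, which is the behaviour CloudFlare says it expected.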

What’s refreshing here is both the speed with which the company dealt with the issue and the openness of their communication with their customers. It also looks like they will be issuing service credits to all of their customers who experienced downtime.

