


Hello,
We had a routing problem last night caused by a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 routers carry the traffic of the Roubaix data centers (RBX1, RBX2, RBX3, RBX4, RBX5) and the links to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the core of the routing in Roubaix.
This bug is known and is linked to the new 24x10G cards we put into production in late January. For some reason, a card will randomly detect ECC RAM errors and stop routing packets. Worse, the card does not go into the "down" state and stays in the router as if it were healthy. The other routers keep sending it packets, but there is no one at the other end. Everything falls into a black hole and the network no longer works properly.
The worst case: a failure that is not clean.
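To illustrate why this kind of failure is so hard to catch, here is a minimal sketch in Python. It is not how our monitoring or the routers actually work. The `CardStatus` type, the `classify` function and the card names are all hypothetical, made up for the example. The only point is that the state a card reports and its ability to forward packets are two different things.

```python
from dataclasses import dataclass


@dataclass
class CardStatus:
    """Hypothetical view of one line card (illustration only)."""
    name: str
    reported_up: bool  # the state the router reports for the card
    probe_ok: bool     # whether a test packet sent through the card came back


def classify(card: CardStatus) -> str:
    """Tell a clean failure apart from a silent one."""
    if not card.reported_up:
        # Clean failure: neighbours see the card go down and reroute.
        return "down"
    if not card.probe_ok:
        # Reported as up but forwards nothing: traffic is silently dropped.
        return "black hole"
    return "healthy"


# Hypothetical card names, one per case.
cards = [
    CardStatus("card-a", reported_up=True, probe_ok=True),
    CardStatus("card-b", reported_up=False, probe_ok=False),
    CardStatus("card-c", reported_up=True, probe_ok=False),
]

for card in cards:
    print(f"{card.name}: {classify(card)}")
```

A check that only looks at `reported_up` would call `card-c` healthy, which is exactly the situation described above.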
Last night, three 24x10G cards on two ASR 9010 routers hit this bug almost simultaneously. This broke the network into 3 pieces: United States / London / Amsterdam / Warsaw, then Roubaix, then Paris / Frankfurt / Madrid / Milan, with the packets being drawn into Roubaix. Normally the traffic would have been rerouted, but here it was sucked in and blocked in Roubaix.
As a result, we were not able to administer the network and retrieve the logs from all the routers to find the origin of the problem. We worked the old-fashioned way, using emergency / out-of-band connections to log in to each backbone router and check whether it was the one causing the problem. This took time, because two routers were affected and it took a while to understand that the problem came not only from rbx-g2-a9 but also from rbx-g1-a9. Once we restarted the three cards, everything came back within 5 minutes.
About three weeks ago, we opened a ticket with Cisco about this ECC RAM problem. Cisco worked on it and was able to provide us this morning with the software patch to apply to the routers to fix it. We will apply it tonight. No outage is expected.
We are also looking at how to improve the management of our routers in the case where the whole backbone is down for some reason, which we hope never happens again. We can handle this case today, but it is slow. Very slow.
In any case, the outage exceeded what our 99.9% availability commitment allows: it lasted 1 h 22 min, whereas we are "entitled" to 43 minutes of downtime per month. Penalties are therefore triggered for exceeding the allowed time. Example: on OVH dedicated servers, it is 5% per hour of downtime.
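For those who want to check the numbers, here is a small sketch in Python of where the 43 minutes come from. It assumes a 30-day month, and the function name `downtime_budget_minutes` is just something made up for the example. It does not compute the penalty itself, since that depends on the terms of each contract.

```python
from fractions import Fraction


def downtime_budget_minutes(availability_percent: str, days_in_month: int = 30) -> Fraction:
    """Minutes of downtime allowed per month at a given availability.

    Assumes a 30-day month by default. Fractions are used to avoid
    floating-point rounding noise.
    """
    total_minutes = days_in_month * 24 * 60
    unavailable_share = 1 - Fraction(availability_percent) / 100
    return total_minutes * unavailable_share


budget = downtime_budget_minutes("99.9")  # 43.2 minutes per month
outage = 1 * 60 + 22                      # 1 h 22 min = 82 minutes
overrun = outage - budget                 # time beyond the allowed budget

print(f"budget:  {float(budget):.1f} min")
print(f"outage:  {outage} min")
print(f"overrun: {float(overrun):.1f} min")
```

With these assumptions, the outage used up the whole monthly budget and went about 39 minutes over it.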