Banks wondering how easily an embarrassing network outage can occur need look no further than Google, meaning the company itself, not the search engine.
An early May outage affecting about 15 percent of Google users was caused by a single router configuration error, which sent Web traffic through Asia, where it eventually jammed. In a post on a Google company blog, data center supervisor Urs Holzle wrote that many outages at Google are caused by configuration issues and the rate of change, likening the process of upgrading infrastructure to changing the tires on a car while traveling on a freeway.
“If an outage can happen to Google, it can pretty much happen anywhere,” says Lou Nardo, a vp at Netcordia, a network management firm that counts Citizens Bank, Fannie Mae, Oppenheimer and Morgan Keegan among its financial services clients.
Yama Habibzai, also a vp at Netcordia, says the takeaway for financial institutions from the Google outage is that changes in configuration can very easily and unwittingly cause network outages. “As long as people are typing things into the system, there’s the chance for errors,” Habibzai says.
Habibzai says human error can’t be avoided, but there are ways to locate possible errors. For example, he says dashboards can chart configuration changes and their possible impact downstream in the network. “If a change happens, you know how the change was made, when, and by whom,” Habibzai says. “You also know the impact of the change, and that’s where the value is.”
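As a rough illustration of the kind of record such a dashboard might draw on, the Python sketch below logs who changed a device's configuration, when, and which lines changed, then lists the devices downstream of it. It is not Netcordia's product or method; the names `ConfigChange`, `record_change` and `downstream_of`, the device names, and the topology map are all hypothetical.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ConfigChange:
    """One audited change: which device, by whom, when, and what changed."""
    device: str
    author: str
    timestamp: datetime
    diff: list


def record_change(device, author, old_config, new_config, when=None):
    """Build an audit record holding only the added and removed config lines."""
    when = when or datetime.now(timezone.utc)
    raw = difflib.unified_diff(
        old_config.splitlines(), new_config.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0,
    )
    # Keep "+"/"-" lines; drop the "+++"/"---" file headers and "@@" hunk markers.
    diff = [
        line for line in raw
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    return ConfigChange(device, author, when, diff)


def downstream_of(device, links):
    """Return every device reachable downstream of `device`.

    `links` is an assumed topology map: device name -> list of devices it feeds.
    """
    seen, stack = set(), [device]
    while stack:
        for nxt in links.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical example: one route edited on an edge router.
old = "hostname edge1\nip route 10.0.0.0/8 192.0.2.1"
new = "hostname edge1\nip route 10.0.0.0/8 192.0.2.99"
change = record_change("edge1", "jsmith", old, new)
links = {"edge1": ["core1"], "core1": ["branch-a", "branch-b"]}
affected = downstream_of(change.device, links)
```

Here `change.diff` holds the old and new route lines, and `affected` names the three devices that sit behind the edited router, which is the "impact" half of the picture Habibzai describes.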
Internal Netcordia research found that as much as 80 percent of network outages were caused by changes, and 400 respondents to a company survey on network security said firms’ number one fear is outages caused by internal human error. These errors can be hard to find because of the size of most networks. Additionally, some errors may not trigger an outage for a long time because the mistake sits in a redundant link that isn’t used unless traffic rises beyond a certain threshold.
“Someone may have put a wrong address in a router, and it may have been there forever but wasn’t triggered until there was a spike in demand at the network,” Habibzai says.
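One way to surface a dormant mistake like this before a traffic spike does is to check every configured route, including standby ones, against an inventory of known-good addresses. The Python sketch below assumes a simple list of route records and an address inventory; the function name `find_latent_errors`, the record fields, and the addresses are hypothetical and stand in for whatever a real network management tool would supply.

```python
import ipaddress


def find_latent_errors(routes, valid_next_hops):
    """Return routes whose next hop is not in the known-good inventory.

    Backup routes are checked alongside primary ones, so a wrong address
    is flagged even if that link has never carried traffic.
    """
    known = {ipaddress.ip_address(addr) for addr in valid_next_hops}
    return [r for r in routes if ipaddress.ip_address(r["next_hop"]) not in known]


# Hypothetical data: the backup route points at an address no device owns.
routes = [
    {"link": "primary", "next_hop": "192.0.2.1"},
    {"link": "backup", "next_hop": "192.0.2.250"},
]
inventory = ["192.0.2.1", "192.0.2.2"]
suspect = find_latent_errors(routes, inventory)
```

In this example only the backup route is returned in `suspect`, even though the primary link would have kept the network running normally until demand forced a failover.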