I've spent many years managing IT teams. One such team was a group of engineers responsible for servers at a large bank in New York in the late '90s. Even though that was 15 years ago, many challenges I experienced remain relevant today. Here's a good example.
Like many IT groups, we simply didn't understand the business importance of what we were managing until a bright Wednesday morning in November. What started with a simple need to make small changes to the
The morning of the changes, I began to copy and paste between an email with the changes and the admin
When I hit the save button, I received an "Are you sure?" prompt. I moused over the "yes" button and began to raise my finger to click. Just then, I noticed that the login script behind the prompt was completely blank. Everything went into slow motion as my index finger clicked the mouse.
I had just erased the login script for all bank employees.
I started to panic. Our IT backup person wasn't due in until 8 a.m., so I asked another operations guy whether he could run a restore. He said, "Sure - what file do you need?" I said, "It's not a file, it's an
"A what? I don't know what that is or how to restore it," he said.
I also realized there was no way to rebuild the script from memory. Grasping for any solution, I thought someone must have a printout. I found hard copies, but the newest one was three years old, and I was advised not to use it because the script had more than likely changed significantly since then. I then escalated the restore from backup. The operations team was already working on it, but the progress bar said it would take four hours to complete.
I decided I needed to tell my boss, who asked me to call his boss, our VP of Infrastructure. I told the VP what had happened, how we were trying to recover and what the expected completion time would be. He was silent for a while, then asked me what I thought the business impact was. I went on about IT stuff like mapped drives and printers. He stopped me mid-sentence.
He said he knew the technology impacts, but what he wanted to know was the business impact.
He rattled off a long list of things the bank would not be able to do: the ATM network would be down, branches could not open, checks would not be printed, and thousands of employees would be unable to work.
Based on his quick assessment, it was clear I had pretty much shut down a $12 billion bank with a few keystrokes and couldn't reverse the damage.
With better documentation and a history of changes to the script, I could have recreated it from scratch in an hour or less. If I had known the business value of the script and its centrality to critical functions, I wouldn't have touched it without a more rigorous impact analysis and would have collaborated with my peers to identify ways to reduce risk.
Services were restored in about three hours, around 11 a.m. Afterward, we updated procedures to make sure we had a fallback plan and current configuration information at all times. For my honesty, I was thanked even though I had caused the problem.
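The kind of fallback plan and change history described above can be sketched in a few lines. This is only an illustration, not what we built at the bank: it assumes the configuration is a plain text file on disk (ours was not), and the names `safe_write` and `history_dir` are hypothetical. The idea is to refuse a blank save outright and to keep a timestamped copy of the old version before anything is overwritten.

```python
import os
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def safe_write(path: Path, new_text: str, history_dir: Path) -> Optional[Path]:
    """Replace `path` with `new_text`, keeping a timestamped copy of the old version.

    Returns the path of the saved copy, or None if `path` did not exist yet.
    (Hypothetical helper for illustration only.)
    """
    # Guard against the failure in the story: saving a blank script.
    if not new_text.strip():
        raise ValueError(f"refusing to overwrite {path} with empty content")

    backup = None
    if path.exists():
        history_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        backup = history_dir / f"{path.name}.{stamp}"
        shutil.copy2(path, backup)  # fallback copy, made before any change

    # Write to a temporary file in the same directory, then swap it in,
    # so a failed write never leaves a half-written script behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return backup
```

With something like this in place, every change leaves the previous version sitting in the history directory, so recovering from a bad edit means copying one file back rather than waiting hours on a full restore.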
These types of issues still occur in IT every day. Our technical environments have become so complex and the pace of change so rapid that no individual, however experienced, can understand them end-to-end. We need more collaboration among experts to ensure knowledge is shared, relevant stakeholders are informed, and risks are reduced. If an IT problem has roiled your organization recently, there's an 80% chance someone in IT changed something they didn't fully understand.