BankThink

Oops, I Brought Down the Bank: A Cautionary Tale

November 06, 2013, 12:00 p.m. EST

I've spent many years managing IT teams. One such team was a group of engineers responsible for servers at a large bank in New York in the late '90s. Even though that was 15 years ago, many challenges I experienced remain relevant today. Here's a good example.

Like many IT groups, we simply didn't understand the business importance of what we were managing until a bright Wednesday morning in November. What started with a simple need to make small changes to the login script for bank employees turned into a catastrophe.

The morning of the changes, I began to copy and paste between an email with the changes and the admin console, a two-minute job at most. As I was copying and pasting the information, a few folks walked in and we started chatting about Thanksgiving, while I continued working in the background.

When I hit the save button, I received an "Are you sure?" prompt. I moused over the "yes" button and began to raise my finger to click. Just then, I noticed that the login script behind the prompt was completely blank. Everything immediately went into slow motion as my index finger clicked the mouse.

I had just erased the login script for all bank employees.

I started to panic. Our IT backup person wasn't due in until 8 a.m., so I asked another operations guy whether he could restore something for me. He said, "Sure - what file do you need?" I said, "It's not a file, it's an NDS object."

"A what? I don't know what that is or how to restore it," he said.

I realized there was no way to rebuild the script from memory. Grasping for any solution, I thought someone must have a printout. I found hard copies, but the newest one was three years old, and I was advised not to use it because the script had more than likely changed significantly since then. I then escalated the backup restore; the operations team was already working on it, but the progress bar said it would take four hours to complete.

I decided I needed to tell my boss, who asked me to call his boss, our VP of Infrastructure. I told the VP what happened, how we were trying to recover and what the expected completion time would be. He was silent for a while, then he asked me what I thought the business impact of this was. I went on about IT stuff like mapped drives and printers. He stopped me mid-sentence.

He said he knew the technology impacts, but what he wanted to know was the business impact.

He rattled off a long list of things the bank would not be able to do: the ATM network would not run, branches could not open, checks would not be printed, and thousands of employees would not be able to work.

Based on his quick assessment, it was clear I had pretty much shut down a $12 billion bank with a few keystrokes and couldn't reverse the damage.

With better documentation and a history of changes to the script, I could have recreated it from scratch in an hour or less. If I had known the business value of the script and its centrality to critical functions, I wouldn't have touched it without a more rigorous impact analysis and would have collaborated with my peers to identify ways to reduce risk.

Services were restored in about three hours, around 11 a.m. Afterward, we updated procedures to make sure we had a fallback plan and current configuration information at all times. For my honesty, I was thanked even though I had caused the problem.
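As a rough illustration of the kind of fallback procedure described above, here is a minimal Python sketch. It assumes the script lives in an ordinary text file, which was not the case in this story (the login script was an NDS object). The function names `save_with_backup` and `restore_latest` and the backup directory layout are hypothetical, chosen only for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import shutil


def save_with_backup(script: Path, new_text: str, backup_dir: Path) -> Optional[Path]:
    """Overwrite `script` with `new_text`, keeping a copy of the old version.

    Returns the path of the backup copy, or None if there was no old version.
    Hypothetical helper: assumes the script is a plain text file.
    """
    # Guard against the failure in the story: saving a blank script.
    if not new_text.strip():
        raise ValueError("refusing to save an empty script")

    backup = None
    if script.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        # A UTC timestamp suffix keeps a sortable history of changes.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        backup = backup_dir / f"{script.name}.{stamp}"
        shutil.copy2(script, backup)

    script.write_text(new_text)
    return backup


def restore_latest(script: Path, backup_dir: Path) -> Path:
    """Copy the most recent backup back over `script` and return its path."""
    backups = sorted(backup_dir.glob(f"{script.name}.*"))
    if not backups:
        raise FileNotFoundError("no backup available")
    shutil.copy2(backups[-1], script)
    return backups[-1]
```

With something like this in place, a bad save can be undone in seconds by calling `restore_latest`, and the timestamped copies double as a simple change history.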

Today, these types of issues occur in IT every day. Our technical environments have become so complex and the pace of change so rapid that no individual, however experienced, can understand them end-to-end. We need more collaboration among experts to ensure knowledge is shared, relevant stakeholders are informed, and risks can be reduced. If an IT problem has roiled your organization recently, there's an 80% chance someone in IT changed something they didn't fully understand.

Joe Rogers is the director of technical services for ITinvolve, an IT management company.