Some bankers have pronounced the data warehouse obsolete, but Dr. Jim Goodnight says it's more relevant than ever, as cheaper hardware and overlapping informational needs across the organization make sharing a large dataset companywide feasible and worthwhile.

Few people know more about analytics software than Goodnight, co-founder and CEO of SAS, a provider of such software based in Cary, N.C. He's run the company since 1976, when he and some North Carolina State University colleagues wrote the original SAS software to analyze agricultural research data. Today, according to the company, 99 of the top 100 global banks use SAS. Goodnight (who is always introduced as "Doctor," for his statistics PhD) is still hands-on enough to weigh in on his favorite hardware platforms and modeling techniques.

As banks step up their use of analytics in all areas of their business, including fraud, risk and customer intelligence, Bank Technology News asked Goodnight for his take on getting bankers to share data internally; addressing the industry's shortage of data scientists; and handling stress tests. (At one point in the conversation, David Wallace, the global financial services marketing manager at SAS, chimed in.)

Some people say the data warehouse is dead — that you don't want to store all the data you possibly can, but rather build for-purpose databases that let you address specific business problems around things like fraud, risk, and marketing.

JIM GOODNIGHT: It's the same data. Fraud, customer intelligence, compliance -- if you have the right set of data all together, you can use that set of data; you don't have to keep looking for data every time you need something. That's one thing Hadoop [a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment] provides. It's a great place to store data. Also, you're buying these 1.2 terabyte disks at about $300 apiece. You can hang 20 of these on a server and it's local, so you can read the data straight on the machine.
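As a rough check on the arithmetic behind those figures, the short Python sketch below multiplies out the quoted disk size, price and count. The function name is hypothetical, and the result covers only the disks, not the server, networking or software. It comes to roughly 24 terabytes of raw local storage for about $6,000 in disks.

```python
# Figures quoted in the interview; the helper below is purely illustrative.
DISK_CAPACITY_TB = 1.2   # "1.2 terabyte disks"
DISK_PRICE_USD = 300     # "about $300 apiece"
DISKS_PER_SERVER = 20    # "you can hang 20 of these on a server"


def local_storage(disks: int, capacity_tb: float, price_usd: float) -> tuple[float, float]:
    """Return (raw capacity in TB, total disk cost in USD) for one server."""
    return disks * capacity_tb, disks * price_usd


capacity_tb, disk_cost_usd = local_storage(
    DISKS_PER_SERVER, DISK_CAPACITY_TB, DISK_PRICE_USD
)
print(f"{capacity_tb:.1f} TB of raw local storage for ${disk_cost_usd:,.0f} in disks")
print(f"about ${disk_cost_usd / capacity_tb:,.0f} per raw TB")
```

Raw capacity overstates usable space if the data is replicated across disks or servers, as distributed file systems typically do.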

So where banks used to have multimillion-dollar data warehouse projects, cheaper hardware is making it less expensive and more doable.

It's $100,000, and they can store more data.

How do you aggregate data from many different sources for a unified database without making it a mammoth project that takes three years?

That's one of the hardest things, to get different data owners to work together. It takes commitment from the CEO to say "we're going to do this." I've tried to explain that a lot of the data we need for compliance or anti-money laundering is almost the same data you need for marketing purposes, yet, when you're talking to compliance people, they say, "We're not interested in marketing, we're only interested in compliance." You cannot convince those people this could do something great for the bank. It's maddening.

Do banks need to provide monetary incentives to get people more interested in breaking down the silos?

Banks already provide enough monetary incentives to their employees. We recently hired a guy from a bank and he said, "What about my performance bonus?" I said, "If you perform well, you'll get a bonus." He said, "No, if I'm here April 1, I get paid this big chunk of money." I said, "That's a retention bonus." He said, "No, we can call it a performance bonus." They're just paying people a bonus for still being there.

I would like that.

That rarely exists outside of financial services.

DAVID WALLACE: We have a few banks that have chief data officers, such as Charles Thomas at Wells Fargo. Those people might help break down silos.

Maybe we'll see more people promoted to that position.

GOODNIGHT: It depends on how much power they have. If the person in marketing or compliance doesn't report to that person, they're not going to listen to them.

Some bankers, for instance some chief technology officers at Sibos this year, were saying you don't have to store all customer data. One banker said he doesn't care if a customer in India just had a cup of tea and posted that on Facebook. There are all sorts of trivia and data you could store about customers, but what you really want are the useful nuggets. He said in a payment transaction, you could collect 60 data points, but only one or two are relevant; for instance, where there was a delay or a failure in the payment.

What data you collect and keep is determined by the models you develop to do forecasting or prediction. In credit card fraud, we maintain 600 variables. Every time there's a transaction, most of them get updated. You build your models, you work and work on your models, and that determines what data you need to keep. Everything else you can throw away, because if it's not in your model you're not going to use it.
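To make the idea concrete, here is a minimal Python sketch of a per-card profile that is refreshed on every transaction and that discards any incoming field the model does not use. The field names, the handful of variables and the `CardProfile` class are all hypothetical; the interview does not describe the actual 600 variables or how they are computed.

```python
from dataclasses import dataclass

# Hypothetical: the only raw transaction fields this toy model consumes.
MODEL_FIELDS = {"amount", "merchant_country"}


@dataclass
class CardProfile:
    """A few illustrative rolling variables kept per card."""
    txn_count: int = 0
    mean_amount: float = 0.0
    max_amount: float = 0.0
    last_country: str = ""
    country_changes: int = 0


def keep_model_fields(raw_txn: dict) -> dict:
    """Drop every field the model does not use."""
    return {k: v for k, v in raw_txn.items() if k in MODEL_FIELDS}


def update_profile(profile: CardProfile, raw_txn: dict) -> CardProfile:
    """Update the rolling variables in place from one transaction."""
    txn = keep_model_fields(raw_txn)
    profile.txn_count += 1
    # Incremental mean, so no transaction history needs to be stored.
    profile.mean_amount += (txn["amount"] - profile.mean_amount) / profile.txn_count
    profile.max_amount = max(profile.max_amount, txn["amount"])
    if profile.last_country and txn["merchant_country"] != profile.last_country:
        profile.country_changes += 1
    profile.last_country = txn["merchant_country"]
    return profile


profile = CardProfile()
for txn in [
    {"amount": 10.0, "merchant_country": "US", "memo": "coffee"},
    {"amount": 30.0, "merchant_country": "US", "memo": "fuel"},
    {"amount": 20.0, "merchant_country": "FR", "memo": "hotel"},
]:
    update_profile(profile, txn)
print(profile)
```

The `memo` field never reaches the profile, which mirrors the point that data outside the model can be thrown away.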

Do you think banks are good at modeling?

They spend a lot of time on it. That's the thing about our new high performance solution. You can run models in a minute or two that used to take 14 hours, because you're doing it spread out over massively parallel computing. The ability to model has grown a hundredfold because of the speed of what you can do on this inexpensive hardware. We're seeing more banks beginning to realize they can make their modelers more efficient just by upgrading one piece of hardware.

What about community banks — they typically don't have the treasure trove of data and the staff of data scientists that many of the large banks have.

What we're doing for them is developing hosted apps. We have a Basel III app for small banks. They fill out a spreadsheet and send it to us. We do all the computations, all the reports and make them available to the bank. We're doing the same for the [Dodd-Frank Act stress tests] the midtier banks have to do. It's low-level stress testing, nowhere near what the big banks have to do. Most midtier banks don't have a group of risk experts that know how to do that. We're trying to provide the know-how for midtier banks to do stress testing.

I've heard with stress testing the examiners really want to know how people come up with results. They want to see a big manual that documents the process. Do you provide that too?

Yes. The point is this will help standardize the modeling. The Feds don't tell you what model to use. They look at your model and compare it to theirs, but they never share theirs.

What do you think of the Fed's economic scenarios?

My only concern is they're coming in three levels: bad, really bad, and really really really bad -- technically, standard, adverse, and extremely adverse. With extremely adverse, unless we have experienced some situations like that in the past, chances are the model is not going to work as well. Forecasts are best run in areas where you have complete coverage and past experience. All of a sudden you go out on a limb and say, "OK, models, here you are, tell us about this extreme case." I don't know if models will work in that case.

Do you agree with the concept of bringing social media data into credit decisions, or is that unproven?

That's one way to assist decisions on creditworthiness. If somebody misses every other electric payment, you wouldn't want to lend to them.

Banks say one struggle in competing with online and marketplace lenders like OnDeck, which are having success with data-driven lending, is that the banks' own costs are high, especially for small business loans.

Their price of capital is zero. My gosh. What is it, 0.2%? There is no cost.

Someone was just telling me there's a German bank that's charging customers to keep more than a half million in deposits.

Some U.S. banks are saying "if you keep euros in your account with us, we're going to charge you 0.2%." That's going to drive customers to other banks.

A lot of banks say they are having trouble finding "data scientists" with the right skills for analytics work — a blend of a facility with numbers, business practicality and common sense. Are you seeing that, and what can you and bankers do to grow that talent?

We have helped establish 30 university master's degree programs in advanced analytics. The first was at North Carolina State; that's the biggest one now. It's been going for about seven years. It's a 10-month course and it's very intensive: it's 9-to-5 and you've got to come in every day. In that course you learn to use SAS, of course, and you learn different methods from different professors to apply to different types of problems. So for loss modeling, you'll be taught logistic regression, and for time series, we have an econometrics group that comes in for a couple of weeks and teaches some of the econometrics methods you can apply to data.
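For readers unfamiliar with the technique named here, the following is a minimal sketch of logistic regression in plain Python, fitted by gradient descent on a made-up data set that relates a single variable (a hypothetical credit utilization ratio) to whether a loan defaulted. The data, variable names and training settings are invented for illustration. It is not the course material, and real loss models use many variables and purpose-built statistical software.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(default) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Invented data: utilization ratio and whether the loan defaulted (1) or not (0).
utilization = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(utilization, defaulted)


def default_probability(x: float) -> float:
    """Predicted default probability for a given utilization ratio."""
    return sigmoid(w * x + b)


print(f"low utilization:  {default_probability(0.1):.2f}")
print(f"high utilization: {default_probability(0.9):.2f}")
```

The fitted probability rises with utilization, which is the kind of relationship a loss model is meant to capture.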

Then we have two to three weeks of operations research, where you learn how to optimize data. In a marketing campaign, you might not want to make the same offer on two different channels. All that is an operations research optimization problem. We're dealing with a telecom in Europe that has 18 million customers, but they have a total of 600 offers they want to make to customers during the year. We helped them optimize which ones to use on which channels to maximize their expected revenue. Those kinds of things are not taught in class. It's being able to have lots of data at your disposal. They've got 20 data sets they use that are provided by different companies that are looking for solutions and the students help provide those solutions.
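The offer-and-channel problem described above can be illustrated with a toy Python sketch. It assumes one customer, two channels, three offers and made-up expected revenue figures, and it tries every way of placing distinct offers on the channels. All names and numbers are hypothetical, and exhaustive search is a stand-in chosen for clarity. A problem with millions of customers and hundreds of offers would need a proper optimization solver.

```python
from itertools import permutations


def best_assignment(expected_revenue, channels, offers):
    """Pick one distinct offer per channel to maximize total expected revenue.

    Brute force over all orderings; assumes len(offers) >= len(channels).
    """
    best_total, best_plan = float("-inf"), None
    for chosen in permutations(offers, len(channels)):
        total = sum(expected_revenue[(ch, off)] for ch, off in zip(channels, chosen))
        if total > best_total:
            best_total, best_plan = total, dict(zip(channels, chosen))
    return best_plan, best_total


channels = ["email", "sms"]
offers = ["card", "loan", "savings"]

# Made-up expected revenue for one customer, by (channel, offer).
expected_revenue = {
    ("email", "card"): 5, ("email", "loan"): 4, ("email", "savings"): 1,
    ("sms", "card"): 6, ("sms", "loan"): 2, ("sms", "savings"): 1,
}

plan, total = best_assignment(expected_revenue, channels, offers)
print(plan, total)
```

In this example the card offer looks best on both channels, but because the same offer cannot go out twice, the best plan sends the loan offer by email and the card offer by SMS.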

At N.C. State, we graduated 82 students last year. Everyone got at least three job offers, and some got five or six. There's a huge demand for this type of education.

We're sponsoring [this program] in as many places as we can to help them get started. We provide materials free and funding. LSU and Texas A&M have picked up courses.

That reminds me of some of IBM's mainframe courses, which are meant to interest young students in the technology.

We're seeing a shift away from that kind of hardware to commodity. People are dropping AIX boxes and going to Linux x86 boxes and Intel chips. The chipsets are incredible. Dell has a machine we really like called an R920. It has four slots, so you could put in four chips, each with 16 cores, and end up with a server with 64 processors. You can put three terabytes of memory in that machine, and it's about $100,000.