Banks Turn to 'Spark' Technology to Crunch Big Data

September 25, 2015, 3:38 p.m. EDT

For several years big data has been nearly synonymous with Hadoop, a relatively inexpensive way to store huge amounts of data on commodity servers. But recently banks have started using an alternative called Spark, with its promise of faster speeds and ability to handle diverse data types.

One such financial institution, which has brokerage and retail banking operations, has used big data to reduce customer churn by 25% by extracting more meaningful information from its interactions with clients, said Rob Thomas, vice president of product development, big data and analytics at IBM.

"Naturally, they have split platforms between banking, retail, trading and investment," Thomas said. "But they want a single view of a party, whether it is a consumer or another company. They needed a consolidated view of the relationship."

The data necessary for that consolidated view resides in different systems. To bring it together, the firm uses Apache Spark, an analytical engine that runs in-memory and is up to 100 times as fast as MapReduce, the batch-processing engine of the popular Hadoop platform.

"They use Spark as a unifying layer," he said. "Then they can build analytic models on top of Spark, access data from each repository, and use machine learning to automate the analytics for each of those customer records and correlate them into single customer files that they can pass on to marketing."
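The "unifying layer" idea can be sketched in a few lines. This is only a rough illustration: plain Python stands in for Spark, and the system names, field names and customer IDs are all hypothetical rather than taken from the firm Thomas describes.

```python
from collections import defaultdict

# Hypothetical extracts from three separate systems; every name is illustrative.
banking = [{"customer_id": "C1", "checking_balance": 1200.0}]
trading = [{"customer_id": "C1", "positions": 4},
           {"customer_id": "C2", "positions": 1}]
retail = [{"customer_id": "C2", "card_spend": 310.5}]


def consolidate(**sources):
    """Merge per-system records into one dict per customer_id."""
    view = defaultdict(dict)
    for system, records in sources.items():
        for record in records:
            key = record["customer_id"]
            for field, value in record.items():
                if field != "customer_id":
                    # Prefix each field with its source system to avoid collisions.
                    view[key][f"{system}.{field}"] = value
    return dict(view)


single_view = consolidate(banking=banking, trading=trading, retail=retail)
```

Here `single_view["C1"]` holds both the banking and the trading fields for the same customer, which is the kind of single record that could be handed to marketing.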

Another financial firm wanted a way to analyze the text in regulatory filings, both its own and competitors' reports.

"They needed an analytical layer on top of Hadoop that can analyze the text of regulatory filings, to look at patterns of what is happening, the marketing around them and their competition," Thomas explained. "They wanted to comb through annual reports and regulatory filings stored in Hadoop and use our tools to analyze text using Spark."

Tom Davenport, a big data expert, finds banks often use a variety of programs to manage data. A book he co-wrote called "Big Data at Work" cited the example of a bank that had four alternatives: a Hadoop cluster, a Teradata Aster big data appliance, a light but flexible data warehouse and a heavy duty Teradata enterprise data warehouse.

Emmett Cox, senior vice president for customer and business intelligence at BBVA Compass, said the bank has to assimilate lots of disparate pieces of data, including text, streaming and traditional data, and find ways to organize it.

"That's the magic of big data. Some organizations have tremendous amounts of data — that's just volumes," added Cox, who has worked at Walmart and written a book on the topic, "Retail Analytics."

BBVA several years ago replaced an old core system with Accenture's Alnova, which operates in real-time so it can manage transactions more quickly. Compared to retailers, banks don't have as many transactions but they have to transact as the customer requires and provide a response to the consumer.

"As we get into digital, the bank's relationship with customers is much more one-to-one," Cox said. "As you get into the digital side with millennials and new customers, the one-to-one relationship is core to what they need on an individual basis; it's not so much mass communication anymore, so it forces you to understand the data."

Thomas said big data technology is taking off, and Hadoop is evolving to provide dumb storage for Spark projects, which can run analysis at a hundred times the speed of MapReduce, expanding its role.

"Spark is still very young," Thomas added. "Half the banks have never heard of it."

IBM has opened a Spark technology center in San Francisco where it expects to hire 300 people, the largest investment the company has made, akin to its support for Linux years ago, he added.

David Wallace, global financial services marketing manager at SAS, said at many banks Hadoop implementation is still a work in progress.

"We see Hadoop as a data-storage technique, and we have been continually enhancing our ability to interact directly with it, starting with traditional SAS access routines, but now we can analyze the data where it sits in the Hadoop ecosystem."

Banks really want to understand exactly what customers are doing at any point in time, on the Internet and by mobile, he added, so they want the information as close to real-time as possible.

SAS has been used by banks for 30 years, so it gets to see what they are doing, what works, and what doesn't.

A large European bank has been mining all of the data associated with customer complaints, he said, providing an example of big data at work.

"Before using big data effectively, they were only analyzing about 5% of the complaint data that came in. Now they are mining all the texts from various systems including call centers and email."

One advantage of big data is that it captures all the information while older systems use sampling to detect patterns, often losing valuable information in outliers at either end of the curve. The bank in Europe could see if customers were having problems in navigating the online banking site, and change it to make it easier to use.

The firm reduced complaints by 25% and improved customer and employee satisfaction by 100% in two years, said Wallace. It was able to retain more customers and also in some cases double their balances by proactively contacting them if there was an issue. It could also identify customers who had less than optimal products and reach out to suggest another product that had more features and was cheaper.

"The bank can focus efforts on trying to improve the customer relationship."

A rather unexpected bonus was that text capture showed some customers were sending in compliments that had been lost in the previous system. Capturing those and passing them along to staff proved to be a great motivator.

An Asian bank saw a 60% uptake in offers made through its call center because it had a more accurate view of customers.

"It's not just making customers happier, it generates added revenue, $200 to $300 million a year."

Some SAS customers in eastern Europe are using SAS Credit Scoring to offer and approve loans at Lending Club speed, he said.

"Banks are concerned about fintech competitors, but the same techniques and rigorous analytical insights can be derived with SAS."

Clearing Confusion about Spark

Abhi Mehta, founder and CEO of the financial services big data company Tresata, said that most banks are using only about 5% of their data. While the promise of big data is getting broad attention, not everyone is up to date.

"By its nature, a bank is a fabric of data," said Baltagi. "It generates huge amounts of data on a daily basis about clients, competition and what they are doing."

Moving to enterprise, big-data platforms requires some caution. Open source has changed the business and now an Apache certification is valued more highly than Oracle or IBM, he said. But banks need to consider what in the open source ecosystem will stick and who will provide enterprise support.

While firms can now find some experts in MapReduce, Spark, which is superior in many ways, still has only a small number of developers who understand it, said Mehta. It is 100 times as fast as MapReduce, and can run in-memory or on disk on data stored in Hadoop, NoSQL or cloud storage such as Amazon S3 or Databricks Cloud. Mehta said a brilliant aspect of Spark is that it takes advantage of the unused memory in a Hadoop cluster to run processes faster when required. SAP, which announced an agreement with Databricks to use Apache Spark a year ago, launched a big data query engine called HANA Vora at the beginning of September, another indication of how quickly this field is moving.

"Some of these firms missed Hadoop and they don't want to miss Spark," said Mehta.

Spark is not fully streaming in its processing; it runs micro-batches in as little as 0.5 seconds, which is very fast but probably not as fast as low-latency financial apps would require. For that, firms can turn to Flink, formerly known as Stratosphere, from Berlin's Technical University.
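To see why micro-batching adds latency, consider a minimal sketch in plain Python. It is not Spark's API; the function name, the event names and the timestamps are hypothetical, and only the 0.5-second interval comes from the text above.

```python
from collections import defaultdict


def micro_batches(events, interval=0.5):
    """Group (timestamp_seconds, payload) events into fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Integer index of the interval this event falls into.
        batches[int(timestamp // interval)].append(payload)
    # Return the batches in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical trade events, timestamped in seconds since the stream started.
events = [(0.05, "trade-a"), (0.30, "trade-b"), (0.70, "trade-c"), (1.20, "trade-d")]
batched = micro_batches(events)
```

The event arriving at 0.05 seconds is grouped with the one at 0.30 seconds, and neither can be processed until its interval closes. In the worst case an event waits nearly the full interval, whereas a true streaming engine handles each event as it arrives.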

Mehta said it is still too early to tell whether Flink will become successful. Another new big data tool is Apache Parquet, a columnar storage format built for high performance; it was created by Twitter and Cloudera and draws on ideas from Google's Dremel paper.
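The appeal of a columnar layout can be shown with a toy example. This sketch does not use Parquet itself, and the account records and field names are hypothetical; it only contrasts storing data row by row with storing each field as its own list.

```python
# Hypothetical account records stored row by row.
rows = [
    {"account": "A1", "balance": 250.0, "region": "EU"},
    {"account": "A2", "balance": 900.0, "region": "US"},
    {"account": "A3", "balance": 125.5, "region": "EU"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single column and ignores the rest.
total_balance = sum(columns["balance"])
```

Because analytic queries usually touch a few fields across many records, reading one contiguous column is typically cheaper than scanning every full row, and values of the same type tend to compress well together.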

Following some of the online conversations about the new tools shows how fast they are moving. New releases of Spark fix some earlier problems, Flink became an Apache Top Level Project at the end of 2014 and a beta of a Python API for it has appeared. The program names and abbreviations come fast and furious — YARN, Storm, Kafka, Flume, Chukwa, Sqoop, Mesos, Kubernetes — no wonder banks and bank technology groups get confused.

Vendors aren't helping, said Baltagi, who thinks they intentionally confuse prospects.

"This is first generation, there will be other generations, and we don't know what they will be, but Spark is not the end of the world," he added. "If you bet on only one tool in the big data space, that is a recipe for failure. You'd better be friends with all these tools."

Mehta said he has seen the confusion in visits to banks. One bank CEO he met with assumed Hadoop was some sort of search tool since he had heard it came from Google. The lead architect of big data at another bank didn't realize that Hadoop wasn't just storage but had a computation engine as well.

Many don't understand the scale of their data — they have data warehouses with capacity of 100 or perhaps 200 terabytes while the bank's total data is 100 petabytes. The resulting mismatch means banks end up sampling, while big data platforms hold the promise to store and analyze all the data, including the often informative outliers that sampling discards.
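A small sketch shows how sampling can miss outliers. The transaction amounts are invented, and the sample takes every 20th record (5%) so that the result is repeatable; a random sample could catch an outlier by chance, but offers no guarantee.

```python
# Hypothetical transaction amounts: mostly routine, with two rare outliers.
amounts = [100.0] * 1000
amounts[137] = 250_000.0
amounts[642] = 0.01


def systematic_sample(values, rate=0.05):
    """Keep every (1 / rate)-th value, starting with the first."""
    step = round(1 / rate)
    return values[::step]


sample = systematic_sample(amounts)

# Scanning everything finds both outliers; the 5% sample finds neither.
full_outliers = [amount for amount in amounts if amount != 100.0]
sample_outliers = [amount for amount in sample if amount != 100.0]
```

The sample holds 50 of the 1,000 records and contains only routine values, while the full scan surfaces both unusual transactions.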

"I ask how much of your data is under management, actually monitored, and at first they don't understand the term," said Mehta. "For their part, regulators don't understand that the technology is here to allow a dynamic view of the balance sheet risk at a bank. They do understand that what is being done doesn't work."