What type of data is fair in credit models?

By Penny Crosman, October 31, 2017, 9:14 p.m. EDT

It's long been a mantra in the fintech community: Traditional underwriting models that rely heavily on conventional credit scores leave out people who haven't built up a credit history. A percentage of these people are creditworthy, but without a history to go on, the credit bureaus haven't created profiles of them yet.

To assess whether unscored people can repay loans, lenders are increasingly looking at "alternative data" — information that comes from someplace besides a traditional credit bureau that can help predict how a potential borrower will behave. Examples include bill payments for mobile phones and rent.

Many online lenders use this type of data, and some traditional lenders have been experimenting with it.

But a growing chorus of observers wonders whether the use of alternative data actually helps the disadvantaged or instead allows lenders to flout fair-lending principles and the rules against disparate impact.

When the Consumer Financial Protection Bureau granted a "no action" letter to the online lender Upstart Network in September, it further stirred up the debate around the use of alternative data. (A no-action letter advises recipients that the staff has no present intention to recommend initiation of an enforcement or supervisory action, meaning they can proceed as they are for now.)

"The bureau is exploring ways that alternative data may be used to improve how companies make lending decisions," the agency said in its letter.

In exchange for this promise of no action, Upstart will share certain information with the CFPB regarding the loan applications it receives, how it decides which loans to approve, and how it will mitigate risk to consumers. It also will share information on how its model expands access to credit for traditionally underserved populations.

"Because a machine learning-based model can change every day, typically in small ways, we have built a monitoring system to supervise what the lending system is doing, and that system will report the data to the CFPB on a regular basis," said Dave Girouard, the chief executive of Upstart.

The hope at the CFPB is to use this information to better understand how these types of practices impact access to credit generally and for traditionally underserved populations.

The agency has been studying this since February, when it launched an inquiry into the use of alternative data. Its concern is whether lenders can use alternative data and still comply with the Equal Credit Opportunity Act and Regulation B.

Together, the law and its implementing regulation prohibit creditors from discriminating against a potential borrower on the basis of race, religion, sex, age, color, national origin, marital status or receipt of public assistance.

"Although the general principles reflected in ECOA and Regulation B are clear enough, the expected evolution of Upstart's automated underwriting model and potential changes in the applicant pool over time result in substantial uncertainty concerning the facts to which those principles would be applied and what actions Upstart should take to prevent, mitigate, or remedy potential discrimination that might arise," Upstart wrote in its application.

In the same vein, the rules around disparate impact can be hard to assess in practice.

"It's not as simple as women should be 50% of loan approvals, it's more about how our model approves them on a relative basis, other things held equal," Girouard said. "There are different research camps about what is the right way to assess disparate impact. It's one of the academically debated topics."
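To make the idea of comparing approvals "on a relative basis" concrete, here is a minimal sketch of the simplest such measure: the ratio of one group's approval rate to another's. The data, group labels and function names are invented for illustration, and nothing here represents Upstart's method. The raw ratio also does not hold "other things equal," which is the harder part Girouard describes.

```python
from collections import defaultdict

def approval_rates(decisions):
    """Return the approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        if ok:
            approved[group] += 1
    return {g: approved[g] / totals[g] for g in totals}

def approval_rate_ratio(decisions, group, reference):
    """Approval rate of `group` divided by the approval rate of `reference`."""
    rates = approval_rates(decisions)
    return rates[group] / rates[reference]

# Made-up decisions: 30 of 50 women approved, 40 of 50 men approved.
decisions = (
    [("women", True)] * 30 + [("women", False)] * 20
    + [("men", True)] * 40 + [("men", False)] * 10
)

ratio = approval_rate_ratio(decisions, "women", "men")
print(round(ratio, 2))  # 0.75
```

A ratio well below 1 only signals that approval rates differ. Whether that difference reflects disparate impact once credit-relevant factors are accounted for is the question the research camps disagree on.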

'Safe' alternative data

Banks already use some alternative data, including employment and payment histories, in their loan decisions.

The alternative data credit bureau eCredable scores consumers, at their request, by getting data from landlords, power companies, day care centers, phone companies and such. (About 80% of this data gathering is automated, and 20% is done through phone calls.)

"Our hypothesis was that of the 45 million or so so-called 'credit invisibles' in the U.S. who are not scorable, probably a third are creditworthy but they just don't have a score to prove it," said Steve Ely, eCredable's chief executive. "We're trying to go after that third near the top that have a history of paying bills on time. If we can get to that history and get it into our scoring model, we can score them and present them to a lender."

The scores are intended to be a proxy for the traditional FICO credit score.

"We didn't invent a new credit score," Ely said. "We don't go to a bank and say we have this really wild and crazy innovative new credit score we want you to use to lend with. Because that's a very short conversation."

BBVA Compass uses eCredable's scores to underwrite an unsecured credit card. The average eCredable score in that credit card portfolio is 700.

"As a lender, if you can find someone who has no credit score but acts like a 700, you can charge them like a 640," Ely said. "There's a lot of margin in that portfolio."

Professional data also appears to be fair game. Upstart, for one, considers college major and field of employment in its models.

"Even in times of recession or unemployment, nurses are the types of individuals that in all statistical likelihood will be steadily employed," said Girouard. "Teachers also tend to be steadily employed because there's almost always a shortage of teachers." Both professions are also represented in minority and low-income neighborhoods.

The online lender Enova looks at 68 different alternative data sources when it considers potential borrowers with no credit file. These include VantageScore, LexisNexis, telephone companies, and data aggregators (for bank account transaction information).

For small-business loans, accounting data, business checking account data, payment processing data, and social data for businesses that use Facebook are useful, said Kathryn Petralia, co-founder and head of operations at the online lender Kabbage.

"Engagement with their customers is a really strong predictor of performance, because if they're engaged with their customers then you know they're working to run their business," she said.

Google Analytics can give a good view of website traffic, and shipping data is helpful too, she said. "If a nail salon is getting more packages, that probably means they're doing better," Petralia said.

So what about "alternative data" is in question?

College controversy

The most controversial data type is education.

Many people believe that if you feed a credit model information about the college a loan applicant went to, you're likely to run afoul of disparate impact rules, especially if you use artificial intelligence technology that finds its own correlations between factors like school and creditworthiness.

"Contrast a kid who just graduated from Tupelo Junior College to a graduate of Boston College, the outcome will be dramatically different," Ely said. "Not only will the loan approval rates be different, but I suspect the lender's offer will be very different — the kid from Boston College might get an offer for a $50,000 loan, the guy from Tupelo a $500 loan. Then you get into those kinds of disparate impacts."

Petralia said education is usually a proxy for affluence. "Not always — there are some kids from Harvard and Princeton who went there on scholarship and they come from economically disadvantaged backgrounds," she said. "But the preponderance of them didn't."

SoFi and Upstart are among the online lenders that today include college data in their underwriting models.

Girouard said Upstart considers education as just one of many elements of creditworthiness. But he maintained that Upstart's approach is less discriminatory than traditional underwriting models.

"You may intuitively believe that using education would cause disparate impact, but in our system the additional variables we look at reduce the disparate impact that is inherent in lending," he said. "Because FICO scores and income are correlated, they have bias embedded in them inherently. The additional data we use tends to level the playing field more than cause an uneven playing field."

Upstart measures the outcomes of its underwriting models for disparate impact with respect to gender and race.

"The bottom line is, the data demonstrates that we don't have disparate impact in our system," Girouard said.

Sarah Davies, senior vice president for research, analytics and product development at VantageScore, said the joint venture of TransUnion, Experian and Equifax doesn't consider education in its credit scores because of regulatory concerns.

"You have to be sure there's no disparate impact with these pieces of data," she explained. "For that reason, we don't use anything like address information. And in some part, student information is a proxy for address. It immediately creates a red flag for us."

Davies also said it's hard for any company to say the use of a type of data isn't causing disparate impact, because the principle itself is complex.

"We've studied it for 10 years," she said. "There's lots of ways disparate impact can seep into these models, even when you're not using soft data like student information. So it's not as cut and dried that a model does or doesn't have disparate impact, depending on how it's used, the time it's being used, the type of products — you've got all these other overlays that make it a complicated question."

The other problem with education data, Davies said, is the need to provide a clear reason code for a low credit score or a decline on a loan.

"The reason codes are things like, you failed to make your payment on time or your utilization is too high, and all those things have to be related to risk," Davies said. "It's almost impossible to create a reason-code statement that says one person got a great score because they went to Harvard and another got a poor score because they went to Iowa State."
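One way to see why reason codes fit risk-related factors more naturally is a points-based scorecard, where each reason is simply the factor that cost the applicant the most points. The sketch below is a hypothetical illustration: the factors, point values and reason wording are assumptions, not VantageScore's model.

```python
# Hypothetical scorecard: factor -> (maximum points, points function, reason text).
SCORECARD = {
    "on_time_payment_rate": (
        100, lambda v: round(100 * v), "Payments were not made on time"),
    "utilization": (
        100, lambda v: round(100 * (1 - v)), "Credit utilization is too high"),
    "years_of_history": (
        50, lambda v: min(50, 5 * v), "Length of payment history is too short"),
}

def reason_codes(applicant, scorecard=SCORECARD, top_n=2):
    """Return reason texts for the factors that lost the most points."""
    shortfalls = []
    for factor, (max_points, points_fn, reason) in scorecard.items():
        lost = max_points - points_fn(applicant[factor])
        if lost > 0:
            shortfalls.append((lost, reason))
    # Largest point loss first.
    shortfalls.sort(key=lambda pair: pair[0], reverse=True)
    return [reason for _, reason in shortfalls[:top_n]]

applicant = {"on_time_payment_rate": 0.9, "utilization": 0.85, "years_of_history": 10}
print(reason_codes(applicant))
# ['Credit utilization is too high', 'Payments were not made on time']
```

Each reason here names something the applicant did and could change. A factor such as the school attended has no comparable statement, which is the difficulty Davies describes.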

Social media data

The use of social media data is also debated. Here again, the reason codes required by the Fair Credit Reporting Act are an issue.

"If one of the reason codes is, you visited this website, can you imagine having to deal with that?" Ely said. "I know enough about how regulators think that this isn't going to be in our underwriting models for a long time."

Upstart does not use social media data currently.

"We're not into fanciful things like what you put on your Facebook page or who your friends are," Girouard said. "A lot of companies have done a disservice to alternative underwriting with fanciful ideas that are not grounded in anything."

Some lenders use social media to avoid fraud — for instance, if an applicant has no presence on social media and the associated email account was created a week ago, that could be an indicator of a synthetic identity.
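The fraud check described above can be pictured as a simple rule. This is only a sketch of that one example. The function name, the inputs and the 30-day cutoff are assumptions made for illustration, not any lender's actual rule.

```python
from datetime import date

def synthetic_identity_flag(has_social_presence, email_created, today,
                            min_email_age_days=30):
    """Flag an application when there is no social media presence AND
    the email account is newer than the (assumed) minimum age."""
    email_age_days = (today - email_created).days
    return (not has_social_presence) and email_age_days < min_email_age_days

# An applicant with no social presence and an email account created a week ago.
print(synthetic_identity_flag(False, date(2017, 10, 24), date(2017, 10, 31)))  # True
```

In practice a flag like this would be one signal among many passed to a fraud review, not a decision on its own.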

In small-business lending, social media data has more practical use. It can reflect engagement with customers, as Petralia noted. It could also be used to spot signs of trouble. For instance, the lending platform provider Credibly takes in Yelp data to be alerted if a restaurant has had a management change or is closed. Those red flags get passed on to human underwriters.

Phone-use data

MyBucks, an online lender that does a lot of work in Africa, has built an AI system that can take in any data to do credit scoring.

In Kenya, the company currently gathers data from Android smartphones with the customer's consent, including potential borrowers' calling patterns, call durations, cell phone bill payment history, geolocation and all payments made from the phone.

"It turns out that in small countries where mobile money is widely used, data from smartphones is a great source," said Richard van der Wath, chief data officer at MyBucks.

Phone-use data might not ever be acceptable in the United States.

"The inappropriate data types are the ones that are generally used as proxies for things like age, gender or race," said Joao Menano, chief financial officer of the online lending platform provider James. "One that is particularly concerning is the use of your mobile data information, like SMS, WhatsApp messages and Facebook posts. It becomes quite easy to combine different variables that in practice are a proxy to race, for instance. I'm not saying that one should not use that data if it has predictive value; what I'm saying is that in those cases one should have extra caution to ensure fair lending practices."

Girouard said Upstart would not consider phone use in lending decisions unless it saw data proving a link to creditworthiness.

However, he noted that in Africa there are no credit bureaus and the phone is the only means of collecting data. "In the case of somebody who's lending in Africa, I wouldn't pass judgment on that, other than to say having some data and making credit available is valuable," he said.

Davies has trouble envisioning such data being used in the United States.

"How would I say to a consumer, you made 10 phone calls to your mother, therefore you're arguably lower risk because you're a better child?" she said. "How is this data indicating direct risk to a loan that's being made?"

Public records

In July, credit bureaus were forced to drop information about public records, specifically civil judgments and tax liens, from credit scores. This information was often incorrect.

According to LexisNexis Risk Solutions, a provider of public record data, part of the challenge was that it's hard to accurately align public record information with credit files. Lenders can still buy the public record data directly from the company.

"Certainly public-record information is valuable," Davies said. "It's indicative of payment behaviors and propensity to pay."

However, VantageScore ran a study with a credit scoring model from which it removed all public record information and added in other attributes such as very high balances on credit cards.

"That information was as predictive if not more predictive than the public record information," Davies said. "These public records were incurred several years ago, whereas if consumers run up high balances, there's a potential that they've gotten themselves into a more risky situation."

Such data is already included in some credit models, she said.

Facial recognition

MyBucks in Africa has been testing algorithms that try to determine if someone is lying based on their facial expressions. It hasn’t rolled this out yet in production; it’s still at the idea stage.

“We’ve discussed the possibility of using that for credit risks,” said van der Wath. “We’re doing the experiments on facial data. We’ll see, maybe it will correlate. It’s an interesting field. When clients are financially excluded, there’s no credit bureau data, what alternative data can we use?”

MyBucks is also testing the use of facial recognition to find fake IDs and authenticate customers. It would combine facial recognition with behavioral biometrics — for instance, how quickly someone completes their registration. (If a person is impersonating someone else, they’re likely to copy and paste from a list, which has a different cadence than typing.)
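The copy-and-paste cadence mentioned above could be detected, in the simplest case, by watching how many characters arrive in a single input event. The following sketch assumes the form records the field's length after each input event. That data format, the function name and the threshold are hypothetical, and nothing here describes MyBucks' system.

```python
def looks_pasted(field_lengths, max_chars_per_event=3):
    """Return True if any single input event added more characters
    than ordinary typing plausibly would (assumed threshold)."""
    previous = 0
    for length in field_lengths:
        if length - previous > max_chars_per_event:
            return True
        previous = length
    return False

typed = [1, 2, 3, 4, 5]   # one character per keystroke
pasted = [14]             # whole value arrives in one event
print(looks_pasted(typed), looks_pasted(pasted))  # False True
```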

Girouard can’t see ever using facial data in lending decisions. “At some point you need to be able to have verifiable training data,” he said. “We never make assumptions about what’s predictive and what’s not. We rely on what the data tells us.”

Likewise, Davies does not envision the use of this data anytime soon, yet she sees the potential. "That’s my own shortsightedness and lack of imagination," she said. "Let’s talk 10 years from now. You can see how being able to trigger facial expression that says this person has no intention of paying the debt even though they say they are would be very powerful. It’s probably too far away from the reality of our data platform today."

Penny Crosman

Executive Editor, Technology at Arizent, American Banker
