- Key Insight: Strict data lineage is now central to bank generative AI strategies.
- What's at Stake: Operational, compliance and reputational risks could translate into lawsuits and financial losses.
- Forward Look: Expect tighter governance and integrations between lineage platforms and LLM providers.
Source: Bullets generated by AI with editorial review
As more banks deploy generative AI, they're paying closer attention to the data that feeds those models, to make sure it's accurate, relevant and drawn from a trusted source. This calls for data lineage: maintaining a detailed record of the life cycle of data that shows its entire journey from original source to final destination, including any changes made along the way.
"Data and AI come very tightly coupled, because it's quite hard often for AI deployment to be successful without the trusted data that you need for it to be successful," Andrew Foster, chief data officer at
Without proper data lineage and data governance inside a company, much can go wrong with generative AI. An
This turned out not to be so, and the airline refused to give him the discount. A civil resolution tribunal decided that Air Canada was responsible for all the information on its website, whether it came from a static page or a bot, and it ordered the airline to give the customer the discount and pay fees.
For banks, failing at or overlooking data lineage carries compliance, operational and reputational risks, according to John Ratzan, senior managing director at Accenture.
"The worst that can happen is that it could lead to lawsuits, diminished brand reputation and a negative impact on company financials," Ratzan told American Banker.
M&T's gen AI journey
Like other banks, when large language models first came out,
Foster's attitude softened a couple of years ago, and he started vetting large language model providers, looking for "a stable, strong partner that is used to delivering complex technology into other complex institutions."
He chose Microsoft Copilot. Today, 16,000 of the bank's 22,000 employees use the gen AI model for first drafts of emails and reports, and to summarize call center conversations.
"For anything involving capturing and using and interrogating text, it's a starting point," Foster said. Generative AI can also interrogate SQL databases, he noted.
In most such use cases, "gen AI gets you 60% of the way, then a human reviews it and takes it the other 40%," Foster said.
The benefit is an "uplift in human efficiency, which is obviously useful," Foster said. "It makes everyone's work better, faster, stronger." Having generative AI summarize calls, for instance, saves about six minutes per call.
Employees quickly grow fond of the tools, according to Foster. At one point,
But he also noted one challenge of large language models: the problem of having multiple right answers.
"If you ask Copilot, help me craft an email or help me craft a press release, you could get three different versions, and each of them is right for its own version of rightness," he said. "So we've put human decision-making, critical thinking, at the center of AI adoption. You're not deferring your own judgment to the machine through the adoption of Copilot. It's giving you more tools to be effective, but the human being retains that accountability."
Building data lineage
When Foster arrived at
"This wasn't in response to gen AI," Foster said. "I saw it as a core capability: Do we know where our data comes from and how we use it, how do we bring it to a level where we can interrogate it, how all the data goes from point A to point B?"
His team created a repository called Edison that contains authoritative documents and data on all bank policies.
The bank deployed data lineage software from Solidatus and from Monte Carlo. The Solidatus software speeds up the production of data lineage, Foster said. It also provides a single repository for the bank's data, which enables interrogation and analysis that would not have been possible before. It's helping to make
Solidatus integrates with databases and applications, and it retrieves metadata and lineage from within them, explained Tina Chace, vice president of product at Solidatus.
"When we read an Oracle database, we look at things like the schema, the tables, the structure, but in order to generate the lineage, we also look at the stored procedures within a database," Chace told American Banker. "We have a tool that reads the stored procedures and then understands how the data flows throughout the database."
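The idea of deriving lineage from stored procedures can be illustrated with a small Python sketch. It is not Solidatus's tool or method: the SQL, the table names and the regular expressions are all hypothetical, and a production parser would need a real SQL grammar rather than pattern matching. The sketch reads each INSERT statement, records which tables feed which target, and then walks those links backward.

```python
import re

# Hypothetical stored-procedure body; every table name here is made up.
PROCEDURE_SQL = """
INSERT INTO staging.customers SELECT * FROM raw.crm_customers;
INSERT INTO mart.customer_report
SELECT c.id, a.balance
FROM staging.customers c
JOIN raw.accounts a ON a.customer_id = c.id;
"""

# Simplified patterns (an assumption): real SQL needs a proper parser.
INSERT_RE = re.compile(r"INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return sorted (source_table, target_table) edges, one statement at a time."""
    edges = set()
    for statement in sql.split(";"):
        target = INSERT_RE.search(statement)
        if not target:
            continue  # skip anything that does not write to a table
        for source in SOURCE_RE.findall(statement):
            edges.add((source, target.group(1)))
    return sorted(edges)


def upstream(edges, table):
    """Walk the edges backward to find every table that feeds `table`."""
    found, stack = set(), [table]
    while stack:
        current = stack.pop()
        for source, target in edges:
            if target == current and source not in found:
                found.add(source)
                stack.append(source)
    return sorted(found)


EDGES = extract_lineage(PROCEDURE_SQL)
for source, target in EDGES:
    print(f"{source} -> {target}")
print("Feeds mart.customer_report:", upstream(EDGES, "mart.customer_report"))
```

Storing lineage as simple source-to-target pairs keeps the backward walk short, which is the kind of question ("where did this report's data come from?") that lineage is meant to answer.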
Solidatus also works with business intelligence tools like Microsoft Power BI and Tableau. "When we look at Power BI or Tableau, we look at the data models, the physical data sources and the logical data sources and reports that are captured within that business intelligence tool, and we're able to pull that in," Chace said.
The most challenging technology for the data lineage software to work with is mainframe applications, she said. "We have an integration that sits near the mainframe applications and pulls out languages like COBOL and interprets them to be able to capture the truthful representation of how data flows within that technology." Solidatus can import semi-structured information as well, such as CSV, XML and JSON files.
"We have transparency over lineage, quality and governance,"
Foundational large language models like Microsoft Copilot, Google Gemini, OpenAI's ChatGPT and Anthropic's Claude have been trained on vast amounts of Internet data, which has raised questions about data lineage and potential copyright violations for those companies. But those issues are not relevant to
The bank uses a process known as retrieval-augmented generation to limit the data the generative AI models draw on when answering to internal, governed data.
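Retrieval-augmented generation leaves a model's training untouched and instead passes retrieved passages to it inside the prompt. A minimal Python sketch of that pattern, restricted to governed sources, might look like the following. It is not the bank's implementation: the documents, the `governed` flag and the word-overlap scoring are illustrative assumptions, and the call to an actual model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    source: str     # system the text came from
    governed: bool  # True only if data governance has approved it
    text: str


# Hypothetical corpus; the contents are invented for illustration.
CORPUS = [
    Document("pol-001", "policy-repository", True,
             "Wire transfers above the daily limit require manager approval."),
    Document("pol-002", "policy-repository", True,
             "Customers may dispute a card charge within 60 days."),
    Document("web-123", "public-web-scrape", False,
             "Wire transfers never require approval."),
]


def tokenize(text):
    """Lowercase words with trailing punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, corpus, top_k=2):
    """Rank governed documents by word overlap with the question."""
    question_words = tokenize(question)
    scored = []
    for doc in corpus:
        if not doc.governed:
            continue  # ungoverned text never reaches the prompt
        score = len(question_words & tokenize(doc.text))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(question, docs):
    """Attach each passage's ID and source so the answer stays traceable."""
    context = "\n".join(f"[{d.doc_id} | {d.source}] {d.text}" for d in docs)
    return f"Answer using only the context below.\n{context}\nQuestion: {question}"


QUESTION = "Do wire transfers require approval?"
RETRIEVED = retrieve(QUESTION, CORPUS)
PROMPT = build_prompt(QUESTION, RETRIEVED)
print(PROMPT)
```

Tagging each passage with its document ID and source in the prompt is one simple way to keep a lineage trail from the answer back to governed data.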
Foster would like to see Solidatus work with Microsoft, so that the data lineage that starts in business units persists through Copilot. A Solidatus representative said the company does have an API that could be used for this purpose. It's also developing a model context protocol server that would make such integrations easier.
"We fully expect more value will come from the future integration," Foster said.
Foster acknowledges there are risks to having bank employees increasingly rely on generative AI.
"You need to embrace adoption because you become more efficient. If you don't, you fall behind peers who are using it," he said. "But you need to have responsible usage and retain accountability."
All these efforts are in line with Ratzan's best-practice recommendations for clients.
He said there are technical approaches to enhance data lineage, such as automating metadata capture and traceability of data going into and emerging out of gen AI models. Clear policies and training to reinforce those policies are very important, he said.
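As a rough illustration of automated metadata capture around a model call, the Python sketch below records which sources went into a prompt and fingerprints what went in and what came out. The function names, the log format and the stand-in model are all assumptions made for the example; a real deployment would write to a lineage platform rather than an in-memory list.

```python
import hashlib
from datetime import datetime, timezone

LINEAGE_LOG = []  # stand-in for a real lineage store (an assumption)


def fingerprint(text):
    """Short SHA-256 digest so records can be matched without storing the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def fake_model(prompt):
    """Placeholder for a real gen AI call; it only reports the prompt's word count."""
    return f"Summary of {len(prompt.split())} words"


def traced_generate(prompt, source_ids, model=fake_model):
    """Call the model and log metadata about its inputs and output."""
    output = model(prompt)
    LINEAGE_LOG.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model.__name__,
        "source_ids": list(source_ids),  # where the input data came from
        "input_hash": fingerprint(prompt),
        "output_hash": fingerprint(output),
    })
    return output


PROMPT = "Customer called about a late fee."
RESULT = traced_generate(PROMPT, ["call-transcript-42"])
print(RESULT)
print(LINEAGE_LOG[-1])
```

Because the wrapper logs on every call, traceability does not depend on individual employees remembering to document what they fed the model.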
"Data governance broadly is the key for data lineage," he said. "Clear ownership of the data and accountability for who manages the data, the source and the consumption are paramount."