BankThink

AI-assisted code creates new kinds of risk for payment infrastructure

Published July 03, 2026, 7:30 a.m. EDT

Key insight: As AI-assisted coding becomes more common and economically viable, the burden on automated testing suites increases.
What's at stake: The limiting factor in payment infrastructure will not be whether teams can produce more code. It will be whether they can prove that sensitive logic still behaves safely across the messy reality of production traffic.
Forward look: Banks and payments processors need to adopt new methods of verifying that changes are safe and stable.

Banks are paying close, and justified, attention to the use of artificial intelligence in customer service, fraud, underwriting and operations. Alongside those, another AI-related risk is quietly taking root in engineering organizations: Software can now be changed faster than existing processes can evaluate the consequences of those changes.

AI-assisted coding lets engineering teams write code much faster, whether they are refactoring old systems, migrating services or generating candidate implementations. The cost of all of this work has dropped drastically. That is very valuable. But in payments infrastructure, the speed of code generation is not the only bottleneck. The next bottleneck is confidence in the correctness of a change. A change can pass ordinary tests and still alter behavior for a narrow but important segment of transactions.

Payment systems are notoriously difficult to test because behavior often results from the complicated interaction of many conditions. A single transaction may be affected by network rules, merchant configuration, card product, transaction type, region, authentication method, rate tables, effective dates, feature flags and more. Each of these may be understandable in isolation, but the system-level behavior that emerges from their combination is much harder to reason about, and thus much harder to test.
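To make the scale of that interaction concrete, here is a rough back-of-the-envelope sketch. Every number in it is a hypothetical assumption chosen for illustration, not a measurement from any real payment system, and real dimensions are rarely fully independent.

```python
from math import prod

# Hypothetical number of distinct values per dimension (illustrative only).
dimension_sizes = {
    "network": 4,
    "merchant_configuration": 50,
    "card_product": 12,
    "transaction_type": 6,
    "region": 30,
    "authentication_method": 5,
    "feature_flag_state": 8,  # e.g. three independent on/off flags
}

# If every dimension could vary independently, this is how many distinct
# combinations a change might need to be checked against.
total_combinations = prod(dimension_sizes.values())

# Assumed size of a manually curated test suite (hypothetical).
curated_test_cases = 500
coverage_fraction = curated_test_cases / total_combinations

print(f"{total_combinations:,} combinations")
print(f"curated suite covers at most {coverage_fraction:.6%} of them")
```

Even under these modest assumptions, a hand-written suite touches only a sliver of the possible combinations, which is why system-level behavior is so hard to cover by enumeration.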

Unit tests can validate individual rules. Integration tests can validate representative flows. While both are necessary, they do not answer the question that matters most before shipping a high-risk change: What would this implementation have done across the actual transaction history already seen? The traditional testing stack falls short of that holistic view. It lacks the comprehensiveness needed to confidently roll out changes to systems that operate at global scale, such as payments and other money-movement systems.

As AI-assisted coding becomes more common and economically viable, the burden on automated testing suites increases. AI-written code is not inherently worse, and in many cases it may be better. The real concern is that production-impacting code changes can now be produced faster, while evaluating their impact on the long tail of nuanced payment behaviors remains slow and expensive. Most test suites are not designed for the sheer throughput of changes that AI-assisted coding can generate. If the evaluation process does not improve, the efficiency gains never reach end users, and the value proposition of faster coding collapses.

The good news is that there are solutions most organizations can readily adopt. One practical answer is replay-based testing. Rather than relying only on the synthetic data typically used in unit or integration tests, new logic can be tested against historical transaction data. The cheapest and fastest way to do so, and one that scales linearly with volume, is an offline data pipeline. Such a replay system has to reconstruct the relevant context for each transaction, then run both the current implementation and the candidate implementation and compare the outputs.
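The replay loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the record fields, the `build_context` helper and both fee functions are hypothetical, and a real pipeline would run as a distributed batch job rather than an in-memory loop.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Diff:
    txn_id: str
    current: Any
    candidate: Any


def replay(
    records: Iterable[dict],
    build_context: Callable[[dict], Optional[dict]],
    current_impl: Callable[[dict], Any],
    candidate_impl: Callable[[dict], Any],
) -> Tuple[List[Diff], List[str]]:
    """Run both implementations over historical records and collect differences."""
    diffs, skipped = [], []
    for record in records:
        context = build_context(record)
        if context is None:
            # State could not be reconstructed; report it rather than guess.
            skipped.append(record["txn_id"])
            continue
        old, new = current_impl(context), candidate_impl(context)
        if old != new:
            diffs.append(Diff(record["txn_id"], old, new))
    return diffs, skipped


# --- Hypothetical example: a fee rule changes for EU commercial cards ---
def build_context(record: dict) -> Optional[dict]:
    required = ("amount_cents", "region", "card_product")
    if any(key not in record for key in required):
        return None
    return {key: record[key] for key in required}


def current_fee(ctx: dict) -> int:
    return ctx["amount_cents"] * 2 // 100  # flat 2%, in integer cents


def candidate_fee(ctx: dict) -> int:
    is_eu_commercial = ctx["region"] == "EU" and ctx["card_product"] == "commercial"
    rate = 3 if is_eu_commercial else 2
    return ctx["amount_cents"] * rate // 100


history = [
    {"txn_id": "t1", "amount_cents": 10_000, "region": "US", "card_product": "consumer"},
    {"txn_id": "t2", "amount_cents": 10_000, "region": "EU", "card_product": "commercial"},
    {"txn_id": "t3", "amount_cents": 5_000, "region": "EU"},  # missing card_product
]

diffs, skipped = replay(history, build_context, current_fee, candidate_fee)
```

Because each record is processed independently, a loop of this shape can be split across workers, which is what makes the offline approach scale with transaction volume. Tracking skipped records separately keeps gaps in reconstructed state visible instead of silently shrinking the sample.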

The result goes beyond a simple pass-fail signal by capturing transaction-level differences between the current and candidate implementations. These differences can be analyzed in depth to see which granular outputs changed, which segments were affected, what the aggregate impact is, which rules produced the difference, and whether the change was expected. For example, in network cost-related systems, the comparison might show whether a new implementation changes network fees for a particular type of merchant, region or card product. For routing or eligibility systems, it might show whether a rule change moved transactions into a different decision path.
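As an illustration of that kind of analysis, the sketch below groups transaction-level differences by segment and totals the fee impact. The field names and sample values are hypothetical assumptions; a real system would choose its own segments and output fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def summarize_by_segment(
    diffs: Iterable[dict], segment_keys: Sequence[str]
) -> Dict[Tuple, dict]:
    """Count changed transactions and net fee impact for each segment."""
    summary = defaultdict(lambda: {"count": 0, "net_delta_cents": 0})
    for diff in diffs:
        segment = tuple(diff[key] for key in segment_keys)
        summary[segment]["count"] += 1
        summary[segment]["net_delta_cents"] += (
            diff["candidate_fee_cents"] - diff["current_fee_cents"]
        )
    return dict(summary)


# Hypothetical transaction-level differences produced by a replay run.
sample_diffs = [
    {"txn_id": "t1", "region": "EU", "card_product": "commercial",
     "current_fee_cents": 200, "candidate_fee_cents": 300},
    {"txn_id": "t2", "region": "EU", "card_product": "commercial",
     "current_fee_cents": 100, "candidate_fee_cents": 150},
    {"txn_id": "t3", "region": "US", "card_product": "consumer",
     "current_fee_cents": 50, "candidate_fee_cents": 45},
]

summary = summarize_by_segment(sample_diffs, ("region", "card_product"))
for segment, stats in sorted(summary.items()):
    print(segment, stats)
```

A summary like this lets a reviewer check the observed impact against the intent of the change. If only EU commercial cards were supposed to move, the US consumer row is an unexpected difference worth investigating before rollout.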

Such testing is particularly useful for systems that affect pricing, billing or merchant reporting. Even a small logic change impacting a tiny percentage of volume can still matter if it touches a large customer, a regulated product, a high-value transaction type, or a sensitive customer-facing outcome. These are exactly the cases that are hard to discover through manually curated testing suites alone.

Replay testing also helps with ordinary configuration changes, not just AI-assisted changes. Many factors drive changes to system configuration. Card networks update rules. Banks change product logic. Reporting or regulatory requirements can change. Each of these changes requires confidence that the new configuration behaves as intended. Historical transaction data provides a realistic baseline, and realistic test cases, for reasoning through such changes.

Another common scenario is replacing old systems with new ones. For a migration of that size, replay testing becomes even more critical to ensuring behavioral stability. Migrations are also the best opportunity to architect and organize codebases with the discipline that replay-based offline data jobs require. Such investments pay off quickly during the migration itself by proving its correctness against historical traffic. Once this infrastructure is in place, it remains useful in many more scenarios over a service's lifecycle.

However, this approach has limitations, as historical data cannot predict all future scenarios. A replay is only useful to the extent it can reconstruct the right state needed to execute the core logic. Thus, not every type of business application is suited for such a testing path. Strong privacy controls, data minimization and governance around production-derived datasets are needed to make this work. Thankfully, most banks and financial institutions have those safeguards in place already.
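One way to picture data minimization for a replay dataset is sketched below: keep only the fields the logic under test needs, and replace the transaction identifier with a keyed hash. The field allow-list, the record layout and the key handling are all hypothetical assumptions, and this is not a complete privacy or compliance control.

```python
import hashlib
import hmac

# Hypothetical allow-list: only the fields the replayed logic actually reads.
REPLAY_FIELDS = ("amount_cents", "region", "card_product", "txn_type")


def minimize(record: dict, secret_key: bytes) -> dict:
    """Return a slimmed-down copy of a record that is safe to replay offline."""
    # A keyed hash lets diffs be traced back by authorized systems that hold
    # the key, without storing the raw identifier in the replay dataset.
    token = hmac.new(
        secret_key, record["txn_id"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    slim = {field: record[field] for field in REPLAY_FIELDS}
    slim["txn_token"] = token
    return slim


# Hypothetical production record containing fields the replay does not need.
raw_record = {
    "txn_id": "txn_0001",
    "card_number": "4111111111111111",  # well-known test number, never replayed
    "cardholder_name": "Example Person",
    "amount_cents": 12_500,
    "region": "EU",
    "card_product": "commercial",
    "txn_type": "purchase",
}

# Placeholder key for the example; a real key would come from a secrets manager.
replay_record = minimize(raw_record, secret_key=b"example-key-not-for-production")
```

An allow-list is used here rather than a block-list so that any new sensitive field added to production records is excluded from the replay dataset by default.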

The broader lesson is an important one, though. As software generation becomes cheaper, behavioral evaluation becomes more valuable. The limiting factor in payment infrastructure will not be whether teams can produce more code. It will be whether they can prove that sensitive logic still behaves safely across the messy reality of production traffic.

Banks should not treat AI-assisted coding as something to avoid. Rather, it is an opportunity to modernize systems, provided enough safeguards are in place. Replay-based testing is one such safeguard. Used carefully, it can help modernize aging infrastructure and reduce engineering bottlenecks. But institutions need controls that match the new speed of change. For payment systems, that means turning historical transaction data into a reusable testing asset.

Vivek Yadav

Engineering manager, Stripe
