BankThink

Synthetic data can be a real solution for analysis and security

Published December 22, 2020, 12:01 a.m. EST

Updated June 01, 2021, 1:21 p.m. EDT


In the midst of the COVID-19 pandemic, we are constantly being bombarded with graphs and tables showing the impacts of coronavirus to date and predicting the levels of infections, hospitalizations and deaths over the coming months.


These charts and graphs reported in the news are based on a mix of actual data reported by hospitals and governments and synthetic data created from that reported data. Epidemiologists and other data analysts extrapolate from the actual data, using analysis and assumptions, to produce a synthetic data set that is used to build the projections we are all so eagerly following to find out when we can start returning to our normal lives.

Data intelligence is critical not only to managing the health care impacts of COVID-19, but just as importantly, for informing businesses navigating disaster planning, business continuity, product development and client servicing during this crisis.

Businesses need broad access to both internal and external data to understand the financial risks to the business as well as the health risks to their employees and customers. That understanding lets them make prudent and practical decisions to survive the (hopefully) short-term revenue losses and to plan for the changes in business practices that will likely be with us for a long time to come. Access to comprehensive data is more critical than ever, and ensuring that the data is secure has never been more important than at this time, when our vulnerability (in terms of our health, our finances and our overall safety) is the highest it has ever been.

Synthetic data is an artificial data set that mimics the original data but removes the personal or other sensitive information that may be included in it. Raw data is run through special algorithms and generators to create new data sets that cannot be traced back to the original consumer or transaction. This “fake” data set nonetheless retains the accuracy and statistical properties of the original data set, making it ideal for creating a baseline for future studies or testing, modeling business opportunities, projecting trends and more.
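To make the idea concrete, here is a minimal sketch in Python of how a generator might work. It is an illustration only, not the method of any vendor named in this article: the records, field names and the `synthesize` function are all hypothetical, and fitting a simple normal distribution per spending category is an assumption chosen for brevity. Production generators are far more sophisticated, and a toy approach like this one does not by itself guarantee privacy.

```python
import random
import statistics

# Hypothetical raw records: names and amounts are invented for illustration.
raw = [
    {"name": "Alice Smith", "category": "grocery", "amount": 52.10},
    {"name": "Bob Jones", "category": "grocery", "amount": 47.80},
    {"name": "Carol White", "category": "grocery", "amount": 61.25},
    {"name": "Dan Brown", "category": "dining", "amount": 23.40},
    {"name": "Eve Black", "category": "dining", "amount": 31.75},
    {"name": "Frank Green", "category": "dining", "amount": 27.90},
]

def synthesize(records, n, seed=0):
    """Draw n synthetic records that follow the per-category spending
    statistics of the raw data but omit the identifying name field."""
    rng = random.Random(seed)

    # Group raw amounts by category so each category can be modeled separately.
    by_category = {}
    for record in records:
        by_category.setdefault(record["category"], []).append(record["amount"])

    categories = list(by_category)
    weights = [len(by_category[c]) for c in categories]

    synthetic = []
    for i in range(n):
        # Pick a category in proportion to how often it appears in the raw data.
        category = rng.choices(categories, weights=weights)[0]
        mean = statistics.mean(by_category[category])
        spread = statistics.stdev(by_category[category])
        # Sample a new amount from the fitted distribution (assumed normal),
        # clipped at zero so no spending amount is negative.
        amount = max(0.0, rng.gauss(mean, spread))
        synthetic.append(
            {"id": f"synthetic-{i}", "category": category, "amount": round(amount, 2)}
        )
    return synthetic

synthetic = synthesize(raw, n=1000)
```

The synthetic records carry no customer names, yet the average amount in each category stays close to the average in the raw data, which is the property that makes such a set usable for analysis.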

For researchers and scientists tracking the COVID-19 crisis and working to develop treatments and vaccines, synthetic data can be used to aid in the creation of a much larger baseline for testing and clinical trials. For business or product owners, synthetic data can be generated on a one-to-one basis so that the final synthetic data set matches the original set field-for-field, but without the privacy risks. This “new” data set can then be safely used for performance analysis, benchmarking, forecasting or product development, producing results as valid as those from the original data without exposing personally identifiable information to misuse.

Data collected from hospitals and health departments have been critical inputs into understanding the health impacts of the COVID-19 pandemic, but they only tell part of the story of the changes COVID-19 has created in our lives. As businesses, retailers and restaurants have been reopening, government officials and public health personnel are using reported infections data to track new outbreaks, trace contacts and make plans to manage the evolving situation. Businesses themselves need data to understand when and how to reopen their doors, how consumer needs are changing and how best to confidently and competently serve a client base whose interactions and purchasing behavior are now very different from just a few months ago. And consumer transactional data should not be evaluated separately from pandemic trend data; rather, the two need to be combined so that operational decisions are informed from both health and business perspectives to ensure the economy is being reopened in the safest and most beneficial manner possible.

One example of an organization enabling broad data sharing across industries is SafeGraph, which has formed an interdisciplinary consortium with over 1,000 organizations, including the CDC, major academic research institutions, transaction data providers and government organizations at all levels. The consortium’s mission is to support COVID-19 response efforts by sharing aggregated and anonymized (and some synthesized) data on social distancing, foot traffic and consumer spending at retail establishments. The comprehensive data sets the SafeGraph consortium is collecting from thousands of retail chains and millions of consumers and small businesses provide critical input for government agencies in managing the broader economic recovery, as well as for financial institutions planning for the hoped-for rebound in consumer spending.

Combining health data with economic data allows us to construct models that guide reopening planning by identifying businesses by level of criticality and economic value and balancing this against the level of health risk posed to customers of those types of businesses. One example of such a model is a reopening road map created by Facteus based on consumer spending data from over 1,000 banks and input from 26 health analysis sources. Depending on which quadrant a business falls into and the infection levels in a local area, government officials and business owners can use the road map to make better-informed decisions as they work to reopen their local economies.
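The quadrant idea can be illustrated with a short Python sketch. This is not the Facteus model, whose inputs and scoring are not described here. The `quadrant` function, the 0-to-1 scores, the 0.5 cutoff and the placeholder business types are all hypothetical, chosen only to show how two measures (economic value and health risk) divide businesses into four groups.

```python
def quadrant(economic_value, health_risk, threshold=0.5):
    """Place a business type in one of four quadrants.

    Both scores are assumed to be normalized to the range 0..1, and the
    0.5 threshold is an arbitrary illustrative cutoff.
    """
    value = "high value" if economic_value >= threshold else "low value"
    risk = "high risk" if health_risk >= threshold else "low risk"
    return f"{value}, {risk}"

# Hypothetical (economic_value, health_risk) scores, invented for illustration.
business_types = {
    "business type A": (0.9, 0.2),
    "business type B": (0.8, 0.7),
    "business type C": (0.3, 0.1),
    "business type D": (0.2, 0.9),
}

for name, (value, risk) in business_types.items():
    print(f"{name}: {quadrant(value, risk)}")
```

In practice, a decision maker would also weigh local infection levels alongside the quadrant, which this sketch leaves out.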

Safe, accessible and comprehensive data is critical to getting our economy going again. Synthesizing data allows for broad sharing of the inputs businesses and municipalities need to make decisions, all with much reduced levels of concern by health care professionals, government officials, business owners, compliance officers and PR staff about the risks of personally identifiable information being misused or stolen.

Imagine how useful a synthetic data set would be to a product manager or business owner. With easily accessible, rapidly updatable and statistically valid synthetic data, a product manager could be much more proactive in responding to customer issues, predicting future product trends or generating ideas for new product features based on deeper analysis of product usage and customer feedback, again without the worry that personally identifiable information could be misused or stolen. Synthetic data can also be used to train machine learning models and neural networks, which is critical for areas such as fraud detection and management systems that need mountains of reliable data for testing and strengthening.

Ginger Schmeltzer

Senior analyst

Ginger Schmeltzer is a senior analyst at Aite Group.