Information Governance in a Data Analytics Environment

Published November 10, 2016, 2:32 p.m. EST

In heavily regulated industries like banking, much of the interesting data that could feed valuable analytics insights is controlled by local, national, or federated regulations. These regulations and laws have accumulated over time, and they now define and control key information created and used by these industries. Regulated data might be application-generated (like transactional data) or human-generated (like communications or know-your-customer information), so it may be structured or unstructured. This data has two things in common across these industries: it tends to be the most important, most interesting data in the enterprise, and it’s covered by ever more stringent regulations.

Companies must pay very close attention to these laws and requirements to ensure they meet them and can prove that they meet them. If they fail to comply with these regulations – or fail to provide adequate proof of their compliance – they can face fines, penalties, and even government actions that can include shutting down part or all of the company’s business. The government regulations they must follow restrict everything including who can access that data, when and how it needs to be stored, where it must reside (data sovereignty), and when and how, or even if, it can be modified or deleted.

In contrast, typical analytics environments have been designed expressly to simplify and improve access to the data. This design aesthetic has vastly increased the speed and volumes at which analytics can be performed. However, because their fundamental intent is to reach data faster and more easily, these environments were not designed to control data access, data retention, or data segregation – all things necessary for managing a regulated data environment in a compliant way.

So how can organizations manage these two competing priorities to get meaningful insights from heavily regulated data? Recently, companies have tried a number of different partial fixes, including:

Small, siloed data lakes that provide limited insight into carefully portioned data. This can fulfill requirements around access control, data sovereignty, and data segregation, but falls far short of the full potential of analytics that are run across large volumes of data from many different sources.
Archived data duplicated into a separate Hadoop environment. This can meet retention and legal hold requirements by keeping the archive tightly controlled and the analytics environment free of controls, but it results in duplicated data and has raised potential concerns about access control and privacy requirements.
Hadoop distributions that are beginning to include information governance features such as retention and access control. This can meet some requirements but significantly increases management complexity and potentially slows down the analytics environment. It also doesn’t cover jurisdictional issues of where data must reside, and may not meet all data privacy and security requirements.

Organizations require a solution that provides a compliant environment with robust information governance to meet global regulations regarding retention, legal hold, access control, security, privacy, sovereignty, and more.
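
To make these governance dimensions more concrete, here is a minimal Python sketch of how access control, sovereignty, retention, and legal hold might be checked before an analytics job reads or deletes a record. It is illustrative only: the names `Policy`, `Record`, `can_read`, and `can_delete`, the role and region labels, and the seven-year retention period are all hypothetical and are not taken from any regulation or product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Policy:
    """Hypothetical governance policy for one class of regulated data."""
    allowed_roles: frozenset    # access control: who may read the data
    allowed_regions: frozenset  # sovereignty: where the data may reside
    retention_days: int         # retention: minimum time before deletion


@dataclass(frozen=True)
class Record:
    """Hypothetical regulated record with the metadata the checks need."""
    record_id: str
    region: str
    created: date
    legal_hold: bool = False


def can_read(policy: Policy, record: Record, role: str, compute_region: str) -> bool:
    """Allow a read only for a permitted role, on data stored in a
    permitted region, processed in the region where it is stored."""
    return (
        role in policy.allowed_roles
        and record.region in policy.allowed_regions
        and compute_region == record.region
    )


def can_delete(policy: Policy, record: Record, today: date) -> bool:
    """Allow deletion only after the retention period has elapsed and
    only if the record is not under legal hold."""
    if record.legal_hold:
        return False
    return today >= record.created + timedelta(days=policy.retention_days)


# Assumed example values, for illustration only.
policy = Policy(
    allowed_roles=frozenset({"compliance_analyst"}),
    allowed_regions=frozenset({"EU"}),
    retention_days=7 * 365,
)
record = Record(record_id="txn-001", region="EU", created=date(2016, 11, 10))
held = Record(record_id="txn-002", region="EU", created=date(2016, 11, 10), legal_hold=True)
```

In this sketch, an analyst in the permitted role can read `record` from an EU compute cluster but not from a US one, and `held` can never be deleted while the hold flag is set, regardless of its age. A real deployment would have to enforce such rules inside the storage and analytics layers themselves rather than in application code.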