Conventional data masking tools can't process or analyze unstructured data because there's no predefined data model. Modern entity-based data masking can.
Data Masking = Data Anonymization
Data masking (aka data anonymization) protects sensitive data by creating a version of it that cannot be traced back to an individual or reverse-engineered. Data masking tools modify sensitive information so that it becomes worthless to unauthorized users but remains usable by software applications. Effective data masking software ensures data consistency and usability across multiple databases, reporting, and analytics platforms.
One of the biggest challenges and priorities for companies is finding a way to identify, report on, and mask sensitive data within unstructured data files, or “dark data.” Because Personally Identifiable Information (PII) and other sensitive data often reside in these files, it’s critical to apply data masking to unstructured data as well.
In this article, we will cover the most important information for you to know about unstructured data masking, including how a business entity approach enables dynamic data masking while maintaining relational integrity.
Structured Data vs Unstructured Data
There are two primary types of data residing in organizations: structured and unstructured. Here’s a breakdown of the main differences.
Structured data, which may also be categorized as quantitative data, is highly organized data that fits into a predefined data model. Structured Query Language (SQL) is the standard language for managing structured data in relational databases, and with a SQL database, business users can easily input, search, and manipulate structured data. Structured data can also be readily consumed by Machine Learning (ML) algorithms.
Common structured data use cases include:
CRM analytics tools that reveal meaningful customer trends and behaviors
Online booking platforms that record predefined reservation data, such as dates, prices, and destinations
Accounting software that processes and records financial transactions
Unstructured data, which is often categorized as qualitative data, cannot be processed or analyzed by conventional data tools. Since unstructured data doesn’t adhere to a predefined data model, it must be managed either in a non-relational (NoSQL) database or in a data lake, which preserves it in raw form.
It’s estimated that by 2025, up to 90% of all enterprise data will be unstructured. Sensitive unstructured data can be found within images, PDF contracts and agreements, driver’s licenses, XML documents, chats, and more. It is often stored on file shares, in content management systems, and as BLOBs or CLOBs within databases.
Common use cases for unstructured data include:
Unstructured data mining, to monitor consumer behavior and purchasing patterns
Predictive data analytics, to better anticipate and adapt to shifts in the market
Chatbots, to analyze text and route customer questions to the most relevant sources
The Importance of Unstructured Data Masking
With so much sensitive information floating around unstructured data files, effectively managing this data is crucial to both security and compliance efforts.
Unstructured data masking is integral to managing the following data governance needs:
Complying with data privacy regulations is becoming more complex, while the penalties for noncompliance are becoming more severe. Highly regulated industries, such as financial services and healthcare, are already compelled to comply with regulations such as the Payment Card Industry Data Security Standard (PCI DSS) and the Health Insurance Portability and Accountability Act (HIPAA). Since the EU’s General Data Protection Regulation (GDPR) took effect in 2018, similar data protection laws have been emerging rapidly around the world. Unstructured data masking supports compliance while still enabling operational and analytical workloads.
In addition to the persistent threat of external hackers, insider threats also pose a significant risk. Insider breaches often occur because employees and third-party contractors have broad access to enterprise systems. Pre-production environments are especially vulnerable, because sensitive production data is often copied into development and testing systems. Once PII, financial, medical, or any other type of sensitive information has been anonymized, it’s no longer a liability if accidentally exposed by business users, forgotten in testing environments, or hacked from the outside by malicious actors.
Effective data governance tools are fundamental to maintaining data consistency and integrity across the organization. Controlling access to data is a primary component of data governance, and it is also the core purpose of masking. While static data masking produces a permanently masked copy of a dataset, dynamic data masking masks data at the moment it is accessed, which enables more granular control and protects data in transit. Only authorized users with appropriate permissions can access unmasked data.
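To make the idea of dynamic masking concrete, here is a minimal Python sketch in which the stored record is never changed and masking is applied at read time according to the caller's role. The field names, the role name, and the "keep the last four characters" rule are all assumptions chosen for illustration, not a description of any particular product.

```python
# Hypothetical sensitive fields and authorized roles, for illustration only.
SENSITIVE_FIELDS = {"ssn", "card_number"}
AUTHORIZED_ROLES = {"compliance_officer"}


def mask_value(value: str) -> str:
    """Hide every character except the last four."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


def read_record(record: dict, role: str) -> dict:
    """Return a view of the record, masked at read time unless the role is authorized."""
    if role in AUTHORIZED_ROLES:
        return dict(record)  # authorized users see the unmasked values
    return {
        key: mask_value(value) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


record = {"name": "Ann Lee", "ssn": "123-45-6789"}
tester_view = read_record(record, role="tester")
officer_view = read_record(record, role="compliance_officer")
```

In this sketch the tester sees a masked SSN while the compliance officer sees the original, and the underlying record is the same in both cases. A static approach would instead write a masked copy once and hand that copy to everyone.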
Structured and unstructured data masking allows organizations to identify, monitor, and protect sensitive data, while maintaining data consistency, integrity, and usability across the enterprise.
Unstructured Data Masking via Business Entities
Unlike conventional data masking solutions, entity-based data masking technology delivers all of the data related to a specific customer, order, payment, or device to authorized data consumers, in real time and at massive scale. The data for each instance of a business entity is stored and managed in its own individually encrypted Micro-Database™, which limits the exposure from any single compromise and is designed to prevent a mass data breach.
Entity-based data masking obscures unstructured data on the fly, while maintaining relational integrity throughout the organization. Images, PDFs, text files, and more are protected. For example, the platform replaces real photo IDs with fake ones, locates and obfuscates PII and sensitive data, and can synthetically generate digital versions of receipts, checks, contracts, and other sensitive data files, for testing and analytics purposes.
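As a rough illustration of locating and obfuscating PII in free text, the following Python sketch replaces matches of two simple patterns with placeholder labels. The patterns and labels are assumptions for demonstration only. Real-world discovery of sensitive data in documents and images typically needs much broader techniques (such as OCR and named-entity recognition), and this is not how any specific platform works.

```python
import re

# Illustrative patterns only; they cover a tiny fraction of real PII formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each detected PII value with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


chat_line = "Contact jane.doe@example.com, SSN 123-45-6789."
redacted = redact(chat_line)
```

The surrounding sentence structure is preserved, so the redacted text can still be used for testing or analytics that do not depend on the hidden values.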
A business entity approach to data masking ensures referential consistency across both structured and unstructured data.
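One common way to achieve this kind of consistency is deterministic pseudonymization, where the same input always maps to the same masked value. The Python sketch below uses a keyed hash (HMAC) for this. It illustrates the general technique under stated assumptions, not the specific mechanism of any entity-based product, and the key, the `CUST` prefix, and the `customer_id` field are hypothetical.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would load a managed secret.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonym(value: str, prefix: str = "CUST") -> str:
    """Map a value to a stable pseudonym: the same input always yields the same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"


def mask_row(row: dict) -> dict:
    """Mask the customer_id column of a structured record."""
    return {**row, "customer_id": pseudonym(row["customer_id"])}


def mask_text(text: str, customer_id: str) -> str:
    """Mask occurrences of a known customer ID inside unstructured text."""
    return text.replace(customer_id, pseudonym(customer_id))


row = {"customer_id": "C1001", "amount": 20}
note = "Customer C1001 called about a refund."
masked_row = mask_row(row)
masked_note = mask_text(note, "C1001")
```

Because both functions derive the pseudonym from the same key and input, the masked ID in the database row matches the masked ID in the free-text note, so the two can still be joined after masking.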
When a single platform automatically discovers and masks sensitive data across all systems, data governance, security, and compliance are far more manageable.