Data masking is the process of replacing sensitive data with fake data, while retaining relational integrity and usability of the data, and ensuring that the original data cannot be reverse engineered.
Data Masking: An Imperative for Today’s Enterprises
With the proliferation of personal data – collected by enterprises across all industries – the need for protecting individual privacy is paramount. One way to protect Personally Identifiable Information (PII) is by masking data (e.g., consistently changing names, or including only the last 4 digits of a credit card or Social Security Number).
This reference guide explores today’s data masking techniques, the challenges they pose for enterprises, and a novel approach, based on data products, that addresses these challenges in the most comprehensive manner.
What is Data Masking?
Data masking is a method for protecting personal or sensitive data that creates a version of the data that can’t be identified or reverse-engineered while retaining referential context, relational integrity, and usability.
The most common types of data that need masking are:
- PII: Personally Identifiable Information, in response to privacy regulations, such as GDPR and CPRA
- PCI-DSS: Payment Card Industry Data Security Standard (payment card information)
- PHI: Protected Health Information
- IP: Intellectual Property
Data masking best practices call for its use in non-production environments – such as software development, data science, and testing – that don’t require the original production data.
Simply defined, data masking combines the processes and tools for making sensitive data unrecognizable, yet still functional, for authorized users.
Data Masking vs Encryption and Tokenization
Data anonymization refers to a variety of processes that transform data into another form, in order to secure and protect it. The 3 most common data anonymization methods are data masking, data encryption, and data tokenization. While data masking is irreversible, encryption and tokenization are both reversible in the sense that the original values can be derived from the obscured data. Here’s a brief explanation of the 3 methods:
Data masking tools substitute realistic, but fake, data for the original values, to ensure data privacy. Development, support, data science, business intelligence, testing, and training teams use masked data in order to make use of a dataset without exposing real data to any risk.
There are many techniques for masking data, such as data scrambling, data blinding, or data shuffling, which will be explained in greater detail later on. The process of permanently removing all Personally Identifiable Information (PII) from sensitive data is also known as data sanitization. There is no algorithm to recover the original values of masked data.
While data encryption is very secure, data teams can’t analyze or work with encrypted data. The more complex the encryption algorithm, the safer the data will be from unauthorized access. In a data masking vs encryption comparison, encryption is ideal for storing or transferring sensitive data securely.
Data tokenization, which substitutes a sensitive data element with random characters (tokens), is a reversible process. The tokens can be mapped back to the original data, with the mappings stored in a secure “data vault”.
In a data masking vs tokenization comparison, tokenization supports operations like processing a credit card payment without revealing the credit card number. The real data never leaves the organization, and can’t be seen or decrypted by a third-party processor.
Data tokenization supports compliance with the Payment Card Industry Data Security Standard (PCI DSS).
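The token-to-value mapping described above can be sketched roughly as follows. This is a minimal illustration, assuming an in-memory dictionary stands in for the secure vault; the `TokenVault` class and the `tok_` prefix are hypothetical names, not part of any particular product.

```python
import secrets


class TokenVault:
    """Toy in-memory vault (hypothetical); a real vault would be a hardened, access-controlled store."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information about the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can map a token back to the original.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")
print(token)                    # a random token, different on every run
print(vault.detokenize(token))  # the original value, recovered through the vault
```

The reversibility lives entirely in the vault: without the stored mapping, the token reveals nothing about the original value.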
The fact that data masking is not reversible makes it more secure, and less costly, than tokenization.
Data masking software maintains referential context and relational integrity across systems and databases, which is critical in software testing and data analysis.
In the case of anonymized data, integrity means that the dataset maintains its validity and consistency, despite undergoing data de-identification. For example, a real credit card number can be replaced by any 16-digit value that is validated by the “CheckSum” function. Once anonymized, the same (new) value must be used consistently across all systems.
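As an illustration of a substitute value that still passes validation, the sketch below assumes the checksum in question is the Luhn algorithm, which is the check-digit scheme used for payment card numbers. The function names are illustrative, and the generator ignores card-type prefixes.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the Luhn check digit for a digit string that lacks its final check digit."""
    total = 0
    # Walk right to left; the digit adjacent to the check digit is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """True if the last digit of `number` is the correct Luhn check digit."""
    return number.isdigit() and len(number) > 1 and luhn_check_digit(number[:-1]) == number[-1]


def fake_card_number(rng: random.Random, length: int = 16) -> str:
    """Generate a random digit string of `length` digits that passes the Luhn check."""
    body = "".join(str(rng.randrange(10)) for _ in range(length - 1))
    return body + luhn_check_digit(body)


masked = fake_card_number(random.Random())
print(masked, is_luhn_valid(masked))
```

To keep the masked value consistent across systems, the generated number would also need to be stored or derived deterministically, which this sketch does not attempt.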
In short, there are 2 major differences between data masking and encryption/tokenization:
- Masked data is usable in its anonymized form.
- Once data is masked, the original value can’t be recovered.
Why Data Masking?
Data masking solutions are important to enterprises because they enable them to:
- Achieve and maintain compliance with privacy laws, like CPRA, GDPR and HIPAA, by eliminating the risk of exposing personal or sensitive data.
- Protect data from cyber-attacks, while preserving its usability and consistency.
- Reduce the risk of data sharing, e.g., in the case of cloud migrations, or when integrating with third-party apps.
While data masking tools have been around for decades, they're now needed more than ever to effectively safeguard sensitive data, and to address the following challenges:
Highly regulated industries, like financial services and healthcare, already operate under strict privacy regulations, including the Payment Card Industry Data Security Standard (PCI DSS), and the Health Insurance Portability and Accountability Act (HIPAA). Since the introduction of Europe’s GDPR in 2018, there has been a proliferation of privacy laws across the globe including CPRA in California, LGPD in Brazil, and PDPA in the Philippines and Singapore. Such privacy laws seek to protect Personally Identifiable Information (PII) by restricting access to it whenever possible.
Many employees and third-party contractors access enterprise systems on a regular basis. Non-production systems are particularly vulnerable, because sensitive information is often used in development, testing, and other pre-production environments. With insider threats rising 47% since 2018, according to a Ponemon Institute report, protecting sensitive data costs companies an average of $200,000 per year.
In 2020, personal data was compromised in 58% of the data breaches, states a Verizon report. The study further indicates that in 72% of the cases, the victims were large enterprises. With the vast volume, variety and velocity of enterprise data, it is no wonder that breaches proliferate. Taking measures to protect sensitive data in non-production environments will significantly reduce the risk.
Data masking is commonly used to control data access. While static data masking obscures a single dataset, dynamic data masking provides more granular controls. With dynamic data masking, permissions can be granted or denied at many different levels. Only those with the appropriate access rights can access the real data. Others will see only the parts that they are allowed to see.
Data masking is highly customizable. Data teams can choose which data fields get masked, and how to select and format each substitute value. For example, every Social Security Number (SSN) has the format xxx-xx-xxxx, where “x” is a digit from 0 to 9. Teams can replace the first five digits with the letter x, or replace all nine digits with other random digits, according to their needs.
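The two SSN options mentioned above might look like this in practice. It is a minimal sketch with illustrative function names; the random variant preserves only the format and makes no attempt to produce a number that the issuing authority would consider valid.

```python
import random
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def mask_ssn_partial(ssn: str) -> str:
    """Replace the first five digits with the letter x, keeping the last four."""
    if not SSN_PATTERN.fullmatch(ssn):
        raise ValueError("expected an SSN in the format ddd-dd-dddd")
    return "xxx-xx-" + ssn[-4:]


def mask_ssn_random(ssn: str, rng: random.Random) -> str:
    """Replace every digit with a random digit, leaving the dashes in place."""
    if not SSN_PATTERN.fullmatch(ssn):
        raise ValueError("expected an SSN in the format ddd-dd-dddd")
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in ssn)


print(mask_ssn_partial("123-45-6789"))
print(mask_ssn_random("123-45-6789", random.Random()))
```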
Data Masking Best Practices
Over time, a variety of data masking best practices have been devised to maximize data protection, while minimizing data exposure:
Static data masking
Non-production environments, such as those used for analytics, testing, training, and development purposes, often source data from production systems. In such cases, sensitive data is protected with static data masking, a one-way transformation ensuring that the masking process cannot be undone. When it comes to testing and analytics, repeatability is a key concept because using the same input data delivers the same results. This requires the masked data values to persist, over time, and through multiple extractions.
For software testing, static data masking is usually employed on a copy of a production database as part of a test data management solution. It makes data look real enough to enable software development and testing, without exposing the original data.
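One way to get masked values that persist across extractions is to derive each substitute deterministically from the original value and a secret key. The sketch below is an assumption about how that could be done, not a description of any specific tool; the name list and the `mask_first_name` function are hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of substitute values; a real pool would be far larger.
FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley", "Harper", "Jordan"]


def mask_first_name(real_name: str, secret_key: bytes) -> str:
    """Pick a substitute name that is always the same for the same input and key."""
    normalized = real_name.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)
    return FIRST_NAMES[index]


key = b"example-key-kept-outside-the-test-environment"
first_run = mask_first_name("Rick", key)
second_run = mask_first_name("Rick", key)
print(first_run, second_run, first_run == second_run)
```

Because the same input and key always give the same substitute, repeated extractions produce the same test data. With a small pool, different people can map to the same substitute, and the key has to be protected, since anyone holding it could test guesses against the masked output.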
Dynamic data masking
Dynamic data masking is used to protect, obscure, or block access to, sensitive data. While prevalent in production systems, it is also used when testers or data scientists require real data. Dynamic data masking is performed in real time, in response to a data request. When the data is located in multiple source systems, masking consistency is difficult, especially when dealing with disparate environments, and a wide variety of technologies. Dynamic data masking protects sensitive data on demand.
Dynamic data masking automatically streams data from a production environment, to avoid storing the masked data in a separate database. As a rule, it’s used for role-based security for applications – such as handling customer queries, or processing sensitive data, like health records – and in read-only scenarios, so that the masked data doesn’t get written back to the production system.
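Role-based masking at read time can be sketched as a policy lookup applied to each record as it is returned. The roles, field names, and `POLICY` structure below are hypothetical, chosen only to show the idea that the stored data stays untouched while each role sees a different view.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Masker = Callable[[str], str]


def last_four(value: str) -> str:
    """Show only the last four characters; hide short values entirely."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


def redact(value: str) -> str:
    return "REDACTED"


# Hypothetical policy: role -> field -> masking function. Unlisted fields are shown as stored.
POLICY: Dict[str, Dict[str, Masker]] = {
    "physician": {"ssn": last_four},
    "support_agent": {"ssn": last_four, "diagnosis": redact},
    "analyst": {"name": redact, "ssn": redact},
}


def read_record(record: Record, role: str) -> Record:
    """Return a masked copy of the record for the given role; the stored record is not modified."""
    rules = POLICY.get(role)
    if rules is None:
        # Unknown roles see no clear-text values at all.
        return {field: redact(value) for field, value in record.items()}
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }


stored = {"name": "Rick Smith", "ssn": "123-45-6789", "diagnosis": "asthma"}
for role in ("physician", "support_agent", "analyst", "intern"):
    print(role, read_record(stored, role))
```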
On-the-fly data masking
When analytics or test data is extracted from production systems, staging sites are often used to integrate, cleanse, and transform the data, before masking it. The masked data is then delivered to the analytics or testing environment. This multi-stage process is slow, cumbersome, and risky due to the possible exposure of private data.
On-the-fly data masking is performed on data as it moves from one environment to another, such as from production, to development or test. It’s ideal for enterprises engaging in continuous software development and large-scale data integrations. A subset of the masked data is generally delivered to authorized users upon request, because keeping a backup of all the masked data is inefficient and impractical.
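Masking data while it moves, rather than in a staging area, can be approximated with a generator that transforms each row as it passes through. This is only a sketch: the in-memory list stands in for a production source, and `hide_email` is a placeholder masking function.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]


def mask_in_flight(rows: Iterable[Row], maskers: Dict[str, Callable[[str], str]]) -> Iterator[Row]:
    """Yield masked rows one at a time, so unmasked data is never written to a staging store."""
    for row in rows:
        yield {
            column: maskers[column](value) if column in maskers else value
            for column, value in row.items()
        }


def hide_email(value: str) -> str:
    # Placeholder rule: every address becomes the same reserved, non-routable address.
    return "user@example.invalid"


production_rows = [
    {"id": "1", "email": "rick.smith@example.com"},
    {"id": "2", "email": "sam.jones@example.com"},
]

# Rows are masked as they are consumed by the target environment.
for masked_row in mask_in_flight(production_rows, {"email": hide_email}):
    print(masked_row)
```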
Statistical data masking
Statistical data masking ensures that any masked data retains the same statistical characteristics and patterns as the real-world data it represents – such as the distribution, mean, and standard deviation. When production data is replaced by randomized values, unauthorized users have great difficulty extracting any information of value afterwards.
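A simple way to see this idea is to generate random values and rescale them so that their mean and standard deviation match the original column. The sketch below preserves only those two statistics; it assumes a normal shape for the replacement values, so other properties of the original distribution are not retained.

```python
import random
import statistics
from typing import List


def statistically_masked(values: List[float], rng: random.Random) -> List[float]:
    """Return random values with the same mean and population standard deviation as `values`."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    noise = [rng.gauss(0, 1) for _ in values]
    noise_mean = statistics.fmean(noise)
    noise_stdev = statistics.pstdev(noise)
    # Standardize the noise, then rescale it to the original mean and spread.
    return [mean + stdev * (z - noise_mean) / noise_stdev for z in noise]


salaries = [52000.0, 61000.0, 58000.0, 75000.0, 49000.0]
masked = statistically_masked(salaries, random.Random())
print(round(statistics.fmean(salaries), 2), round(statistics.fmean(masked), 2))
print(round(statistics.pstdev(salaries), 2), round(statistics.pstdev(masked), 2))
```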
Test data masking
Applications, of any kind, require extensive testing before they can be released into production. Test data management tools that provision production data for testing must mask the test data to protect sensitive information. For example, in a legacy modernization program, the modernized software components must undergo continuous testing, making test data masking a key component in the testing process. Masking data with referential context and relational integrity – from production systems, to the test environments – is critical.
Unstructured data masking
When it comes to protecting data privacy, regulations do not differentiate between structured and unstructured data. Scanned documents and image files, such as insurance claims, bank checks, and medical records, contain sensitive data stored as images. Many different formats (e.g., pdf, png, csv, email, and Office docs) are used daily by enterprises in their regular interactions with individuals. With the potential for so much sensitive data to be exposed in unstructured files, the need for unstructured data masking is obvious, particularly in the financial services and healthcare industries.
Data Masking Techniques
There are several techniques associated with data masking, including:
Scrambling randomly orders characters and/or numbers to obscure the original content. For example, when a shipment with tracking number 572918 in a production environment undergoes character scrambling, it might read 125879 in a different environment. Although easy to implement, scrambling can only be used on certain data types, and is not as secure as other techniques.
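Character scrambling amounts to a random permutation of the characters in a value. A minimal sketch, with an illustrative function name:

```python
import random


def scramble(value: str, rng: random.Random) -> str:
    """Return the characters of `value` in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


print(scramble("572918", random.Random()))
```

The output always contains exactly the same characters as the input, which is part of why scrambling is weaker than other techniques. For short values, the result can even equal the original by chance.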
Nullifying applies a null value to a data column so that unauthorized users won’t be able to see the actual data in it. Despite its ease of implementation, nullifying results in data with less integrity and usability, which is often problematic in development and testing environments.
Substitution, which replaces the original data with another value, is one of the most effective data masking techniques because it preserves the original nature of the data. Although difficult to execute, substitution can be applied to several types of data, and is excellent protection against data breaches.
Shuffling is similar to substitution, but it draws the substitute values from the same data column, randomly reordering them across records. For example, when patient name columns are shuffled across multiple patient records, the results look accurate but don’t reveal any personal medical information. However, anyone with access to the shuffling algorithm can reverse-engineer the process.
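Shuffling a column across records can be sketched as below. The record layout and the `shuffle_column` name are illustrative assumptions.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def shuffle_column(records: List[Record], column: str, rng: random.Random) -> List[Record]:
    """Return copies of the records with the values of one column randomly reassigned."""
    values = [record[column] for record in records]
    rng.shuffle(values)
    # All other columns stay with their original record.
    return [{**record, column: value} for record, value in zip(records, values)]


patients = [
    {"name": "Rick Smith", "diagnosis": "asthma"},
    {"name": "Sam Jones", "diagnosis": "diabetes"},
    {"name": "Ann Lee", "diagnosis": "migraine"},
]
for row in shuffle_column(patients, "name", random.Random()):
    print(row)
```

Every real name still appears somewhere in the output, and a known seed makes the permutation reproducible, which is why shuffling on its own offers limited protection.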
Data/number variance is used for masking important financial and transaction date information. For example, applying a variance to the employee salaries column obscures each individual salary while preserving the overall spread between the highest- and lowest-paid employees. Data integrity can be assured by applying a variance of, say, +/- 5% to all salaries in the dataset.
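A variance of +/- 5% can be applied by multiplying each value by a random factor between 0.95 and 1.05. The sketch below assumes a uniform distribution for the factor, which is one possible choice among several.

```python
import random


def apply_variance(amount: float, rng: random.Random, max_pct: float = 5.0) -> float:
    """Shift `amount` by a random percentage within +/- `max_pct`, rounded to cents."""
    factor = 1 + rng.uniform(-max_pct, max_pct) / 100
    return round(amount * factor, 2)


salaries = [52000.00, 61000.00, 75000.00]
rng = random.Random()
print([apply_variance(salary, rng) for salary in salaries])
```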
Date aging increases or decreases a date field based on a pre-defined data masking policy, within a specific date range. For example, decreasing the date of birth field by 1,000 days would transform the date 1-January-2023 to 6-April-2020.
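Date aging is plain date arithmetic, so it can be checked with the standard library. A minimal sketch, where `age_date` is an illustrative name and a negative offset moves the date into the past:

```python
from datetime import date, timedelta


def age_date(original: date, days: int) -> date:
    """Shift a date by a fixed number of days (negative values move it earlier)."""
    return original + timedelta(days=days)


masked_dob = age_date(date(2023, 1, 1), -1000)
print(masked_dob.strftime("%d-%B-%Y"))
```

Applying the same offset to every date in a record keeps the intervals between dates intact, which matters when tests depend on durations.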
Data Masking Challenges
Not only must the altered data retain the basic characteristics of the original data, it must also be transformed enough to eliminate the risk of exposure, while retaining its integrity.
Enterprise IT landscapes typically have many production systems that are deployed on premises and in the cloud, across a wide variety of technologies. To mask data effectively, an organization needs to:
- Discover the sensitive data and PII that require protection.
- Resolve identities to ensure data integrity across systems. For example, if Rick Smith is masked as Sam Jones, that identity must be consistent wherever it is used.
- Comply with company governance policies for role, location, and permissions-based data access.
- Scale for real-time access and mass-batch data extraction.
- Manage the growing volumes of structured and unstructured data.
Data Masking by Business Entity
Entity-based data masking technology allows for the ingestion, organization, masking, and delivery of trusted data from disparate systems by business entity (customer, order, device, or anything else that’s important to the business).
By masking the data for a particular business entity as a singular unit, regardless of the underlying source systems and their technologies, relational integrity of the masked data is maintained – assuring that the masked data for that entity is always consistent and complete.
The PII for a business entity is masked in-flight, according to predefined business rules.
Entity-based data masking supports dynamic data masking for operational use cases, like customer 360, and static data masking for software testing and analytical workloads. In both cases, referential integrity and consistency are ensured.
Benefits of the business entity approach
Using no-code data orchestration tools, data from multiple production systems is integrated, cleansed, and masked on the fly.
A business entity approach to data masking reduces complexity, ensuring that an individual customer's data, which is fragmented across multiple sources, is:
- Consistent, across multiple sources
- Persistent, over time and multiple extractions
- Preserved, with referential integrity and formatting
Dynamic data masking
Dynamic data masking transforms, obscures, or blocks access to sensitive information fields based on any number and combination of parameters, such as user roles, privileges, and geographic location.
A broad set of data masking functions is dynamically invoked to protect the data.
Unstructured data masking
Protect unstructured data including images, PDFs, XML, CSV, text-based files, and more, with static and dynamic masking capabilities. Unstructured data masking lets you:
- Replace sensitive photos with fake alternatives.
- Use Optical Character Recognition (OCR) to detect content and enable intelligent masking.
- Employ synthetic data generation, to create digital versions of receipts, checks, contracts and other items for testing purposes.
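For text-based files, a basic form of unstructured masking is pattern-based redaction. The sketch below covers only two patterns (SSN-formatted numbers and email addresses) and uses deliberately simple regular expressions; images and scanned documents would first need OCR, which is not shown here.

```python
import re

# Illustrative patterns only; real detection rules would be broader and locale-aware.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_text(text: str) -> str:
    """Replace each detected sensitive value with a label naming its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact_text(note))  # Contact [EMAIL], SSN [SSN].
```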
Extensive and extendible masking functions
Entity-based data masking tools come with a comprehensive library of prebuilt masking functions, designed to provide realistic, but fake, data.
The table below highlights a few examples, including masking that creates a valid Social Security Number (SSN), selecting arbitrary names from name directories, as well as generating random numbers, and address-based zip codes. The library can be easily extended by custom Java functions that implement additional masking functions.
|Data type|Masking function|
|---|---|
|SSN/National ID|Generate valid SSN|
|Credit card|Generate valid number based on card type|
|First name/Last name/Zip code|Select from collection based on geographic location|
|DOB|Shuffle (preserve statistical diversity)|
|Any String/number|Random string/number|
| |Concatenation based on new first and last name|
|Constant|Static masking based on a pre-provided value|
|Address|Based on the provided zip code|
|Zip code|Configured by business policy (e.g., 1-mile radius, or citywide)|
Data masking has become a pillar technology that global enterprises use to comply with privacy protection regulations.
Although the practice of masking data has been around for years, the sheer volume of data – structured and unstructured – and the ever-changing regulatory environment, have increased the complexity of data masking at enterprise scale.
Also, different data sources (such as legacy silos, mainframe, SAP, and NoSQL databases) make data masking incredibly complex. Add to that the complexity of each vendor offering their own data masking tools.
The offerings of the current data masking vendors are proving to be insufficient. However, an innovative business entity approach is setting the data masking standard at some of the world’s largest organizations.
K2view entity-based data masking tools allow enterprises to consolidate all their data extraction and masking efforts under one roof, dramatically reducing the complexity and cost (hardware, software, vendor masking licenses, and human resources needed) involved in managing all these data masking efforts separately.
What is data masking?
Data masking protects sensitive information in a database by replacing it with a disguised version of the data. Data masking ensures that any Personally Identifiable Information (PII) remains secure and confidential, but permits the data to be used for development, testing, and other legitimate purposes. By masking data, enterprises assure data privacy and security, reduce the risk of mass data breaches, and comply with privacy regulations such as the GDPR, CCPA, and HIPAA.
Why is data masking important?
Data masking is important to organizations for the following reasons:
- Data security: If masked data is breached, the original information it replaced is kept safe and protected.
- Test data management: Masking data is a safe alternative to using real production data needed for test data management tools.
- CI/CD: Continuous Integration / Continuous Delivery (CI/CD) in DevOps requires clean and usable data on demand. Dynamic data masking allows DevOps teams to quickly provision, use, and test new applications.
- Compliance: As the number and strictness of data privacy regulations grow, data masking and synthetic data generation play a critical role in data-intensive companies.
- Customer 360: By masking data associated with Customer 360 use cases, companies gain access to representative and insightful data to enhance customer experiences, while protecting PII.
- Third-party protection: Enterprises must mask any data that’s processed by, or that integrates with, third-party vendors to preempt any breaches in the supply chain.
What are the different types of data masking?
The different types of data masking include:
- Data anonymization, which removes or encrypts the personal data found in a dataset.
- Data pseudonymization, which substitutes PII, such as a name or Social Security Number, with a fake name or figures.
- Encrypted lookup substitution, which provides a table indicating realistic alternative values to personal data.
- Redaction, which replaces personal data with generic values in testing and development environments.
- Shuffling, which randomly inserts other masked data instead of substituting data with generic values.
- Data aging, which applies policies to each data field to conceal the true date. For example, you can set back the dates by 150 or 1,700 days, to maximize concealment.
- Nulling out, which gives a null value to a data column, making it invisible to unauthorized users.
What’s the difference between data masking and data tokenization?
Data masking best practices call for the substitution of real, sensitive data with fake, yet lifelike, data, in order to maintain its ability to carry out business processes. Masked data can’t be reverse-engineered. Data masking replaces personal or sensitive information with random values, without any way to reveal the original ones.
Data tokenization obfuscates sensitive data by replacing it with a meaningless token, for use in databases or internal systems. The data tokenization process secures data at rest and in motion. If somebody needs the real data value, the token can be “detokenized” back to its original state.
What’s the difference between data masking and synthetic data generation?
Data masking tools protect personal information by replacing real-life, sensitive data with jumbled, yet statistically equivalent, data. Although information that has undergone data masking can’t be reidentified, it remains functional for non-production environments. By using masked, instead of real-life, data, personal information is protected in the case of a breach.
Synthetic data generation does not obscure sensitive data. Instead, it builds artificial, yet lifelike, datasets, enabling development and testing teams to test new software quickly, while virtually eliminating any risks of non-compliance.
What’s the difference between data masking and data encryption?
Data encryption and data masking techniques are two separate approaches to data privacy management.
Data masking replaces real, sensitive data with fake, yet realistic, data. Although masked data can’t be reverse-engineered or identified, it’s still functional for software testing and data science.
Data encryption converts plaintext into incomprehensible ciphertext, using a mathematical algorithm together with a cryptographic key. Those with access to the key can view the original, plaintext data.
Unlike masked data, encrypted data remains vulnerable to data breaches via hacking or social engineering, because anyone who obtains the key can recover the original values.