With the proliferation of personal data – collected by enterprises across all industries – the need for protecting individual privacy is paramount. One way to protect privacy is to mask personally identifiable information (PII) data, by consistently changing names, or including only the last 4 digits in a credit card or social security number.
This whitepaper explores today’s data masking techniques, the challenges they pose for enterprises, and a novel approach, based on data products, that addresses these challenges in the most comprehensive manner.
Data masking protects sensitive data by creating a version of the data that can’t be identified or reverse-engineered. It should assure data consistency and usability across multiple databases.
Data masking substitutes real information with random characters.
The most common types of data requiring data masking include:
IP: Intellectual Property
PCI-DSS: Payment Card Industry Data Security Standard (payment card information)
PII: Personally Identifiable Information
PHI: Protected Health Information
As a rule, data masking is used in non-production environments – such as software development, data science, and testing – that don’t require the original production data.
Simply defined, data masking combines the processes and tools for making sensitive data unrecognizable, but usable, by software or authorized personnel.
Data obfuscation refers to a variety of processes that transform data into another form, in order to secure and protect it. The 3 most common data obfuscation methods are data masking, data encryption, and data tokenization. While data masking is irreversible, encryption and tokenization are both reversible in the sense that the original values can be derived from the obscured data. Here’s a brief explanation of the 3 methods:
Data masking substitutes realistic, but fake, data for the original values to ensure data privacy. Development, support, data science, business intelligence, testing, and training teams use masked data in order to make use of a dataset without exposing real data to any risk.
Data masking is implemented via many techniques, such as data scrambling, data blinding, or data shuffling, which will be explained in greater detail later on. The process of permanently removing all Personally Identifiable Information (PII) from sensitive data is also known as data anonymization, or data sanitization. There is no algorithm to recover the original values of masked data.
While data encryption is very secure, data teams can’t analyze or work with encrypted data. The more complex the encryption algorithm, the safer the data will be from unauthorized access. Encryption is ideal for storing or transferring sensitive data securely.
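Data tokenization substitutes sensitive data with random (meaningless) tokens. Unlike encrypted data, a token can’t be reversed mathematically, because it has no algorithmic relationship to the original value. However, the token can be mapped back to the original data, which is stored in a secure “data vault”.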
Tokenized data supports operations like processing a credit card payment without revealing the credit card number. The real data never leaves the organization, and can’t be seen or decrypted by a third-party processor.
Data tokenization supports the Payment Card Industry Data Security Standard (PCI DSS).
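The vault-based mapping described above can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory dictionary stands in for the secure data vault; the `TokenVault` class and its method names are hypothetical and do not describe any specific product.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a secure token vault."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing any existing token."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)
        while token in self._token_to_value:  # regenerate on the rare collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only the vault can do this."""
        return self._token_to_value[token]
```

Because the token is random, it reveals nothing about the card number. Recovering the original requires access to the vault itself, which is why the vault must be protected at least as strongly as the data it replaces.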
Data masking is the most common data obfuscation method. The fact that data masking is not reversible makes this type of data obfuscation more secure and less costly than encryption.
Another big plus is that data masking maintains data integrity across systems and databases, which is critical in software testing and data analysis. Minimizing the use of actual data protects an enterprise from unnecessary risk.
In the case of obfuscated data, integrity means that the dataset maintains its validity and consistency, despite undergoing data anonymization. For example, a real credit card number can be replaced by any 16-digit value that is validated by the “CheckSum” function. Once anonymized by a new value, the same (new) value must be used consistently across all systems.
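The credit card example can be sketched as follows. This is an illustration under two assumptions: that the “CheckSum” function refers to the Luhn check commonly used for payment card numbers, and that a simple dictionary is enough to keep the substitute value consistent. The function names are hypothetical.

```python
import random


def luhn_checksum_ok(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def fake_card_number(rng: random.Random) -> str:
    """Build a random 16-digit number whose final digit satisfies the check."""
    body = "".join(str(rng.randint(0, 9)) for _ in range(15))
    for check_digit in range(10):
        candidate = body + str(check_digit)
        if luhn_checksum_ok(candidate):
            return candidate
    raise AssertionError("unreachable: exactly one check digit is valid")


def mask_card(real_number: str, mapping: dict, rng: random.Random) -> str:
    """Reuse the same substitute for the same real number (consistency)."""
    if real_number not in mapping:
        mapping[real_number] = fake_card_number(rng)
    return mapping[real_number]
```

Sharing the `mapping` across all masking jobs is what keeps the new value consistent from system to system. In practice that mapping would itself need to be secured or replaced by a deterministic scheme.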
In short, there are 2 major differences between data masking and other data obfuscation methods like encryption or tokenization:
Data masking is important to enterprises because it enables them to:
While data masking has been around for decades, it is now needed more than ever to effectively protect sensitive data, and to address the following challenges:
Highly regulated industries, like financial services and healthcare, already operate under strict privacy regulations, including the Payment Card Industry Data Security Standard (PCI DSS), and the Health Insurance Portability and Accountability Act (HIPAA). Since the introduction of Europe’s GDPR in 2018, there has been a proliferation of privacy laws across the globe including CCPA and CPRA in California, LGPD in Brazil, and PDPA in the Philippines and Singapore. Such privacy laws seek to protect Personally Identifiable Information (PII), and restrict access to it whenever possible.
Many employees and third-party contractors access enterprise systems on a regular basis. Production systems are particularly vulnerable, because sensitive information is often used in development, testing, and other pre-production environments. With insider threats rising 47% since 2018, according to the Ponemon Institute report, containing sensitive data costs companies an average of more than $200,000 per year.
In 2020, personal data was compromised in 58% of the data breaches, states a Verizon report. The study further indicates that in 72% of the cases, the victims were large enterprises. With the vast volume, variety and velocity of enterprise data, it is no wonder that breaches proliferate. Taking measures to protect sensitive data in non-production environments will significantly reduce the risk.
Data masking is commonly used to control data access. While static data masking obscures a single dataset, dynamic data masking provides more granular controls. With dynamic data masking, permissions can be granted or denied at many different levels. Only those with the appropriate access rights can access the real data. Others will see only the parts that they have to see.
Data masking is highly customizable. Data teams can choose which data fields get masked, and how to select and format each substitute value. For example, every Social Security Number (SSN) has the format xxx-xx-xxxx, where “x” is a digit from 0 to 9. Data teams can substitute the first five digits with the letter x, or all 9 digits with other random digits, according to their needs.
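The two SSN options mentioned above might look like this in practice. This is a minimal sketch with hypothetical function names; note that the random variant preserves only the format, and does not guarantee that the result is a validly issued SSN.

```python
import random
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def mask_ssn_partial(ssn: str) -> str:
    """Replace the first five digits with 'x', keeping the last four."""
    if not SSN_PATTERN.match(ssn):
        raise ValueError("expected an SSN in the form ddd-dd-dddd")
    return "xxx-xx-" + ssn[-4:]


def mask_ssn_random(ssn: str, rng: random.Random) -> str:
    """Replace all nine digits with random digits, keeping the format."""
    if not SSN_PATTERN.match(ssn):
        raise ValueError("expected an SSN in the form ddd-dd-dddd")
    digits = "".join(str(rng.randint(0, 9)) for _ in range(9))
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
```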
Over time, a variety of data masking techniques have been devised. Selecting the right approach is dependent on the intended data use. The goal is to maximize data protection, while minimizing data exposure.
Static data masking
Non-production environments, such as those used for analytics, testing, training, and development purposes, often source data from production systems. In such cases, private data is protected with static data masking, a one-way transformation ensuring that the masking process cannot be undone. When it comes to testing and analytics, repeatability is a key concept because using the same input data delivers the same results. This requires the masked data values to persist, over time, and through multiple extractions.
Static data masking is usually employed on a copy of a production database. It makes data look real enough to permit accurate development, testing, and training, without exposing the original data.
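One way to obtain masked values that persist across extractions is to derive the substitute deterministically from the original value and a secret key. The sketch below assumes a keyed hash (HMAC-SHA256) and a small, made-up name list; it illustrates the repeatability requirement and is not a description of how any particular masking tool works.

```python
import hashlib
import hmac

# Hypothetical directory of substitute names; a real one would be much larger.
FAKE_FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Taylor", "Casey"]


def mask_name(real_name: str, secret_key: bytes) -> str:
    """Pick a substitute name that is always the same for the same input."""
    digest = hmac.new(secret_key, real_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_FIRST_NAMES)
    return FAKE_FIRST_NAMES[index]
```

Running the same extraction next month with the same key yields the same masked names, so test results stay comparable. The trade-off is that the key must be kept secret, and that a short substitute list maps many real names to the same fake one.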
Dynamic data masking
Dynamic data masking is used to protect, obscure, or block access to, sensitive data. While prevalent in production systems, it is also used when testers or data scientists require real data. Dynamic data masking is performed in real time, in response to a data request. When the data is located in multiple source systems, masking consistency is difficult, especially when dealing with disparate environments, and a wide variety of technologies. Dynamic data masking protects sensitive data on demand.
Dynamic data masking automatically streams data from a production environment, to avoid storing the masked data in a separate database. As a rule, it’s used for role-based security for applications – such as handling customer queries, or processing sensitive data, like health records – and in read-only scenarios, so that the masked data doesn’t get written back to the production system.
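Role-based masking at read time can be sketched as follows. The role names, field names, and masking rules here are all hypothetical; the point is only that the stored record is never changed, and that the masking decision is made per request.

```python
def mask_all_but_last4(value: str) -> str:
    """Hide every character except the last four."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


# Hypothetical policy: which fields are sensitive and who may see them unmasked.
MASKING_RULES = {"card_number": mask_all_but_last4, "ssn": mask_all_but_last4}
UNMASKED_ROLES = {"fraud_analyst"}


def read_record(record: dict, role: str) -> dict:
    """Return a copy of the record, masked according to the caller's role."""
    if role in UNMASKED_ROLES:
        return dict(record)
    return {
        field: MASKING_RULES[field](value) if field in MASKING_RULES else value
        for field, value in record.items()
    }
```

Because `read_record` returns a new dictionary, the masked values are never written back to the source, which matches the read-only usage described above.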
On-the-fly data masking
When analytics or test data is extracted from production systems, staging sites are often used to integrate, cleanse, and transform the data, before masking it. The masked data is then delivered to the analytics or testing environment. This multi-stage process is slow, cumbersome, and risky due to the possible exposure of private data.
On-the-fly data masking is performed on data as it moves from one environment to another, such as from production, to development or test. It’s ideal for enterprises engaging in continuous software development and large-scale data integrations. A subset of the masked data is generally delivered to authorized users upon request, because keeping a backup of all the masked data is inefficient and impractical.
Statistical data masking
Production data can hold different kinds of statistical information, which statistical data obfuscation techniques can disguise. Differential privacy is one such technique: it lets you share information about patterns in a data set without revealing information about the actual individuals in the data set.
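As an illustration of differential privacy, the sketch below adds Laplace noise to a count query, which is one common way to release an aggregate without exposing any single record. The function name and the choice of the Laplace mechanism are assumptions made for this example; the `epsilon` parameter controls the privacy/accuracy trade-off.

```python
import random


def noisy_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return the number of matching values plus Laplace noise."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one individual changes a count by at most 1,
    # so the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two exponential samples follows a Laplace distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise
```

A smaller `epsilon` means more noise and stronger privacy; a larger one gives more accurate counts but weaker protection.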
Test data masking
Applications, of any kind, require extensive testing before they can be released into production. Any production data provisioned for testing must be masked to protect sensitive information. For example, in a legacy modernization program, the modernized software components must be tested continuously, making test data masking a key component in the testing process. Masking data with referential integrity – from production systems, to the test environments – is critical.
Unstructured data masking
Scanned documents and image files, such as insurance claims, bank checks, and medical records, contain sensitive data stored as images. Many different formats (e.g., pdf, png, csv, email, and Office docs) are used daily by enterprises in their regular interactions with individuals. With the potential for so much sensitive data to be exposed in unstructured files, the need for unstructured data masking is obvious.
There are several techniques associated with data masking, including:
Scrambling randomly orders characters and/or numbers to obscure the original content. For example, when a shipment with tracking number 572918 in a production environment undergoes character scrambling, it might read 125879 in a different environment. Although easy to implement, scrambling can only be used on certain data types, and is not as secure as other techniques.
Data scrambling assures that the data can’t be easily traced back to its source.
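Character scrambling, as in the tracking number example, can be sketched with a random reordering of the characters. The function name is hypothetical, and the output depends on the random source used.

```python
import random


def scramble(value: str, rng: random.Random) -> str:
    """Randomly reorder the characters of the value."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)
```

The scrambled value always contains the same characters as the original, which is part of why scrambling is considered weaker than other techniques.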
Nullifying applies a null value to a data column so that unauthorized users won’t be able to see the actual data in it. Despite its ease of implementation, nullifying results in data with less integrity, which is often problematic in development and testing environments.
Substitution, which replaces the original data with another value, is one of the most effective data masking techniques because it preserves the original nature of the data. Although difficult to execute, substitution can be applied to several types of data, and is excellent protection against data breaches.
Like substitution, shuffling replaces the original values, but it draws the replacements from the same data column, by randomly reordering the values (or the characters and numbers) within it. For example, when patient name columns are shuffled across multiple patient records, the results look accurate but don’t reveal any personal medical information. However, anyone with access to the shuffling algorithm can reverse-engineer the process.
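Shuffling a single column across records might be sketched as follows. The record layout and function name are hypothetical; only the chosen column is reordered, and every other field stays with its original row.

```python
import random


def shuffle_column(records: list, column: str, rng: random.Random) -> list:
    """Return new records with one column's values randomly redistributed."""
    values = [record[column] for record in records]
    rng.shuffle(values)
    return [{**record, column: value} for record, value in zip(records, values)]
```

Since the same set of names is still present, statistical properties of the column are preserved, but a name can no longer be tied to its original diagnosis.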
Number and date variance is used for masking important financial and transaction date information. For example, masking the employee salaries column by applying a variance to each salary hides the actual figures, while keeping them in a realistic range. Data integrity can be assured by applying a variance of, say, +/- 5% to all salaries in the dataset.
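A variance of +/- 5% on a numeric column can be sketched like this. The function name is hypothetical, and a uniform random percentage is assumed; other distributions could be used.

```python
import random


def apply_variance(amount: float, max_pct: float, rng: random.Random) -> float:
    """Shift the amount by a random percentage within +/- max_pct."""
    factor = 1 + rng.uniform(-max_pct, max_pct) / 100
    return round(amount * factor, 2)
```

For a salary of 100,000 and `max_pct=5`, the masked value always falls between 95,000 and 105,000, so totals and averages stay roughly realistic.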
Date aging increases or decreases a date field based on a pre-defined data masking policy, within a specific date range. For example, decreasing the date of birth field by 1,000 days would change the date 1-January-2023 to 6-April-2020.
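Date aging is simple calendar arithmetic, as this short sketch shows. The function name is hypothetical; a negative number of days moves the date back.

```python
from datetime import date, timedelta


def age_date(original: date, days: int) -> date:
    """Shift a date forward (positive days) or backward (negative days)."""
    return original + timedelta(days=days)


# Moving 1 January 2023 back by 1,000 days crosses the 2020 leap year.
print(age_date(date(2023, 1, 1), -1000))  # 2020-04-06
```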
Data masking is challenging because not only must the altered data retain the basic characteristics of the original data, it must also be transformed enough to eliminate the risk of exposure, while retaining data integrity.
Enterprise IT landscapes typically have many production systems that are deployed on premises and in the cloud, across a wide variety of technologies. To mask data effectively, an organization needs to:
K2View Data Masking is powered by the K2View Data Product Platform, which organizes fragmented data from disparate systems according to business entities – customer, order, device, or anything else that’s important to the business. It then processes the data, applying data transformations, enrichments, and masking, and makes it instantly accessible to authorized data consumers.
A data product unifies everything a company knows about a specific business entity – including all interactions, transactions, and master data. The data for each specific business entity is managed in its own Micro-Database, which is encrypted by a 256-bit encryption key. The PII data in the Micro-Database is masked in-flight, according to predefined business rules.
K2View Data Masking supports dynamic data masking for operational use cases, like Customer 360, and static data masking for test data management and data pipelining into data lakes and data warehouses.
No staging needed
K2View in-flight data masking eliminates the need for slow, cumbersome, and risk-prone staging areas, where unmasked data is exposed to potential breaches. Using a graphical data orchestration tool, data from multiple production systems is integrated, cleansed, and masked on the fly.
A data product approach to data masking simplifies complexity, ensuring that individual customer data from different sources is:
Consistent, across multiple sources
Persistent, over time and multiple extractions
Preserved, with referential integrity and formatting
Dynamic data masking
K2View dynamic data masking transforms, obscures, or blocks access to sensitive information fields based on user roles and testing environment privileges.
And with data orchestration, a wide variety of in-line masking functions can be invoked to protect the data.
Unstructured data masking
Protect unstructured data including images, PDFs, XML, CSV, text-based files, and more, with static and dynamic masking capabilities. Replace sensitive photos with fake alternatives; use OCR to detect content and enable intelligent masking; and synthetically generate digital versions of receipts, checks, contracts, and other items for testing purposes.
By managing unstructured data within a data product schema, referential integrity and consistency are ensured.
Extensive and extendible masking functions
K2View Data Masking comes with a comprehensive library of prebuilt masking functions, designed to provide realistic, but fake, data.
The table below highlights a few examples, including masking that creates a valid social security number (SSN), selecting (masked) names from name directories, as well as generating random numbers, and address-based zip codes. The library can be easily extended by custom Java functions that implement additional masking functions.
Generate valid SSN
Generate valid number based on card type
First name/Last name/Zip code
Select from collection
Shuffle (preserve statistical diversity)
Concatenation based on new first and last name
Static masking based on a pre-provided value
Based on the provided Zip
Data masking has become a pillar technology that large organizations use to comply with privacy protection regulations.
Although the practice of masking data has been around for years, the sheer volume of data – structured and unstructured – and the ever-changing regulatory environment, have increased the complexity of data masking at enterprise scale.
Current data masking techniques are proving to be insufficient. However, a new data masking approach, based on data products, has proved its effectiveness in many of the world’s largest enterprises.