Fake data + Masked real data = Best test data

Written by Gil Trotino | September 26, 2023

Discover why combining fake data with masked real data leads to the most effective data for testing enterprise software quickly, safely, and at scale.

What is fake data and when is it used?

Fake data is fabricated information that appears real, but isn’t. The technical term for fake data is synthetic data.

Because the records they generate don’t belong to real people, synthetic data generation tools help protect Personally Identifiable Information (PII) and support data privacy compliance. They’re extensively used for testing software and training Machine Learning (ML) models.

In the case of software testing, fake data is used when real data is:

Non-existent, in early stages of development or when testing new capabilities or innovations.
Inaccessible, for security reasons.
Insufficient, when high volumes of test data are needed for load and performance testing.
Non-compliant, when a dataset contains PII or other sensitive data.

This article explains why enterprises should embrace both fake and masked real data to address the greatest variety of software testing and ML model training use cases.

Generating fake data via AI

Generative AI synthetic data models generate rich, compliant, and production-like test data using Artificial Intelligence (AI).

Pros
Generative AI imitates the patterns and structures of real data after learning them from large amounts of information. Once the model has been trained, generating synthetic data is easy, especially for regression and integration testing.
Cons
The AI model is trained on production data, which may not exist in large enough quantities. Defining the model may require data science skills and an in-depth understanding of the underlying source systems and data hierarchies. Building, training, and validating the model – and ensuring referential integrity of the generated data – are all parts of an iterative process requiring time, effort, and expertise.
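
A real generative model is well beyond a short example, but the train-then-sample workflow described above can be sketched with a deliberately simple statistical stand-in. The Python below is not a generative AI model: it only learns category frequencies and a mean and standard deviation from a few "real" rows, then samples new rows from them. The column names (plan, monthly_spend) and the functions fit and sample are hypothetical, chosen purely for illustration.

```python
import random
import statistics
from collections import Counter


def fit(rows):
    """'Train' step: learn simple per-column statistics from real rows."""
    plans = Counter(r["plan"] for r in rows)
    spends = [r["monthly_spend"] for r in rows]
    return {
        "plans": list(plans.keys()),
        "plan_weights": list(plans.values()),
        "spend_mean": statistics.mean(spends),
        "spend_stdev": statistics.stdev(spends),
    }


def sample(model, n, seed=0):
    """'Generate' step: draw n fake rows from the learned statistics."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        plan = rng.choices(model["plans"], weights=model["plan_weights"])[0]
        # Clamp at zero so the fake spend is never negative.
        spend = max(0.0, rng.gauss(model["spend_mean"], model["spend_stdev"]))
        rows.append({"plan": plan, "monthly_spend": round(spend, 2)})
    return rows


# Hypothetical "production" rows used as training data.
real_rows = [
    {"plan": "basic", "monthly_spend": 20.0},
    {"plan": "basic", "monthly_spend": 25.0},
    {"plan": "premium", "monthly_spend": 80.0},
    {"plan": "basic", "monthly_spend": 22.0},
]

model = fit(real_rows)
synthetic_rows = sample(model, 1000)
```

Because this stand-in samples each column independently, it loses the link between plan and spend that the real rows show. Preserving relationships like that, along with referential integrity across tables, is the part that takes real modeling effort.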

Generating fake data via business rules

A rules engine generates synthetic test data based on pre-defined business rules for each data element.

Pros
The rules method is the only viable way to generate fake data when real data isn’t available or accessible. Particularly suited to testing new software functionality or performing negative testing, it provides a high level of control over the generated test data.
Cons
A rules-based approach to generating fake data requires a detailed understanding of the business logic of the underlying systems of record, their data structures, and data hierarchies. Adequate for simple business logic (following clear and well-defined rules), this technique is not appropriate for generating data for enterprise-grade systems that involve complex business logic across multiple systems and their data sources. It’s also labor-intensive and time-consuming because separate rules must be defined for every data element.
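
As a minimal sketch of the rules-based approach, the Python below defines one rule per data element and applies every rule to build each record. The field names, value ranges, and the RULES and generate names are assumptions made for illustration, not taken from any particular rules engine.

```python
import random
from datetime import date, timedelta

# Hypothetical business rules: one generator function per data element.
# Each rule receives a random number generator and the 1-based row index.
RULES = {
    "customer_id": lambda rng, i: f"CUST-{i:06d}",
    "age": lambda rng, i: rng.randint(18, 95),
    "plan": lambda rng, i: rng.choice(["basic", "premium"]),
    "signup_date": lambda rng, i: (
        date(2020, 1, 1) + timedelta(days=rng.randint(0, 1000))
    ).isoformat(),
}


def generate(n, rules=RULES, seed=0):
    """Build n fake records by applying every rule to every row."""
    rng = random.Random(seed)
    return [
        {field: rule(rng, i) for field, rule in rules.items()}
        for i in range(1, n + 1)
    ]


valid_rows = generate(5)

# Negative testing: override a single rule to emit deliberately invalid values.
negative_rules = {**RULES, "age": lambda rng, i: -1}
invalid_rows = generate(5, rules=negative_rules)
```

Overriding a single rule is what gives this method its control over the output. The cost is also visible: every data element needs its own rule, which is why the effort grows quickly with the size of the schema.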

Masking real data

When real data IS accessible, the best way to turn it into compliant test data is to mask its PII and sensitive data elements.

Pros
Test data sourced from masked production data is the most realistic and valid by design. And, unlike the generative AI and rule-based methods described above, no understanding of the underlying systems is required.
Cons
The major drawbacks of data masking are that it’s complicated, and that the parameters of the production data limit the variation, diversity, and referential integrity of the test data.
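
One common way to mask PII is deterministic pseudonymization, sketched below in Python with a keyed hash (HMAC) from the standard library. This is only one possible masking technique, not necessarily the one any given tool uses. The key, the field names, and the mask_value and mask_row helpers are hypothetical.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager,
# never from source code.
SECRET_KEY = b"example-only-key"


def mask_value(value, prefix):
    """Replace a sensitive value with a keyed, irreversible token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"


def mask_row(row, pii_fields):
    """Return a copy of the row with the listed PII fields masked."""
    masked = dict(row)
    for field in pii_fields:
        masked[field] = mask_value(str(row[field]), field)
    return masked


# Hypothetical production rows from two tables that share an email column.
customers = [{"customer_id": 1, "email": "ann@example.com", "plan": "basic"}]
orders = [{"order_id": 10, "email": "ann@example.com", "total": 42.5}]

masked_customers = [mask_row(r, ["email"]) for r in customers]
masked_orders = [mask_row(r, ["email"]) for r in orders]
```

Because the same input always yields the same token, the masked email still joins the two tables. Non-sensitive fields such as plan and total pass through unchanged, which is what keeps the test data realistic.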

Embracing fake and masked real data

As explained above, each synthetic data generation technique has its use, so seek a solution that supports them all.

When available, masked real data is the best choice, except when negative testing or new functionality testing is required.

Ultimately, the decision of whether or how to combine fake and masked real data depends on your testing needs, access to production data, available resources, and technical know-how.
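
The guidance in this article can be condensed into a small decision helper. The Python below is a simplified sketch of that reasoning, not a complete selection policy. The function name, its parameters, and the returned labels are all hypothetical.

```python
def choose_test_data_approach(
    real_data_accessible,
    negative_or_new_feature_testing,
    volume_sufficient=True,
):
    """Suggest test data techniques, following the article's rules of thumb."""
    # No usable real data, or tests that real data cannot cover:
    # rules-based generation is the fallback.
    if not real_data_accessible or negative_or_new_feature_testing:
        return ["rules-based fake data"]

    # Real data is accessible: mask it first.
    approaches = ["masked real data"]

    # Top up with AI-generated data when more volume is needed,
    # for example for load and performance testing.
    if not volume_sufficient:
        approaches.append("AI-generated fake data")
    return approaches


early_stage = choose_test_data_approach(False, False)
regression = choose_test_data_approach(True, False)
load_test = choose_test_data_approach(True, False, volume_sufficient=False)
```

A real decision would also weigh available resources and technical skills, which this sketch leaves out.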
