Test Data Management (TDM) is the process of efficiently provisioning compliant data to test an application throughout its software development lifecycle.
Solving the Test Data Problem
The shift to agile development, with Continuous Integration / Continuous Delivery (CI/CD) pipelines, is accelerating the pace of innovation with increasing efficiency. For testing teams, this shift means that test data design and provisioning need to keep up with the faster pace.
According to SDLC Partners, QA professionals spend over 30% of their testing time dealing with defective test data. Most spend between 5 and 15 hours per week setting up or addressing issues having to do with test data. That equates to at least one day a week wasted on test data provisioning.
Shift-left testing works in tandem with agile software delivery in raising the standards of quality testing. With this method, testing begins at the earliest stages of the development process rather than being saved for the later stages. To implement this approach, precise, high-quality test data must be available at the time the developer or QA engineer needs it, and must reflect the variety of functional and edge-case scenarios that will occur with real application users.
With production data fragmented across multiple enterprise systems, ensuring complete and harmonized data for testing is a constant struggle. The need to mask data, as required by privacy regulations, and create synthetic data to augment the existing dataset, adds an additional layer of complexity.
For data engineering and testing teams, delivering high-quality test data environments, at a rapid pace, is critical. This paper reviews the challenges faced on the journey to DevOps test data management, and the steps required to get there.
DevOps Test Data Management
Software delivery acceleration and quality improvements are enabled by a shift-left testing approach – where testing is performed in the early stages of software development. Testing as early as possible in the software delivery lifecycle enables testers to identify bugs and fix them more quickly and at lower cost. However, earlier testing requires earlier availability of quality test data.
Catalyzed by the rapid growth in applications and the need for much faster delivery, software development has shifted gears to agile development methodologies, releasing smaller software deliverables in quick sprints.
Agile software delivery allows for continuous design, development, testing, and deployment, in short sprints.
In support of delivering software in short, iterative sprints, DevOps test data management has emerged as the practice of provisioning precise test data on demand in support of smaller-scope deliveries.
Agile development and shift-left testing must be combined with a test data management strategy. Otherwise – even with test management and test automation tools – executing software testing is slow, complex, and error-prone.
Wanted: Fresh, Precise Test Data
Test data preparation has always been a challenge, especially with agile development and CI/CD. And with the growing popularity of data-as-a-service automation, and the integration of multiple applications, provisioning valid test data has become even more complex.
For a test cycle to be effective, whether it's manual or automated, availability of test data that enables 100% test coverage is critical.
Fresh, precise test data should be available for running functional and non-functional tests, whether executed manually or using a test automation tool. The test data should enable the entire scope of the software to be tested.
"We’re on a journey to modernize our apps and to realize the benefits of embracing a DevOps methodology. But you hit a roadblock if you don’t have realistic data to test against."
Ward Chewing, VP of Network Services & Shared Platform, AT&T
Top Test Data Management Challenges
Implementing an efficient and effective test data management process poses several key challenges for enterprises to overcome. These include:
Sourcing the test data
Enterprise data is typically siloed and dispersed across many different data sources – including legacy systems, such as mainframes and SAP, but also new systems, such as NoSQL and cloud. It’s also stored in various formats, making it hard for software teams to get the data they need, when they need it. QA and software engineers spend too much time waiting for test data.
Subsetting the test data
Subsetting larger, diverse test datasets into smaller, more focused ones enables QA teams to achieve full test coverage. Test data subsetting is particularly critical for recreating and fixing production issues. It also allows data teams to minimize the quantity of test data and its associated hardware and software costs.
Protecting sensitive data
Data privacy regulations, such as CPRA, GDPR, and HIPAA, require that sensitive data and Personally Identifiable Information (PII) – such as names, Social Security Numbers, driver's licenses, and email addresses – be anonymized in the test environment. Discovering and de-identifying sensitive data and PII, while assuring referential integrity of the masked data, can be complex, time-consuming, and labor-intensive for data teams.
Enforcing referential integrity
Referential integrity refers to data and schema consistency across databases and tables. Assuring the referential integrity of masked test data is critical to the validity of the data.
Extending test coverage
Test coverage is a measure of how much of an application's code has been tested. Defining the necessary test cases is crucial, but making sure you have all the test data needed to fully operate the test cases is just as important. Low test coverage is directly related to high defect density.
Reducing false positives and negatives
When test data is poorly designed, it often causes false positive errors, leading to valuable time and effort wasted in dealing with non-existent software bugs. When test data is insufficient, it leads to false negatives, which can affect the quality and reliability of the software.
Reusing the test data
The ability to reuse test data is crucial when re-running test cases to verify software fixes. Versioning datasets for reuse enables teams to re-test fixes (verifying that the bugs discovered in previous tests have been resolved) and to perform regression testing (verifying that the changes have not broken functionality that previously worked).
Preventing data overrides
A common challenge for QA teams arises when testers accidentally override each other's test data, leading to corrupted test data and wasted time and effort. When this happens, the test data must be re-provisioned, and the tests re-run.
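One way to picture how reserving data prevents overwrites is a simple reservation registry. The sketch below is a minimal illustration, not a description of any particular tool: the `ReservationRegistry` class and its method names are hypothetical, and a real solution would persist reservations and handle concurrent access.

```python
class ReservationRegistry:
    """Tracks which tester currently owns each test entity (hypothetical design)."""

    def __init__(self):
        self._owners = {}  # entity_id -> name of the tester holding the reservation

    def reserve(self, entity_id, tester):
        # setdefault keeps the existing owner if the entity is already reserved.
        owner = self._owners.setdefault(entity_id, tester)
        return owner == tester

    def release(self, entity_id, tester):
        # Only the current owner can release a reservation.
        if self._owners.get(entity_id) == tester:
            del self._owners[entity_id]

    def can_modify(self, entity_id, tester):
        return self._owners.get(entity_id) == tester


registry = ReservationRegistry()
registry.reserve("CUST-1", "alice")   # alice now owns CUST-1
registry.reserve("CUST-1", "bob")     # refused: bob cannot take alice's data
```

A test harness could call `can_modify` before every write, so that a tester who does not hold the reservation is stopped before corrupting someone else's data.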
Securing Test Data
Test data management solutions must protect sensitive data either by masking it, or by using synthetic data instead of real data.
Data masking protects sensitive data and PII by creating a version of the data that can’t be identified or reverse-engineered, while also retaining referential integrity. Sensitive data that’s typically masked includes:
PII: Personally Identifiable Information, to comply with privacy regulations, such as GDPR, CPRA, and HIPAA
PCI-DSS: Payment Card Industry Data Security Standard (payment card information)
PHI: Protected Health Information
IP: Intellectual Property
Data masking best practices call for its use in non-production environments – such as data science, analytics, software development, and testing – that don’t require original production data.
Simply defined, data masking combines the processes and tools for making sensitive data unrecognizable, yet functional, to certain users.
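To illustrate how masking can retain referential integrity, the sketch below uses keyed hashing (HMAC-SHA256 from the Python standard library) so that the same input always yields the same token, in every table. This is one common technique, not necessarily the one a given TDM product uses. The key, the field names, and the `mask_value` helper are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would come from a secrets manager.
MASKING_KEY = b"example-only-key"


def mask_value(value: str, prefix: str = "tok") -> str:
    """Return a stable token for a sensitive value: same input, same output."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a longer token lowers the chance of collisions.
    return f"{prefix}_{digest[:12]}"


customers = [{"ssn": "123-45-6789", "name": "Ada Example"}]
orders = [{"order_id": 1, "customer_ssn": "123-45-6789"}]

masked_customers = [
    {"ssn": mask_value(c["ssn"]), "name": mask_value(c["name"], "name")}
    for c in customers
]
masked_orders = [
    {"order_id": o["order_id"], "customer_ssn": mask_value(o["customer_ssn"])}
    for o in orders
]
```

Because the masked SSN in the orders table equals the masked SSN in the customers table, joins between the two still work after masking. Note that the protection depends on keeping the key secret, since low-entropy values such as SSNs could otherwise be guessed by brute force.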
Synthetic data generation
Synthetic data generation is the process of creating artificial data that mimics the statistical patterns and properties of real-life data, or that best exemplifies how new features will function.
Synthetic data is generated using a variety of software algorithms. Even though it’s usually generated based on real data, synthetic data often contains no actual data from the original dataset.
Synthetic data contains no PII, making it ideal for compliance with data privacy laws when testing software under development and training ML models.
Two of the most prevalent synthetic data generation techniques are:
Generative AI, which uses Machine Learning (ML) to generate the data required to maximize test coverage.
A rules engine, which creates synthetic test data via user-defined business rules and parameters for the data.
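The rules-based technique can be sketched in a few lines. In this minimal example, each field is paired with a user-defined rule, and a seeded random generator makes the output repeatable. The field names, value ranges, and the `generate_rows` function are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical business rules: each field maps to a function of (rng, row_number).
RULES = {
    "customer_id": lambda rng, i: f"CUST-{i:05d}",
    "age": lambda rng, i: rng.randint(18, 90),
    "plan": lambda rng, i: rng.choice(["basic", "premium", "enterprise"]),
    # Negative balances are allowed on purpose, to cover edge cases.
    "balance": lambda rng, i: round(rng.uniform(-500.0, 10_000.0), 2),
}


def generate_rows(rules, count, seed=0):
    """Generate `count` synthetic rows; the same seed yields the same rows."""
    rng = random.Random(seed)
    return [
        {field: rule(rng, i) for field, rule in rules.items()}
        for i in range(1, count + 1)
    ]


rows = generate_rows(RULES, 5, seed=42)
```

Seeding the generator is a design choice that supports reuse: a failing test can be re-run against exactly the same synthetic dataset.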
Proven Test Data Management Strategy
Adopting a proven test data management strategy enables enterprises to accelerate test data provisioning and increase the quality of software delivery. Here are the 7 components necessary for provisioning test data at enterprise scale and complexity:
Define the test data requirements
Start by having QA teams define the data subsets they need. This includes the environment that the test data should be provisioned from and delivered to, the criteria by which to subset the data, and whatever data transformations may be required (e.g., data formatting or aging) to safely move it from a higher environment to a lower one – from production to staging, for example.
Anonymize PII for compliance
Any test data management strategy is incomplete without ensuring adequate privacy measures via data anonymization, data tokenization, or any other data protection method. Centralizing multi-source test data into a compressed and compliant test data store, that’s readily accessible to developers and testers, is the foundation for provisioning trusted test data.
Subset for maximum coverage
Having established which test data is needed, it’s time to extract it from the higher environment. When the required data is dispersed across many different systems, a test data management solution, capable of extracting multi-source data according to user-defined subset criteria, is critical.
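As a rough illustration of extracting data according to user-defined subset criteria, the sketch below selects customers with a predicate and then pulls only the related orders, so the subset stays referentially complete. The in-memory tables, field names, and `subset` function are assumptions for the example; a real solution would query the source systems directly.

```python
# Hypothetical source tables, held in memory for the example.
customers = [
    {"customer_id": 1, "state": "NY", "segment": "retail"},
    {"customer_id": 2, "state": "CA", "segment": "business"},
    {"customer_id": 3, "state": "NY", "segment": "business"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
    {"order_id": 13, "customer_id": 3},
]


def subset(customers, orders, predicate):
    """Keep customers matching the predicate, plus the orders that belong to them."""
    selected = [c for c in customers if predicate(c)]
    keep_ids = {c["customer_id"] for c in selected}
    related_orders = [o for o in orders if o["customer_id"] in keep_ids]
    return selected, related_orders


ny_business, ny_orders = subset(
    customers,
    orders,
    lambda c: c["state"] == "NY" and c["segment"] == "business",
)
```

Here the criteria select customer 3 only, and the subset carries that customer's two orders with it, leaving no orphaned rows.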
Transform the data to make it compatible
Ensure that the data is accurately prepared for testing prior to the start of the testing cycle. There are common scenarios that require the data from a higher environment to be transformed before it can be used in a lower environment. For example, when the software being tested includes changes to the underlying data schema or when the data in a higher environment needs to be "aged" to meet the needs of the test case.
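Data aging can be sketched as shifting every date field in a record by the same offset, which preserves the intervals between dates. The example below assumes that "aging" means moving dates backward or forward in time (for instance, to make an invoice look overdue). The record layout and the `age_record` function are hypothetical.

```python
from datetime import date, timedelta


def age_record(record, days, date_fields):
    """Return a copy of the record with each listed date field shifted by `days`."""
    shift = timedelta(days=days)
    aged = dict(record)  # shallow copy, so the original record is left untouched
    for field in date_fields:
        aged[field] = record[field] + shift
    return aged


invoice = {"invoice_id": 7, "issued": date(2024, 1, 15), "due": date(2024, 2, 14)}

# Shift both dates 60 days into the past; the 30-day payment term is preserved.
aged = age_record(invoice, days=-60, date_fields=["issued", "due"])
```

Applying one offset to all date fields keeps the record internally consistent, which matters when the test case depends on the gap between two dates rather than on the dates themselves.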
Operate: Roll back, reserve, and refresh
Testing is an iterative process. When bugs are discovered and fixed, testing is repeated to validate resolution. Testers should be able to quickly roll back the test data that was previously used, without impacting the test data currently being used for other tests. They should also be able to reserve test data (to prevent testers from overriding each other’s test data) and to instantly refresh pristine test data subsets from any higher environment. Further, all test data operations should be performed by developers and testers via a self-service workbench, to facilitate on-demand test data provisioning.
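The roll-back capability described above can be illustrated with labeled snapshots of a dataset. This is a minimal in-memory sketch: the `VersionedDataset` class is hypothetical, and a real test data store would version data in a database rather than with deep copies.

```python
import copy


class VersionedDataset:
    """Holds test rows plus labeled snapshots that can be restored (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows
        self._snapshots = {}

    def snapshot(self, label):
        # Deep copy, so later changes to the rows do not alter the snapshot.
        self._snapshots[label] = copy.deepcopy(self.rows)

    def rollback(self, label):
        # Copy again on restore, so the stored snapshot can be reused many times.
        self.rows = copy.deepcopy(self._snapshots[label])


dataset = VersionedDataset([{"account": "A-1", "balance": 100}])
dataset.snapshot("before-test")

dataset.rows[0]["balance"] = -25   # a test run modifies the data
dataset.rollback("before-test")    # the original state is restored for the re-run
```

Because each tester would hold a separate dataset object, rolling one back does not affect the data another tester is currently using.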
Create synthetic data
Synthetic data creation is a critical component of test data management, enabling enterprises to generate fake data for software testing. It's required when real data is insufficient or inaccessible, or when new software functionality needs to be tested. Modern test data management solutions should include the means to easily generate fake data on demand.
Provision test data from/to any environment
Software teams must be able to move test data from any source to any target environment. Why? Because, today, a tester might build the “perfect test dataset” for sprint 11, only to have the environment erased at the end of the sprint. That’s a waste. If that perfect test dataset could be kept intact, and then reused in, say, sprint 12, that would improve both productivity and the employee experience.
Test Data Management Solutions
Coupling agile software development with high-performance test data environments saves enterprises millions of dollars (as explained in the ROI section below). Test data automation increases test coverage, accelerates software delivery, reduces testing costs, and enhances the end-user experience.
The challenge is to find the most suitable test data management solution for your organization. A test data management approach based on business entities is ideal for overcoming enterprise complexities.
An entity-based approach ingests data from the source environments; masks, compresses, and versions it in flight; and then organizes it by business entities (e.g., customers, orders, loans, or products) into a test data store.
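To make the idea of organizing data by business entity more concrete, the sketch below groups rows from two source systems under the customer they belong to. It is only an illustration of the grouping step (masking, compression, and versioning are omitted), and the source names, fields, and `build_entities` function are assumptions.

```python
from collections import defaultdict

# Hypothetical rows from two source systems, already masked.
crm_rows = [
    {"customer_id": 1, "name": "tok_a1"},
    {"customer_id": 2, "name": "tok_b2"},
]
billing_rows = [
    {"customer_id": 1, "invoice": "INV-9"},
    {"customer_id": 1, "invoice": "INV-10"},
]


def build_entities(sources, key="customer_id"):
    """Group rows from every source system under the entity they belong to."""
    entities = defaultdict(lambda: defaultdict(list))
    for source_name, rows in sources.items():
        for row in rows:
            entities[row[key]][source_name].append(row)
    # Convert to plain dicts so that missing keys raise errors instead of growing.
    return {entity_id: dict(by_source) for entity_id, by_source in entities.items()}


store = build_entities({"crm": crm_rows, "billing": billing_rows})
```

With data laid out this way, provisioning "customer 1" means handing over a single entry that already contains that customer's rows from every system.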
An entity-based test data management solution enables testing teams to instantly:
- Subset test data using business parameters.
- Generate synthetic test data.
- Transform the data so that it suits the needs of the software being tested.
- Reserve test data to segregate it between testers.
- Snapshot and roll back to reuse data across testing cycles.
Entity-based test data management greatly simplifies and streamlines the test data management process, while enabling complete control of the process. The success of this approach is evidenced in Gartner Peer Reviews.
Discover K2view Test Data Management Tools, the #1 enterprise solution.
Benefits of Test Data Management
There are several test data management benefits derived from implementing the right TDM solution:
- Agility and speed to market
Providing development and testing teams with the right data, at the right time, enhances agility and accelerates time to market for software applications.
- Software quality
Test data management increases test coverage and shifts testing to the left, both of which improve the quality of delivered software by reducing defect density.
- Cost efficiencies
When done well, test data management improves cost efficiencies by reducing hardware and software costs, accelerating test data provisioning, preventing data duplication, better balancing the use of resources, and providing self-service capabilities to improve productivity. See the next section for more details.
The test data management solution should provide for both test data generation tools and data masking tools to ensure that only authorized personnel have access to sensitive data, enabling companies to comply with data protection regulations like CPRA, GDPR, and HIPAA.
- Employee experience
For data engineers, copying production databases into staging environments, by manually scrubbing, masking, and formatting data, is a long, tedious, repetitive process. For development and QA teams, a lot of time and effort are wasted on waiting for the data, using the wrong data, and dealing with problems related to the data (e.g., reporting false positives, lacking sufficient test coverage, or overriding each other’s test data). The right test data management solution improves job satisfaction for all test data stakeholders.
- Tester/developer productivity
Test data management empowers testers and developers to provision test data independently, without requiring data engineering or SQL expertise. So, instead of testers having to wait days and weeks for data teams to provision their test data, they’ll be able to access the test data subsets they need in minutes – with “do-it-yourself” test data provisioning software.
Test Data Management ROI
Return on investment (ROI) for a test data management tool can be quantified in 4 dimensions:
Reduction in test data provisioning costs, automating up to 70% of conventional manual tasks such as scripting, scrubbing and masking.
Improvement in team productivity and time to market, reducing application delivery cycle times by as much as 25%.
Savings derived from shifting testing to the left and expanding test coverage, enabling the detection and correction of errors earlier in the lifecycle.
Optimization of test data storage and database costs, by subsetting, generating synthetic data, and compressing test data storage, for more compact test datasets.
Test data management ROI: Over time, the benefits greatly outweigh the costs.
For more details on calculating the ROI of test data management, read the test data management ROI report.
What are the major challenges of test data management?
The major challenges of TDM testing include:
Managing enterprise complexity: Test data provisioning becomes a highly complex process when multiple, heterogeneous source systems, multiple test environments, and multiple software teams are involved.
Ensuring referential consistency: Test data subsets must maintain referential consistency, to ensure that they are complete, even after anonymization.
Ensuring data relevance: Maintaining fresh, accurate data is a challenge, especially when large amounts of test data are required.
Managing huge amounts of data: The massive quantity of test data can be difficult and costly to organize and persist, and can impact performance, scalability, and total cost of ownership.
Anonymizing test data: Test data must be carefully masked to ensure that PII data is never compromised, eliminating risk of breaching data privacy regulations.
What is DevOps test data management?
DevOps test data management integrates testing into the DevOps pipeline, by automating the collection, delivery, and management of test data as part of the Continuous Integration / Continuous Delivery (CI/CD) process. The goals are to speed up the testing process, enhance cooperation between development and testing teams, and improve overall application quality. When test data management is embedded in the DevOps pipeline, enterprises can better manage the large volumes of data generated during testing, and ensure that tests are executed with accurate, consistent, and relevant data.
What are the benefits of test data management?
Test data management benefits include:
Better test data coverage: By linking test data between test cases and requirements, test data management can deliver a 360-degree view of test data coverage and identify error patterns.
Reduced costs: From a test data warehouse, the appropriate data can be provisioned for different testing types (e.g., functional, integration, performance, etc.), reducing the need for redundant data copies and extra storage.
Greater compliance and security: To ensure compliance with data privacy regulations, data masking and synthetic data generation have become intrinsic to test data management.
Test data reusability: Reusable test data is categorized and archived in the test data warehouse for future use by testers, reducing costs even further.
No data copies needed: When a test data warehouse is used by all teams, relational data integrity and optimized storage are maintained.
Better applications, with fewer defects: Shift-left testing identifies and deals with problems earlier on, in the testing phase. The result is customer trust.
What are the key components of a test data management strategy?
The key components of a test data management strategy are:
Provisioning data by business entity: A solid test data management strategy lets testers provision lifelike, trusted data systematically, for any test case, on demand.
Extracting data in flight: The TDM testing team, or test data automation system, should be able to request the necessary test data, on the fly, with no preparation.
Refreshing/syncing data continuously: Testers need a test data management system that is capable of refreshing data (granularly, for each component) and that is simple to sync – for rollbacks, should the need arise.
Anonymizing sensitive data: The ability to unify test data from multiple sources, perform data anonymization or de-anonymization, as needed, while constantly protecting it, is critical to compliance with privacy laws.
Synthesizing data when required: When there’s not enough test data available from production sources, synthetic-data generation tools allow testing teams to fill any gaps with artificial, yet very lifelike, data.
How does test data management fit into the software development life cycle?
Test data management is an important part of the software development life cycle. It typically begins with the creation of test data, and then continues through the execution of tests, with data being refreshed and synced as needed. The test data is used to improve application quality, and can be reused for future efforts.
Automating TDM testing accelerates agile software delivery and expands regression testing coverage. Enterprises are adopting a 3-step process to do this:
Extract: Connect to all data sources to synchronize data extraction.
Manage: Integrate, mask, transform, subset, and generate test datasets in the test data management platform.
Provision: Provision the test data from the test data management system to the testing environments, on-demand.
How do test data management tools connect with source systems?
TDM tools provision data from and to legacy and modern systems via native database connectors and APIs. These include:
SQL databases, such as Oracle, MS SQL, and PostgreSQL
NoSQL databases, such as Azure Cosmos, Couchbase, MongoDB, and Redis
Legacy mainframe (MF) databases, such as IBM DB2, IMS, and DB/400
Packaged applications, such as SAP (see SAP test data management), Salesforce, and Loan IQ
What are the common types of test data?
There are 4 types of test data used for testing software systems:
Production database copies, which are costly in terms of database hardware and software licenses, time-consuming to create (and therefore refreshed only once every several months), and pose significant data privacy risks.
Data subsets, which are the minimal datasets needed to cover the entire scope of a given set of test cases. Data subsets alleviate the time and costs of provisioning full production copies, but may still contain sensitive data.
Masked data subsets, which represent the ideal form of test data, because they contain the minimal dataset required for 100% test coverage, while adhering to data privacy regulations.
Synthetic data is computer-generated data, and serves as the most appropriate form of test data for testing new software functionality (for which test data doesn't exist), and for negative testing (to make the software fail intentionally).
Organizations should look for a test data management solution that supports the combination of data subsetting, data masking, and synthetic data generation.