Test data management (TDM) is the process of provisioning the needed data to fully test software apps while ensuring data privacy compliance.
The shift to agile development, enabled by CI/CD pipelines, is accelerating the pace of innovation with increasing efficiency. For testing teams, this shift means that test data creation and provisioning need to keep up with the faster pace.
A recent report published by analyst firm Gartner details the pros and cons of the various methods of generating and provisioning test data.
01
Solving the test data problem
According to SDLC Partners, QA professionals spend over 30% of their testing time dealing with defective test data. Most spend between 5 and 15 hours per week setting up test data or addressing issues with it. That equates to between half a day and nearly two working days a week lost to test data provisioning.
Shift-left testing works in tandem with agile software delivery in raising the standards of quality testing. With this method, testing begins at the earliest stages of the development process rather than being saved for later-stage application development. To implement this approach, precise, high-quality test data must be available at the time the developer or QA engineer needs it, and must reflect the range of functional and edge cases that real application users will encounter.
With production data fragmented across multiple enterprise systems, ensuring complete and harmonized data for testing is a constant struggle. The need to mask data, as required by privacy regulations, and create synthetic data to augment the existing dataset, adds an additional layer of complexity.
For data engineering and testing teams, delivering high-quality test data environments, at a rapid pace, is critical. This paper reviews the challenges faced on the journey to DevOps test data provisioning, and the steps required to get you there.
02
DevOps and test data automation
Test data preparation has always been a challenge, and it has become more so with the advent of agile development, CI/CD, and ephemeral data environments, which enable parallelization and the related resource and cost savings.
And, with the growing popularity of data-as-a-service automation and the integration of multiple applications, provisioning valid test data has become even more complex.
For a test cycle to be effective, whether it's manual or automated, availability of test data that enables 100% test coverage is critical. Fresh, precise test data should be available for running functional and non-functional tests, whether executed manually or using a test automation tool. The test data should enable the entire scope of the software to be tested.
Test data automation is the process of automatically delivering test data to lower environments, as requested by software and quality engineering teams. It integrates test data into an organization's DevOps CI/CD pipelines, ensuring that the test data is complete, precise, and current.
Automated procedures that provision test data by connecting directly to the source systems might create an overload that impacts performance. To prevent this, instead of building one automation flow that extracts data from the source systems, and delivers it directly to the testing environment, the process should be done in 3 separate steps:
- Connect to all your data sources to synchronize data extraction.
- Integrate, mask, transform, subset, and generate your test datasets.
- Provision the test data, from your test data management tools to your testing environments, on demand.
This division, with TDM playing a central role, ensures that the testing environments receive the test data they need, without bringing down the production systems.
When provisioning production-grade test data, the data should be requested from your TDM tools, and not the source system, to prevent a flood of requests that may result in system instability. The 3-step approach to test data automation is also more secure, because it minimizes direct access to production data.
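The 3-step division can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular TDM product: the `SourceSystem` and `StagingStore` classes, the `mask_email` helper, and the sample rows are all hypothetical. The point it shows is that the source system is read once during synchronization, while every later provisioning request is served from the intermediate store.

```python
from dataclasses import dataclass, field


@dataclass
class SourceSystem:
    """Stand-in for a production system; counts how often it is read."""
    name: str
    rows: list
    reads: int = 0

    def extract(self):
        self.reads += 1
        return [dict(row) for row in self.rows]


@dataclass
class StagingStore:
    """Hypothetical TDM store that sits between sources and test environments."""
    datasets: dict = field(default_factory=dict)

    def sync(self, sources):
        # Step 1: one synchronized extraction per source.
        for source in sources:
            self.datasets[source.name] = source.extract()

    def prepare(self, mask):
        # Step 2: mask inside the store (a real tool would also subset or transform).
        for name, rows in self.datasets.items():
            self.datasets[name] = [mask(row) for row in rows]

    def provision(self, name):
        # Step 3: serve on-demand requests from the store, never from the source.
        return [dict(row) for row in self.datasets[name]]


def mask_email(row):
    """Replace the email with a value derived from the non-sensitive row ID."""
    masked = dict(row)
    masked["email"] = f"user{row['id']}@example.test"
    return masked


crm = SourceSystem("crm", [{"id": 1, "email": "ann@corp.com"},
                           {"id": 2, "email": "bob@corp.com"}])
store = StagingStore()
store.sync([crm])
store.prepare(mask_email)

# Ten testers request data; the source system has still been read only once.
requests = [store.provision("crm") for _ in range(10)]
print(crm.reads, len(requests))
```

Keeping the extraction count independent of the number of test requests is what protects the source systems from a flood of provisioning calls.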
Automation via a data product platform serves to:
- Reduce infrastructure costs, by decreasing storage, and sharing data products across domains.
- Eliminate release cycle friction, with the quick provisioning of high-quality test data.
03
Wanted: Fresh, precise, protected test data
Software delivery acceleration and quality improvements are enabled by a shift-left testing approach – where testing is performed in the early stages of software development. Testing as early as possible in the software delivery lifecycle enables testers to identify bugs and fix them more quickly and at lower cost. However, earlier testing requires earlier availability of quality test data.
Catalyzed by the rapid growth in applications and the need for much faster delivery, software development has shifted gears to agile development methodologies, releasing smaller software deliverables in quick sprints.
Fig. 1 Agile software delivery allows for continuous design, development, testing, and deployment, in short sprints.
In support of delivering software in short, iterative sprints, DevOps test data management has emerged as the practice of provisioning precise test data on demand in support of smaller-scope deliveries.
Agile development and shift-left testing must be combined with a test data management strategy that includes data security integration as well as compliance with regulations like CPRA, GDPR, HIPAA, and PCI.
Otherwise – even with test automation tools – testing would be complex and error-prone, and might expose unprotected sensitive data.
Data masking tools protect test data by replacing Personally Identifiable Information (PII) with scrambled, yet statistically similar, data. Masked test data can’t be identified or reverse-engineered, but remains functional for testing environments. The use of anonymized data, instead of original production data, safeguards sensitive information in the event of a mass data breach – shielding your company from financial, legal, and brand liability.
Synthetic data generation tools serve the same purpose, but the data they produce contains no sensitive information to begin with. By creating a statistically equivalent synthetic dataset, testers can test new software quickly, without security or non-compliance risk. A prime example is synthetic patient data, which avoids the risk of re-identifying de-identified medical records.
In the test data masking vs synthetic test data showdown, DevOps and testing teams must decide which model, or combination of models, is most suitable for their particular needs.
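To make the difference between the two models concrete, here is a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) as the masking function and a seeded random generator for synthesis; both are illustrative choices, and the field names, the `SECRET_KEY` value, and the `example.test` domain are hypothetical. It does not attempt to preserve statistical distributions, which real tools would.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"demo-only-key"  # assumption: a real key would come from a secrets manager


def mask_value(value: str) -> str:
    """Replace a PII value with a keyed token that cannot be reversed without the key."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def mask_customer(record: dict) -> dict:
    """Masking: start from a production record and hide its PII fields."""
    masked = dict(record)
    masked["name"] = "cust_" + mask_value(record["name"])
    masked["email"] = mask_value(record["email"]) + "@example.test"
    return masked


def synthesize_customer(rng: random.Random) -> dict:
    """Synthesis: build a record from scratch, with no production input at all."""
    number = rng.randint(100000, 999999)
    return {
        "name": f"cust_{number}",
        "email": f"cust_{number}@example.test",
        "balance": round(rng.uniform(0, 5000), 2),
    }


production_row = {"name": "Ann Lee", "email": "ann.lee@corp.com", "balance": 120.50}
masked_row = mask_customer(production_row)          # keeps real, non-sensitive values
synthetic_row = synthesize_customer(random.Random(42))  # entirely artificial
print(masked_row)
print(synthetic_row)
```

The masked row keeps the real balance, so it reflects production behavior, whereas the synthetic row can be produced even when no production data exists for the feature under test.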
"We’re on a journey to modernize our apps and to realize the benefits of embracing a DevOps methodology.
But you hit a roadblock if you don’t have realistic data to test against."
Ward Chewing, VP of Network Services & Shared Platform, AT&T
04
Top test data management challenges
Implementing an efficient and effective TDM process poses several key challenges for enterprises to overcome. These include:
- Sourcing the test data
Enterprise data is typically siloed and dispersed across many different data sources – including legacy systems, such as mainframes and SAP, but also new systems, such as NoSQL and cloud. It’s also stored in various formats, making it hard for software teams to get the data they need, when they need it. QA and software engineers spend too much time waiting for test data.
- Subsetting the test data
Subsetting larger, diverse test datasets into smaller, more focused ones enables QA teams to achieve full test coverage. Test data subsetting is particularly critical for recreating and fixing production issues. It also allows data teams to minimize the quantity of test data and its associated hardware and software costs.
- Protecting sensitive data
Data privacy regulations, such as CPRA, GDPR, and HIPAA, require that sensitive data and Personally Identifiable Information (PII) – such as names, Social Security Numbers, driver's licenses, and email addresses – be anonymized in the test environment. Discovering and de-identifying sensitive data and PII, while assuring referential integrity of the masked data, can be complex, time-consuming, and labor-intensive for data teams.
- Enforcing referential integrity
Referential integrity refers to data and schema consistency across databases and tables. Assuring the referential integrity of masked test data is critical to the validity of the data.
- Extending test coverage
Test coverage is a measure of how much of an application's code has been tested. Defining the necessary test cases is crucial, but making sure you have all the test data needed to fully operate the test cases is just as important. Low test coverage is directly related to high defect density.
- Reducing false positives and negatives
When test data is poorly designed, it often causes false positive errors, leading to valuable time and effort wasted in dealing with non-existent software bugs. When test data is insufficient, it leads to false negatives, which can affect the quality and reliability of the software.
- Reusing the test data
The ability to reuse test data is crucial when re-running test cases to verify software fixes. Versioning datasets for reuse enables teams to perform regression testing (re-testing to verify that the bugs discovered in previous tests have been resolved).
- Preventing data overrides
A common challenge for QA teams arises when testers accidentally override each other's test data, leading to corrupted test data and wasted time and effort. When this happens, the test data must be re-provisioned, and the tests re-run.
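The referential integrity challenge in particular is easier to see with an example. The Python sketch below assumes a simple lookup-based masker that always maps the same original value to the same surrogate; the `ConsistentMasker` class, the table layouts, and the sample values are hypothetical. Because the customer key and the foreign key in the orders table pass through the same masker, the masked tables can still be joined.

```python
import itertools


class ConsistentMasker:
    """Maps each original value to exactly one surrogate, so joins survive masking."""

    def __init__(self, prefix: str):
        self._prefix = prefix
        self._mapping = {}
        self._counter = itertools.count(1)

    def mask(self, value: str) -> str:
        # Reuse the surrogate if this value has been seen before.
        if value not in self._mapping:
            self._mapping[value] = f"{self._prefix}{next(self._counter):06d}"
        return self._mapping[value]


customers = [
    {"ssn": "123-45-6789", "name": "Ann Lee"},
    {"ssn": "987-65-4321", "name": "Bob Ray"},
]
orders = [
    {"order_id": 1, "customer_ssn": "987-65-4321"},
    {"order_id": 2, "customer_ssn": "123-45-6789"},
    {"order_id": 3, "customer_ssn": "987-65-4321"},
]

ssn_masker = ConsistentMasker("SSN-")
masked_customers = [{**c, "ssn": ssn_masker.mask(c["ssn"]), "name": "REDACTED"}
                    for c in customers]
masked_orders = [{**o, "customer_ssn": ssn_masker.mask(o["customer_ssn"])}
                 for o in orders]

# Every masked order should still point at a masked customer.
known_keys = {c["ssn"] for c in masked_customers}
orphans = [o for o in masked_orders if o["customer_ssn"] not in known_keys]
print(orphans)
```

If each table were masked with its own independent random values instead, the `orphans` list would contain every order, and tests that join the two tables would fail for reasons unrelated to the software under test.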
05
Proven test data management strategy
Adopting a proven test data management strategy enables enterprises to accelerate test data provisioning and increase the quality of software delivery. Here are the 7 components necessary for provisioning test data at enterprise scale and complexity:
- Define the test data requirements
Start by having QA teams define the data subsets they need. This includes the environment that the test data should be provisioned from and delivered to, the criteria by which to subset the data, and whatever data transformations may be required (e.g., data formatting or aging) to safely move it from a higher environment to a lower one – from production to staging, for example.
- Anonymize PII for compliance
Any TDM strategy is incomplete without ensuring adequate privacy measures via data anonymization, data tokenization, or any other data protection method. Centralizing multi-source test data into a compressed and compliant test data store, that’s readily accessible to developers and testers, is the foundation for provisioning trusted test data.
- Subset for maximum coverage
Having established which test data is needed, it’s time to extract it from the higher environment. When the required data is dispersed across many different systems, a TDM solution, capable of extracting multi-source data according to user-defined subset criteria, is critical.
- Transform the data to make it compatible
Ensure that the data is accurately prepared for testing prior to the start of the testing cycle. There are common scenarios that require the data from a higher environment to be transformed before it can be used in a lower environment. For example, when the software being tested includes changes to the underlying data schema or when the data in a higher environment needs to be "aged" to meet the needs of the test case.
- Operate: Roll back, reserve, and refresh
Testing is an iterative process. When bugs are discovered and fixed, testing is repeated to validate resolution. Testers should be able to quickly roll back the test data that was previously used, without impacting the test data currently being used for other tests. They should also be able to reserve test data (to prevent testers from overriding each other’s test data) and to instantly refresh pristine test data subsets from any higher environment. Further, all test data operations should be performed by developers and testers via a self-service workbench, to facilitate on-demand test data provisioning.
- Create synthetic data
Synthetic data creation is a critical component of test data management, enabling enterprises to generate fake data for software testing. It's required when real data is insufficient or inaccessible, or when new software functionality needs to be tested. Modern TDM solutions should include the means to easily generate fake data on demand.
- Provision test data from/to any environment
Software teams must be able to move test data from any source to any target environment. Why? Because, today, a tester might build the “perfect test dataset” for sprint 11, only to have the environment erased at the end of the sprint. That’s a waste. If that perfect test dataset could be kept intact, and then reused in, say, sprint 12, that would improve both productivity and the employee experience.
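The reserve and roll back operations described in the strategy above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not a real workbench: the `EnvironmentData` class, the `ReservationError` exception, and the entity IDs are hypothetical, and a production tool would persist snapshots rather than hold deep copies in memory.

```python
import copy


class ReservationError(Exception):
    """Raised when a tester touches an entity reserved by someone else."""


class EnvironmentData:
    """Hypothetical in-memory test environment keyed by business entity ID."""

    def __init__(self, entities: dict):
        self.entities = entities
        self.reservations = {}  # entity_id -> tester who holds it
        self.snapshots = {}     # label -> deep copy of all entities

    def reserve(self, entity_id, tester):
        owner = self.reservations.setdefault(entity_id, tester)
        if owner != tester:
            raise ReservationError(f"{entity_id} is reserved by {owner}")

    def update(self, entity_id, tester, **changes):
        # Updates are only allowed on entities the tester has reserved.
        self.reserve(entity_id, tester)
        self.entities[entity_id].update(changes)

    def snapshot(self, label):
        self.snapshots[label] = copy.deepcopy(self.entities)

    def rollback(self, label, entity_id):
        # Restore one entity only, leaving other testers' data untouched.
        self.entities[entity_id] = copy.deepcopy(self.snapshots[label][entity_id])


env = EnvironmentData({"cust-1": {"status": "active"}, "cust-2": {"status": "active"}})
env.snapshot("sprint-11-baseline")

env.update("cust-1", "alice", status="suspended")
env.update("cust-2", "carol", status="closed")

try:
    env.update("cust-1", "bob", status="deleted")  # blocked: alice holds cust-1
    blocked = False
except ReservationError:
    blocked = True

env.rollback("sprint-11-baseline", "cust-1")  # alice re-runs her test on pristine data
print(blocked, env.entities)
```

Rolling back per entity, rather than restoring the whole environment, is what lets one tester reset data without disturbing the data another tester is currently using.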
06
Test data management solutions
Coupling agile software development with high-performance test data environments saves enterprises millions of dollars (as explained in the ROI section below). Test data automation increases test coverage, accelerates software delivery, reduces testing costs, and enhances the end-user experience.
The challenge is to find the most suitable TDM solution for your organization. A test data management approach based on business entities is ideal for overcoming enterprise complexities.
An entity-based approach ingests data from the source environments; masks, compresses, and versions it in flight; and then organizes it by business entities (e.g., customers, orders, loans, or products) into a test data store.
Fig. 2 Test data management approach based on business entities
An entity-based TDM solution enables testing teams to instantly:
- Subset test data using business parameters.
- Generate synthetic test data.
- Transform the data so that it suits the needs of the software being tested.
- Reserve test data to segregate it between testers.
- Snapshot and roll back to reuse data across testing cycles.
Entity-based test data management greatly simplifies and streamlines the test data management process, while enabling complete control of the process. The success of this approach is evidenced in Gartner Peer Reviews.
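As a rough illustration of subsetting by business entity, the Python sketch below selects customers by a business rule and then collects every related row from the other tables. The table names, the `subset_by_entity` function, and the sample rows are hypothetical, and the sketch assumes every table carries the entity key directly, which real schemas often do not.

```python
TABLES = {
    "customers": [
        {"customer_id": 1, "segment": "retail"},
        {"customer_id": 2, "segment": "business"},
        {"customer_id": 3, "segment": "retail"},
    ],
    "orders": [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 2},
        {"order_id": 12, "customer_id": 3},
        {"order_id": 13, "customer_id": 1},
    ],
    "invoices": [
        {"invoice_id": 100, "customer_id": 2},
        {"invoice_id": 101, "customer_id": 3},
    ],
}


def subset_by_entity(tables, root, key, predicate):
    """Pick entities from the root table by a business rule,
    then pull the rows belonging to those entities from every table."""
    selected = {row[key] for row in tables[root] if predicate(row)}
    return {
        name: [row for row in rows if row[key] in selected]
        for name, rows in tables.items()
    }


retail_subset = subset_by_entity(
    TABLES, root="customers", key="customer_id",
    predicate=lambda customer: customer["segment"] == "retail",
)
print(retail_subset)
```

Because the subset is driven by the entity key, each selected customer arrives with its orders and invoices intact, so the subset stays complete for the test cases that depend on it.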
07
Benefits of test data management
There are several test data management benefits derived from implementing the right TDM solution:
- Agility and speed to market
Providing development and testing teams with the right data, at the right time, enhances agility and accelerates time to market for software applications.
- Software quality
Test data management increases test coverage and shifts testing to the left, both of which improve the quality of delivered software by reducing defect density.
- Cost efficiencies
When done well, TDM improves cost efficiencies by reducing hardware and software costs, accelerating test data provisioning, preventing data duplication, better balancing the use of resources, and providing self-service capabilities to improve productivity. See the next section for more details.
- Compliance
Your TDM solution should provide both test data generation tools and data masking tools to ensure that only authorized personnel have access to sensitive data, enabling companies to comply with data protection regulations like CPRA, GDPR, HIPAA, and PCI.
- Employee experience
For data engineers, copying production databases into staging environments, by manually scrubbing, masking, and formatting data, is a long, tedious, repetitive process. For development and QA teams, a lot of time and effort is wasted waiting for the data, using the wrong data, and dealing with data-related problems (e.g., reporting false positives, lacking sufficient test coverage, or overriding each other’s test data). The right TDM tools improve job satisfaction for all test data stakeholders.
- Tester/developer productivity
TDM empowers testers and developers to provision test data independently, without requiring data engineering or SQL expertise. So, instead of waiting days or weeks for data teams to provision their test data, testers can access the test data subsets they need in minutes – with “do-it-yourself” test data provisioning software.
08
Test data management ROI
Return on investment (ROI) for a test data management tool can be quantified in 4 dimensions:
- Reduction in test data provisioning costs, automating up to 70% of conventional manual tasks such as scripting, scrubbing and masking.
- Improvement in test data delivery speeds, team productivity, and time to market, reducing application delivery cycle times by as much as 25% and test environment refresh times from 3 days to 3 minutes.
- Savings derived from shifting testing to the left and expanding test coverage, enabling the detection and correction of errors earlier in the lifecycle.
- Optimization of test data storage and database costs, by subsetting, generating synthetic data, and compressing test data storage, for more compact test datasets.
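The percentages quoted in these dimensions can be turned into a quick back-of-the-envelope calculation. In the Python sketch below, the 70%, 25%, and 3-days-to-3-minutes figures come from the list above, and they are upper bounds ("up to", "as much as"). The baseline of 400 manual hours per month and a 40-day release cycle is purely hypothetical, and "3 days" is assumed to mean calendar days.

```python
MINUTES_PER_DAY = 24 * 60

# Upper-bound figures quoted in the ROI dimensions.
automation_share = 0.70       # share of manual provisioning tasks automated
cycle_time_reduction = 0.25   # reduction in application delivery cycle time
refresh_before_min = 3 * MINUTES_PER_DAY  # assumption: 3 calendar days
refresh_after_min = 3

# Hypothetical baseline, for illustration only.
manual_hours_per_month = 400
release_cycle_days = 40

hours_saved = manual_hours_per_month * automation_share
new_cycle_days = release_cycle_days * (1 - cycle_time_reduction)
refresh_speedup = refresh_before_min / refresh_after_min

print(f"Manual hours saved per month: {hours_saved:.0f}")
print(f"Release cycle: {release_cycle_days} -> {new_cycle_days:.0f} days")
print(f"Refresh speedup: {refresh_speedup:.0f}x")
```

Substituting an organization's own baseline for the two hypothetical values gives a first estimate of the upper bound on savings, before the storage and shift-left dimensions are taken into account.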
Fig. 3 Test data management ROI: Over time, the benefits greatly outweigh the costs.
The global enterprises that rely on test data management hail from a wide range of industries, including telecommunications and media, financial services, healthcare, retail, and more.
The proof is in the numbers. For example, check out the percentage improvements achieved at this TDM Fortune 500 bank.
For more details on calculating the ROI of test data management, read the test data management ROI report.
Test data management FAQs
What are the major challenges of test data management?
The major challenges of TDM testing include:
- Managing enterprise complexity: Test data provisioning becomes a highly complex process when multiple, heterogeneous source systems, multiple test environments, and multiple software teams are involved.
- Complying with data security and privacy regulations: Test data often contains personal or sensitive information which must be obscured by data masking tools or data tokenization tools.
- Ensuring referential consistency: Test data subsets must maintain referential consistency, to ensure that they are complete, even after anonymization.
- Ensuring data relevance: Maintaining fresh, accurate data is a challenge, especially when large amounts of test data are required.
- Managing huge amounts of data: The massive quantity of test data can be difficult and costly to organize and persist, and can impact performance, scalability, and total cost of ownership.
- Anonymizing test data: Test data must be carefully masked to ensure that PII data is never compromised, eliminating risk of breaching data privacy regulations.
What is DevOps test data management?
DevOps test data management integrates testing into the DevOps pipeline, by automating the collection, delivery, and management of test data as part of the Continuous Integration / Continuous Delivery (CI/CD) process. The aim is to speed up the testing process, enhance cooperation between development and testing teams, and improve overall application quality. When test data management is embedded in the DevOps pipeline, enterprises can better manage the large volumes of data generated during testing, and ensure that tests are executed with accurate, consistent, and relevant data.
What are the benefits of test data management?
Test data management benefits include:
- Better test data coverage: By linking test data between test cases and requirements, test data management can deliver a 360-degree view of test data coverage and identify error patterns.
- Reduced costs: From a test data warehouse, the appropriate data can be provisioned for different testing types (e.g., functional, integration, and performance), reducing the need for redundant data copies and extra storage.
- Greater compliance and security: To ensure compliance with data privacy regulations, data masking and synthetic data generation have become intrinsic to test data management.
- Test data reusability: Reusable test data is categorized and archived in the test data warehouse for future use by testers, reducing costs even further.
- No data copies needed: When all teams use a single test data warehouse, relational data integrity and optimized storage are maintained.
- Better applications, with fewer defects: Shift-left testing identifies and deals with problems earlier on, in the testing phase. The result is customer trust.
What are the key components of a test data management strategy?
The key components of a test data management strategy are:
- Provisioning data by business entity: A solid test data management strategy lets testers provision lifelike, trusted data systematically, for any test case, on demand.
- Extracting data in flight: The TDM testing team, or test data automation system, should be able to request the necessary test data, on the fly, with no preparation.
- Refreshing/syncing data continuously: Testers need a test data management system that can refresh data granularly, for each component, and that is simple to sync – for rollbacks, should the need arise.
- Anonymizing sensitive data: The ability to unify test data from multiple sources, perform data anonymization or de-anonymization, as needed, while constantly protecting it, is critical to compliance with privacy laws.
- Synthesizing data when required: When there’s not enough test data available from production sources, synthetic data generation tools allow testing teams to fill any gaps with artificial, yet very lifelike, data.
How does test data management fit into the software development life cycle?
Test data management is an important part of the software development life cycle. It typically begins with the creation of test data, and then continues through the execution of tests, with data being refreshed and synced as needed. The test data is used to improve application quality, and can be reused for future efforts.
Automating TDM testing accelerates agile software delivery and expands regression testing coverage. Enterprises are adopting a 3-step process to do this:
- Extract: Connect to all data sources to synchronize data extraction.
- Manage: Integrate, mask, transform, subset, and generate test datasets in the test data management platform.
- Provision: Provision the test data from the test data management system to the testing environments, on demand.
How do test data management tools connect with source systems?
TDM tools provision data from and to legacy and modern systems via native database connectors and APIs. These include:
- SQL databases, such as Oracle, MS SQL, and PostgreSQL
- NoSQL databases, such as Azure Cosmos, Couchbase, MongoDB, and Redis
- Legacy mainframe databases, such as IBM DB2, IMS, and DB/400
- Packaged applications, such as SAP (see SAP test data management), Salesforce, and Loan IQ
What are the common types of test data?
There are 4 types of test data used for testing software systems:
- Production database copies, which are costly in terms of database hardware and software licenses, are time-consuming to create (and are therefore refreshed only once every several months), and pose significant data privacy risks.
- Data subsets, which are the minimal datasets needed to cover the entire scope of a given set of test cases. Data subsets alleviate the time and costs of provisioning full production copies, but may still contain sensitive data.
- Masked data subsets, which represent the ideal form of test data, because they contain the minimal dataset required for 100% test coverage, while adhering to data privacy regulations.
- Synthetic data, which is computer-generated and serves as the most appropriate form of test data for testing new software functionality (for which test data doesn't exist), and for negative testing (to make the software fail intentionally).
Organizations should look for a test data management solution that supports the combination of data subsetting, data masking, and synthetic data generation.
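As a small illustration of synthetic data used for negative testing, the Python sketch below generates both well-formed and deliberately malformed email addresses and runs them through a validator. The `is_valid_email` function is a simplified stand-in for the software under test, and the generator, the regular expression, and the `example.test` domain are assumptions made for the example.

```python
import random
import re

# Simplified pattern, used here only as a stand-in for real validation logic.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value: str) -> bool:
    """Stand-in for the validation logic under test."""
    return bool(EMAIL_PATTERN.match(value))


def synthetic_emails(rng: random.Random, count: int, valid: bool) -> list:
    """Generate valid addresses, or deliberately malformed ones for negative tests."""
    emails = []
    for _ in range(count):
        user = f"user{rng.randint(1000, 9999)}"
        if valid:
            emails.append(f"{user}@example.test")
        else:
            # Malformed variants: no "@", no domain, no local part, doubled "@".
            emails.append(rng.choice([
                user,
                f"{user}@",
                f"@{user}.test",
                f"{user}@@example.test",
            ]))
    return emails


rng = random.Random(7)  # seeded so the dataset is reproducible across test runs
positive_cases = synthetic_emails(rng, 20, valid=True)
negative_cases = synthetic_emails(rng, 20, valid=False)

accepted = [email for email in positive_cases if is_valid_email(email)]
wrongly_accepted = [email for email in negative_cases if is_valid_email(email)]
print(len(accepted), len(wrongly_accepted))
```

Seeding the generator keeps the synthetic dataset stable between runs, which matters when the same negative cases need to be replayed to confirm a fix.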