Your Data Preparation Process Requires a Data Fabric, Not Standalone Tools

Gil Trotino

Product Marketing Manager, K2View


Table of Contents


Data Preparation Process Flow
Data Preparation Tools and Trends
Data Fabrics Enter the Fray
The Entity-Based Data Fabric – a New Approach to the Data Preparation Process
Data Fabric: The Future of Data Preparation

According to Gartner, by 2024, augmented data preparation, data catalogs, data unification, data virtualization and data quality tools will converge into a consolidated data fabric used for the majority of new analytics/data science projects.

When determining your company’s data preparation process, a data fabric offers you more than today’s standalone tools, because the same clean, trusted data can also be delivered to your operational applications. A data fabric also operationalizes the process: data engineers build data prep flows once, then package them for reuse by other data preparation flows and other departments.


Data Preparation Process Flow

The data preparation process lifecycle consists of the following steps:

  • Data discovery

  • Collection

  • Cleansing

  • Transformation

  • Enrichment

  • Anonymization (masking)

Ideally…

  • The data preparation process is automatic, including discovery of all data from all sources (legacy and otherwise), followed by data collection and cleansing.

  • The data is then transformed into current, usable formats, and can even be enriched with new data from additional sources and processing.

  • The information is made unidentifiable – to protect personal and/or sensitive data – before being delivered to a data lake or warehouse, or recycled for another iteration.

  • The data can be made available to data stores in any ingestion mode – batch mode, data streaming, Change Data Capture (CDC), or messaging – with all data iterations stored, and ready for reuse.

  • The data preparation process can be operationalized – from a self-service, one-off model, into a lifecycle that can be packaged for automation (sketched in code below).
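To make these ideals concrete, here is a minimal Python sketch of the lifecycle, with collection represented by an input batch and cleansing, transformation, enrichment, and masking applied in a single pass. Every function, field, and threshold below is a hypothetical placeholder for illustration, not part of any vendor’s API:

```python
# A minimal sketch of the "ideal" lifecycle above. Discovery/collection is
# represented by the input batch; cleansing, transformation, enrichment,
# and masking happen in one pass. All names are hypothetical placeholders.
import hashlib

def mask(value: str) -> str:
    """Anonymize a sensitive field with a one-way hash (illustrative only)."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def prepare(records: list[dict]) -> list[dict]:
    """Cleanse, transform, enrich, and mask a batch of collected records."""
    prepared = []
    for rec in records:
        if not rec.get("customer_id"):      # cleansing: drop incomplete rows
            continue
        rec = dict(rec)                     # work on a copy, not the source
        rec["email"] = rec.get("email", "").strip().lower()  # transformation
        rec["segment"] = (                  # enrichment: derive a new attribute
            "vip" if rec.get("lifetime_value", 0) > 10_000 else "standard"
        )
        rec["email"] = mask(rec["email"])   # anonymization (masking) last
        prepared.append(rec)
    return prepared

# The prepared batch can then be delivered to a lake or warehouse in any
# ingestion mode (batch, streaming, CDC, or messaging) and stored for reuse.
print(prepare([{"customer_id": "c1",
                "email": " Ann@Example.com ",
                "lifetime_value": 12000}]))
```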

Before describing how the ideal can become a reality, let’s have a look at current data preparation tools and trends.

Data Preparation Tools and Trends

There are three categories of data preparation tools: standalone tools, tools offered by analytics providers, and tools offered by integration providers.

Niche standalone tools are consolidating into analytics and integration tools.


Standalone tools are self-service products for data scientists and data analysts. However, as Gartner predicts, the trend is for data preparation tools to be consolidated into analytics and integration platforms. Case in point: Paxata was recently acquired by DataRobot.

Analytics tools answer the needs of the business, with a focus on building analytical models, dashboards, and reporting. While this is an important endeavor, data preparation is just one of many elements in an analytics/BI/ML package. Because data integration is not a core competency among these providers, data preparation remains a thin component and will most likely not receive the kind of attention and investment it deserves.

Integration tools are developed for data engineers, with the providers speaking the same language as their IT peers. For vendors like Informatica, K2View, Qlik, and Talend, data preparation is their bread and butter. The depth of knowledge inherent in the world of data integration in all its complexity – accessing sources, mapping, real-time updating, etc. – makes data preparation for data engineers a much richer offering.

Data Fabrics Enter the Fray

Enterprises seeking solutions for their data preparation process today should also be thinking about tomorrow, and give priority to a broader approach. As discussed, with standalone tools disappearing and analytics tools likely to fall short on features and updates, integration tools are left to lead the way. For the most part, integrated data preparation tools are part of a consolidated data fabric. Data fabrics already have the capabilities to connect to all data sources, process and transform the data, and then make it available to consuming applications. They can also be used to feed this trusted data into data lakes and warehouses.

In an entity-centric data fabric, digital entities ensure that data always flows smoothly between sources and targets.

The Entity-Based Data Fabric – a New Approach to the Data Preparation Process

An entity-based data fabric is ideally suited for data preparation. It lets you define an intermediary data schema aggregating all the attributes of a business entity (such as a customer, or an order) across all systems, and provides you with the tools to prepare and deliver the data as an integrated “digital entity”.

So, instead of moving data table by table, you prepare and pipeline the data via digital entities. In other words, you collect data from all sources into digital entities – cleansing, enriching, masking, and transforming it along the way.
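As a rough illustration of the idea (the source systems, lookup structure, and entity fields below are assumptions made for the sketch, not a description of K2View’s actual schema), a digital entity can be modeled as one object whose attributes are collected from each underlying system:

```python
# Illustrative sketch of entity-based collection: instead of moving data
# table by table, all attributes of one business entity are gathered from
# every source into a single "digital entity". The source systems and
# their lookup dicts are hypothetical stand-ins for real connectors.
from dataclasses import dataclass, field

@dataclass
class CustomerEntity:
    """One customer's data, aggregated across all systems."""
    customer_id: str
    crm: dict = field(default_factory=dict)      # e.g., name, email
    billing: dict = field(default_factory=dict)  # e.g., invoices, balance
    support: dict = field(default_factory=dict)  # e.g., open tickets

def build_entity(customer_id: str, sources: dict[str, dict]) -> CustomerEntity:
    """Collect one customer's attributes from each source system."""
    return CustomerEntity(
        customer_id=customer_id,
        crm=sources["crm"].get(customer_id, {}),
        billing=sources["billing"].get(customer_id, {}),
        support=sources["support"].get(customer_id, {}),
    )

sources = {
    "crm": {"c1": {"name": "Ann", "email": "ann@example.com"}},
    "billing": {"c1": {"balance": 42.0}},
    "support": {},  # no tickets for this customer
}
print(build_entity("c1", sources))
```

Cleansing, enrichment, and masking then operate on each entity as a unit, rather than on disconnected tables.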

Complete and quick
With entity-based data preparation tools, the prepared data is always complete. Moreover, transformations are performed at the entity level (rather than at the table level), cutting execution to a fraction of the time that table-level transformations require.
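A hedged sketch of that performance point: once data is partitioned by entity key, a transformation touches only one entity’s rows rather than rescanning the whole table. The keys and fields below are invented for illustration:

```python
# Hypothetical illustration of entity-level vs. table-level scope:
# a table-level job rescans every row on each run, while an entity-level
# job touches only the few rows keyed to one business entity, so entities
# can be prepared (and re-prepared) independently.
from collections import defaultdict

def index_by_entity(table: list[dict], key: str = "customer_id") -> dict:
    """One-time partition of a table by entity key."""
    index = defaultdict(list)
    for row in table:
        index[row[key]].append(row)
    return index

def transform_entity(rows: list[dict]) -> list[dict]:
    """Apply a transformation to just one entity's rows."""
    return [{**row, "amount_usd": row["amount_cents"] / 100} for row in rows]

orders = [
    {"customer_id": "c1", "amount_cents": 1250},
    {"customer_id": "c2", "amount_cents": 9900},
]
by_customer = index_by_entity(orders)
print(transform_entity(by_customer["c1"]))  # only c1's rows are processed
```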

Connected and understood
The entity fabric consistently delivers data that is connected, and therefore easily accessible for querying in the data lake. And the digital entity schema is easily understood by business analysts, who have no knowledge of the underlying source systems and data structures.

Analytical and operational
The prepared data in the fabric can be delivered to support offline analytical workloads (in data lakes and warehouses) as well as operational workloads – such as a 360 customer view, operational intelligence, and GDPR and CCPA compliance – without affecting sources or targets in any way.

Flexible and insightful
The data fabric treats a data lake or warehouse as both a target and a source. For example, insights that are generated in a lake can be pipelined back to be captured in the source systems.

Additionally, an entity-based data fabric can:

  • Serve complete and connected data for both operational and analytical use cases.

  • Support the entire data preparation process, and transition it from self-service to full automation.

  • Operationalize the process, by having data engineers build data prep flows, and then package them for reuse by other data preparation flows and other departments.

  • Apply data governance to control access to data, and ensure compliance with data privacy regulations (see the sketch below).
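To illustrate that last bullet, here is a minimal sketch (the roles and policy rules are invented for the example, not a K2View feature list): an access policy decides, per consumer role, which entity attributes are returned in the clear and which are masked:

```python
# Hypothetical illustration of governed access to prepared entity data:
# a policy maps consumer roles to the fields they may see unmasked.
POLICY = {
    "data_scientist": {"customer_id", "segment"},          # masked PII only
    "support_agent": {"customer_id", "segment", "email"},  # may see email
}

def govern(record: dict, role: str) -> dict:
    """Return a copy of the record with unauthorized fields masked."""
    allowed = POLICY.get(role, set())
    return {k: (v if k in allowed else "***") for k, v in record.items()}

record = {"customer_id": "c1", "email": "ann@example.com", "segment": "vip"}
print(govern(record, "data_scientist"))  # email masked
print(govern(record, "support_agent"))   # email visible
```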

Data Fabric: The Future of Data Preparation

Today’s enterprises need future-proof data preparation tools, such as those found in data fabrics. An entity-based data fabric operationalizes data preparation by turning it into a high-speed service – with split-second, source-to-target data delivery – and by having data engineers build and package data preparation flows for reuse.

Learn how to keep your data lakes and warehouses fresh, with clean and complete data.