Automated Data Preparation – 4 Issues and 4 Answers

Gil Trotino

Product Marketing Manager, K2View

Today’s “automated data preparation” tools aren’t really automated. This article discusses 4 key challenges on the road to automation, and 4 ways to overcome them.

Table of Contents


Real-Time Data Preparation is a Science
Semi-Automation is a Semi-Solution
Transitioning to Fully Automated Data Preparation Tools
4 Tips to Fully Automated Data Preparation
Built-In Automation With Every Data Preparation Hub

Real-Time Data Preparation is a Science

An automated data preparation process enables data engineering teams to deliver clean, fresh, analytics-ready data to data scientists, in less than a second.

The problem is that commercially available “automated data preparation” tools are, in fact, NOT fully automated. Built to give data scientists and business analysts the ability to generate insights independently, the current supply of semi-automated solutions suffers from a few significant flaws.

Self-service tools enable data scientists to model the data they receive for analytics, based on their unique requirements – but tend to ignore the first critical steps of actual data preparation and delivery, performed by data engineers. These limited standalone products address a relatively small part of the overall data preparation process, and often fail to meet critical market demands.

A partly automated data preparation solution translates into tedious, uninspiring work for data scientists. 

Semi-Automation is a Semi-Solution

There are 4 major problems with the self-service, semi-automated approach:

  1. Today’s approaches are insufficient

    Current self-service tools focus on the data scientist, but not on the data engineer, who must collect, map, connect, transform, and anonymize the data – all extremely complicated and time-consuming tasks. Full automation drastically reduces the effort and resources these tasks require, saving both time and money.

  2. Semi-automation leaves workers semi-satisfied

    Standalone solutions leave a significant part of the data scientist’s work outside the scope of automation. According to the 2020 Developer Survey from Stack Overflow, more than 20% of data scientists are actively looking for a new job. A fully automated process significantly reduces the number of repetitive and uninspiring tasks.

  3. Teams lack insights regarding the flow itself

    Self-service flows aren’t monitored thoroughly enough to allow teams to learn how often they’re used, and how they can be optimized. Real-time reporting maximizes the potential of the data flows, and saves a lot of time and work. However, today’s self-service solutions can’t provide this function due to their lack of automation.

  4. Needless complexity leads to inefficiency

    A data preparation process that isn’t fully automated creates bottlenecks, delays time-to-insight, leads to employee dissatisfaction, and adds to the organization’s operational costs. Instead of giving data engineering teams the freedom to build superior data preparation flows independently, it complicates the process and turns data preparation into a burden.
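The monitoring gap described in problem 3 can be illustrated with a minimal Python sketch. It assumes that a flow is just a function taking and returning a list of rows; the names `FlowMonitor` and `FlowStats` are hypothetical and are not part of any product mentioned in this article.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FlowStats:
    """Usage statistics for one data preparation flow (hypothetical type)."""
    runs: int = 0
    failures: int = 0
    durations: List[float] = field(default_factory=list)

    @property
    def mean_seconds(self) -> float:
        # Average wall-clock time per run; 0.0 if the flow never ran.
        return sum(self.durations) / len(self.durations) if self.durations else 0.0


class FlowMonitor:
    """Wraps flow invocations and records how often and how long they run."""

    def __init__(self) -> None:
        self.stats: Dict[str, FlowStats] = defaultdict(FlowStats)

    def run(self, name: str, flow: Callable[[list], list], rows: list) -> list:
        stats = self.stats[name]
        stats.runs += 1
        started = time.perf_counter()
        try:
            return flow(rows)
        except Exception:
            stats.failures += 1
            raise
        finally:
            # Record the duration for successful and failed runs alike.
            stats.durations.append(time.perf_counter() - started)

    def most_used(self) -> List[str]:
        # Flow names ordered from most to least frequently invoked.
        return sorted(self.stats, key=lambda n: self.stats[n].runs, reverse=True)
```

Calling `monitor.run("drop_empty", flow, rows)` instead of `flow(rows)` is enough to start collecting run counts, failure counts, and timings, which is the kind of usage data a team would need to decide which flows to optimize first.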

A fully automated data preparation solution lets data scientists focus on generating business insights.

Transitioning to Fully Automated Data Preparation Tools

When built correctly, data preparation flows can save companies a lot of time, work, and money. In 2017, Gartner predicted that 40% of data science tasks would be automated by 2020, but it’s up to us to make sure that this doesn’t mean automating parts of the process alone.

Instead of offering semi-automated, half-working solutions that address only the final step of selecting data for analysis, we should operationalize and productize the entire data preparation process – including the phases owned by the data engineering teams – and fully automate it. Fully automated data preparation tools allow data teams to focus on generating business insights to improve business outcomes.

4 Tips to Fully Automated Data Preparation

  1. Treat data as a product

    A data-driven enterprise maximizes the value of its IT systems by treating data as a product differentiated by quality (e.g., completeness, availability, accessibility, and general fitness for use). It productizes data preparation flows in order to drive business outcomes automatically.

  2. Serve data goals

    Data preparation flows do not exist in a vacuum. They should be built with the needs of the data-consuming teams in mind. Companies should look for solutions based on automated data preparation flows to serve multiple data-based business goals within the organization.

  3. Consider every step

    Automated data preparation flows should be constructed by data engineers so that they cover the 7 steps of collection, discovery and classification, cleansing, structuring, enrichment and anonymization, validation, and delivery. Once a flow is built, the analytics team can invoke and activate the relevant pre-built flows, which have already been tested and approved by the data engineer.

  4. Include a time machine

    The automated data preparation process must also include a “time travel” option for data versioning. Giving teams access to historical versions of the data simplifies the process and allows everything to move faster. The ability to revert to a previous “good” version is also a critical function when errors occur. Time travel removes the need to store multiple versions of the data in advance, which leads to unnecessary storage costs and might cause confusion about which version is the correct one.
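Tips 3 and 4 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: each of the seven steps is modeled as a plain function over a list of rows, every row carries an `id` field, and the names `run_flow` and `VersionedStore` are hypothetical rather than taken from any real product. To avoid keeping a full copy of every version, the store records only the rows that changed in each run and rebuilds a past version by replaying the log (row deletions are not handled in this sketch).

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

# The seven steps named in tip 3, in execution order.
STEP_NAMES = [
    "collection",
    "discovery and classification",
    "cleansing",
    "structuring",
    "enrichment and anonymization",
    "validation",
    "delivery",
]


def run_flow(steps: Dict[str, Step], rows: List[Row]) -> List[Row]:
    """Apply all seven steps in order; refuse to run an incomplete flow."""
    missing = [name for name in STEP_NAMES if name not in steps]
    if missing:
        raise ValueError(f"flow is missing steps: {missing}")
    for name in STEP_NAMES:
        rows = steps[name](rows)
    return rows


class VersionedStore:
    """Append-only change log; a past version is rebuilt by replaying it."""

    def __init__(self) -> None:
        self._log: List[Dict[object, Row]] = []

    def _replay(self, count: int) -> Dict[object, Row]:
        # Merge the first `count` change sets into one keyed state.
        state: Dict[object, Row] = {}
        for changes in self._log[:count]:
            state.update(changes)
        return state

    def commit(self, rows: List[Row], key: str = "id") -> int:
        """Record only new or changed rows and return the new version number."""
        current = self._replay(len(self._log))
        changes = {r[key]: dict(r) for r in rows if current.get(r[key]) != r}
        self._log.append(changes)
        return len(self._log) - 1

    def as_of(self, version: int) -> List[Row]:
        """Return copies of the rows as they stood at the given version."""
        if not 0 <= version < len(self._log):
            raise IndexError(f"unknown version: {version}")
        return [dict(r) for r in self._replay(version + 1).values()]
```

A data engineer would register one function per step, and each run would end with `store.commit(run_flow(steps, rows))`. If a later run turns out to be faulty, `store.as_of(previous_version)` returns the last “good” state without a separate backup copy having been made in advance.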

Built-In Automation With Every Data Preparation Hub

A data preparation hub focused on data engineers complements the self-service tools used by data scientists today. It delivers complete, clean, and connected data, enabling the standalone tools to select the best data for a particular workload more quickly and effectively than ever before.

Build and automate your data preparation process based on your business needs using K2View Data Preparation Hub.