Data scientists are rare and expensive resources in any organization. To make the most of their skills and experience, they should focus on building data models and running predictive analytics, not on working out how the data is structured and where to find it. They need their data pipeline tools to be fully automated, without compromising their ability to explore the data and run experimental models.
Today, most self-service data preparation tools deliver only partial automation for ad-hoc data preparation tasks. They cannot operationalize data into a fully automated data pipeline whose flows are built once and reused afterwards.
Data flows are iterative; they are normally built, tested, and packaged by data engineers.
Once the flows are packaged, data scientists can invoke and schedule them. When activating a pre-built data preparation flow, the data scientist decides which data to generate and where, when, and how it will be delivered.
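To make this concrete, here is a minimal Python sketch of what invoking a packaged flow could look like. The `PipelineClient` class, flow name, and parameters are illustrative stand-ins, not a specific product's API.

```python
from datetime import date

class PipelineClient:
    """Toy stand-in for a data pipeline tool's SDK (hypothetical)."""
    def run_flow(self, flow_name, params, schedule=None):
        print(f"Submitting flow '{flow_name}' with {params} (schedule={schedule})")

client = PipelineClient()

# The data scientist picks the data to generate and where, when, and how to deliver it.
client.run_flow(
    flow_name="customer_churn_features",         # flow built and packaged by data engineers
    params={
        "as_of_date": date.today().isoformat(),  # which data to generate
        "target": "s3://analytics/churn/",       # where it will be delivered
        "format": "parquet",                     # how it will be delivered
    },
    schedule="0 6 * * *",                        # when: daily at 06:00 (cron syntax)
)
```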
An Enterprise Data Pipeline uses data versioning, so data scientists can reproduce previous data sets and access historical versions of the data.
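As a rough illustration of data versioning, the sketch below loads either the latest or a pinned historical version of a data set. The client class and dataset name are hypothetical; real tools expose their own version or time-travel APIs.

```python
class VersionedDataStore:
    """Toy stand-in for a versioned data lake/warehouse client (hypothetical)."""
    def __init__(self):
        # dataset -> {version: rows}; in a real pipeline this metadata lives in the store
        self._data = {"customer_churn_features": {1: ["row_a"], 2: ["row_a", "row_b"]}}

    def load(self, dataset, version=None):
        """Return the latest version by default, or a pinned historical version."""
        versions = self._data[dataset]
        if version is None:
            version = max(versions)
        return versions[version]

store = VersionedDataStore()
latest = store.load("customer_churn_features")             # current data
pinned = store.load("customer_churn_features", version=1)  # reproduce an earlier run
print(latest, pinned)
```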
It keeps your data lakes and data warehouses in sync with your data sources, based on pre-defined data sync rules. Data changes can be ingested into your data stores via the delivery method of your choice: Extract, Transform, Load (ETL), data streaming, Change Data Capture (CDC), or messaging.
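The sketch below shows one way such sync rules might be expressed and dispatched to a delivery method. The rule fields, table names, and handlers are assumptions for illustration; actual products define their own rule formats and ingestion engines.

```python
# Hypothetical sync rules mapping source tables to target stores and delivery methods.
SYNC_RULES = [
    {
        "source": "crm.orders",                # source system table
        "target": "warehouse.orders",          # data warehouse table to keep in sync
        "delivery": "cdc",                     # etl | streaming | cdc | messaging
        "frequency": "continuous",
    },
    {
        "source": "erp.inventory",
        "target": "lake.inventory_snapshots",
        "delivery": "etl",
        "frequency": "hourly",
    },
]

def apply_rule(rule):
    """Dispatch a sync rule to its configured ingestion method (stubbed here)."""
    handlers = {
        "etl": lambda r: print(f"Batch ETL {r['source']} -> {r['target']}"),
        "streaming": lambda r: print(f"Stream {r['source']} -> {r['target']}"),
        "cdc": lambda r: print(f"Capture changes on {r['source']} -> {r['target']}"),
        "messaging": lambda r: print(f"Publish {r['source']} changes to a message queue"),
    }
    handlers[rule["delivery"]](rule)

for rule in SYNC_RULES:
    apply_rule(rule)
```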