Data preparation is the process of collecting, joining, culling, cleansing, and otherwise transforming big data into a form that applications and users can trust and readily ingest for analytical and operational use cases.
These use cases are constantly growing across the enterprise and include offline big data analysis (by data analysts and data scientists), as well as online operational use cases, in support of real-time intelligence. The need for trusted, high-quality, complete, and current data—to drive these use cases—is paramount. As a result, slow, tedious, error-prone, and manual processes for data preparation are simply no longer acceptable.
Why is proper data preparation so necessary? Simply put, it ensures confidence: confidence in the accuracy of the data, in the way it is prepared, and, of course, in the results and insights we get from it. Consistent data preparation ensures that insights derived from the data can be trusted every time. Poorly prepared data can lead to erroneous conclusions. As they say, “garbage in, garbage out.”
Today’s organizations seek a single platform for self-service, automated data preparation, for all enterprise use cases—both online and offline.
Data preparation is critical to the success of all types of data processing, whether offline or real-time, analytical or operational. Here are just a few enterprise use cases that depend on consistent, accurate data preparation.
Offline use cases – These typically rely on well-prepared data that is stored in data warehouses and data lakes. As they analyze and operate on historical data, they aren’t concerned with up-to-the-moment data. However, the massive size of these data stores means data preparation can take a long time.
Some common examples that require high-quality offline data include:
• Ad-hoc business intelligence
• Predictive analytics
• Machine learning
• Test data provisioning
• Customer Data Platform for marketing use cases
Operational intelligence – Unlike offline analytics, operational use cases need up-to-date data to make rapid analyses, deliver actionable in-the-moment insights, provide better customer experiences, and automate decisions in real time, based on the latest data. That means operational intelligence use cases require integration with live operational systems, not just static historical data from a big data warehouse. Here are some examples:
• Customer 360 for call center and customer self-service
• Next-best-action recommendations (for example, to reduce customer churn)
• Product recommendations
• Real-time fraud detection
• Credit scoring and approvals/rejections
• Network monitoring and intrusion detection
• Artificial intelligence and automated real-time decisioning
We know why consistent, accurate data is essential, but why does preparing it take so long? We need only look at the multitude of steps involved to see why.
Only after all these steps can you consume and trust the resulting data—with confidence in the results or insights you will receive.
The above steps are required to get data from different internal systems and external sources into a form usable by the target application. Even if the steps are scripted and semi-automated for offline use cases, they have to be scheduled and initiated well in advance. Operational use cases, on the other hand, generally require complex real-time integrations with dozens of enterprise systems, as well as coding of all these steps for rapid in-the-moment execution.
But that’s just the beginning. There are other data management-related problems that make data preparation that much harder.
If the enterprise could automatically access all the quality, up-to-date data it needs for any use case that requires big data preparation, how much different would the business landscape be? Having consistent, accurate, and timely data available for both analytical and operational use cases would make things better in so many ways:
The enterprise needs a single data preparation solution for offline and real-time operational use cases. The biggest obstacle to realizing these benefits is the lack of simple-to-use, self-service tools that can quickly and consistently prepare quality data. That’s because most big data is prepared database by database, table by table, and row by row, and then combined with other tables through joins, indexes, and scripts. For example, all invoices, all payments, all customer orders, all service tickets, all marketing campaigns, and so on, must be collected and processed separately. Complex data matching, validation, and cleansing logic is then required to ensure that the data is complete and clean, with data integrity maintained at all times.
A better way is to collect, cleanse, transform, and mask data as a business entity—holistically. So, for example, a customer business entity would include a single customer’s master data, together with their invoices, payments, service tickets, and campaigns.
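To make the idea concrete, here is a minimal Python sketch of gathering rows from separate source tables into one record per customer. It is an illustration only, not how any particular product implements business entities: the `CustomerEntity` class, the `build_entities` function, and the `customer_id` key are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerEntity:
    """Everything known about one customer (hypothetical shape)."""
    customer_id: str
    master: dict = field(default_factory=dict)
    invoices: list = field(default_factory=list)
    payments: list = field(default_factory=list)
    tickets: list = field(default_factory=list)


def build_entities(customers, invoices, payments, tickets):
    """Group rows from separate tables into one entity per customer_id."""
    # Start one entity per customer master record.
    entities = {
        c["customer_id"]: CustomerEntity(c["customer_id"], master=c)
        for c in customers
    }
    # Attach each related row to the customer it belongs to.
    sources = ((invoices, "invoices"), (payments, "payments"), (tickets, "tickets"))
    for rows, attr in sources:
        for row in rows:
            entity = entities.get(row["customer_id"])
            if entity is not None:  # rows with no matching customer are skipped
                getattr(entity, attr).append(row)
    return entities
```

Once the data is grouped this way, each customer can be cleansed, validated, or masked as a unit, instead of coordinating the same logic across several tables.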
K2View Data Fabric supports data preparation by business entities.
It allows you to define a Digital Entity schema to accommodate all the attributes you want to capture for a given business entity (like a customer or an order), regardless of the source systems in which these attributes are stored. It automatically discovers where the desired data lives across your enterprise systems, and it creates data connections to those source systems. K2View Data Fabric collects data from source systems and stores it as an individual digital business entity, each in a separate encrypted micro-database, ready to be delivered to a consuming application, big data store, or user.
The solution keeps the data synchronized with the sources on a schedule you define. It automatically applies filters, transformations, enrichments, masking, and other steps crucial to quality data preparation. So, your data is always complete, up-to-date, and consistently and accurately prepared, ready for analytics or any operational use case you can conceive.
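As a rough illustration of what applying such steps to a single business entity can look like, the Python sketch below filters, transforms, enriches, and masks one customer record. It does not reflect K2View's actual API or rules; the `prepare_entity` and `mask_email` functions and the field names (`name`, `email`, `invoices`, `amount`) are assumptions made up for this example.

```python
import hashlib


def mask_email(email):
    """Replace the local part of an email address with a short, stable hash."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return f"{digest}@{domain}"


def prepare_entity(entity):
    """Return a prepared copy of one customer entity (a plain dict)."""
    prepared = dict(entity)  # shallow copy, so the source record is not modified
    # Filter: drop invoices that have no amount.
    invoices = [inv for inv in entity.get("invoices", []) if inv.get("amount") is not None]
    prepared["invoices"] = invoices
    # Transform: normalize whitespace and capitalization of the name.
    prepared["name"] = entity["name"].strip().title()
    # Enrich: add a derived total across the remaining invoices.
    prepared["total_invoiced"] = sum(inv["amount"] for inv in invoices)
    # Mask: hide the identifying part of the email address.
    prepared["email"] = mask_email(entity["email"])
    return prepared
```

Because every step operates on one entity at a time, the same function can run on a schedule over stored entities or on demand for a single customer.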
Looking at your data holistically at the business entity level ensures data integrity by design, giving your data teams quick, easy, and consistent access to the data they need. You always get the insights and results you can trust because you have data you can trust.