What is Data Integration?

The Complete Handbook 

Data integration is the process of unifying datasets from multiple sources
to provide a single view of the data for analytics and operational intelligence.



Data Integration Overview

Data integration tools unify data from different sources into a single view. They make trusted data more accessible and easier to consume by: 

  • Enhancing operational efficiency, by reducing the need to manually transform and combine datasets 
  • Improving data quality, via automated transformations that apply business rules to data 
  • Facilitating more valuable insights, through a 360-degree view of the data that can be more easily analyzed 

According to analyst firm Gartner, data integration has traditionally been achieved by a combination of related tools, with different vendors offering different tools. 

Although the primary market for data integration tools is ETL, there are overlapping submarkets of tools for data replication and data federation (EII), as well as peripheral markets of tools for data modeling, data quality, and adapters. 

The problems associated with enterprise data integration are a direct result of this market confusion, with:  

  • Different teams using different tools
  • A complete lack of consistency
  • Lots of overlap and redundancy
  • No common management (or leveraging) of metadata 

Enterprises have therefore been forced to amass a wide range of tools, from multiple vendors, to address all of their data integration needs.

As more and more organizations demand cross-company data integration, a new breed of highly focused vendors is stepping up to meet the market need with innovative technologies.

The market is now ripe for an all-in-one data integration platform, supporting all data integration styles, based on common design tooling, metadata, and runtime architecture. 

Chapter 01

Data Integration Trends and Drivers

In the Gartner Magic Quadrant for Data Integration Tools for 2022, the following 10 trends and drivers are listed:

1. Year over year growth

The data integration market's growth rate rose from 6.8% in 2020 to 11.8% in 2021, with the 5-year Compound Annual Growth Rate (CAGR) increasing from 7.0% to 8.5%. Cloud adoption is expanding, with the iPaaS market growing by 40.2% in 2021.
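
As a reminder of what the CAGR figures above mean, here's a minimal sketch of the calculation (the revenue numbers are hypothetical, not taken from the Gartner report):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound Annual Growth Rate: the constant yearly rate that
    takes start_value to end_value over the given number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Hypothetical example: revenue growing from $3.0B to $4.5B over 5 years
rate = cagr(3.0, 4.5, 5)
print(f"{rate:.1%}")  # roughly 8.4% per year
```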

2. Market disruption by small vendors

The top 5 vendors in the data integration market had a combined market share of 71% in 2017. By 2021, their collective market share dropped to 52%. One explanation for this phenomenon is that smaller vendors are disrupting the market with more focused, flexible offerings.

3. Demand for modern data delivery styles and cloud data integration

Vendors gaining market share are focusing on data virtualization, data replication, or cloud data integration. The 5 top vendors in this group collectively grew their revenue by 32% in 2021. To combat their drop in market share, the larger vendors must find the right tradeoff between all-encompassing platforms and easily accessible point solutions.

4. Augmented data integration with data fabric

Data fabric answers the immediate need for augmented data integration. By taking a data-as-a-product approach, data fabric architecture automates the design, integration and deployment of data products, resulting in faster, more automated data access and sharing.

5. Decentralization with data mesh

Enterprises exploring data mesh need decentralized, domain-oriented delivery of data product datasets. Business teams are looking to data mesh architecture to:
– Access the data they need on their own, to meet their SLAs more effectively
– Design data products based on their own, unique subject matter expertise
– Benefit from semantic modeling, ontology creation, and knowledge graph support

6. Cost optimization through financial governance

Data integration tools need to be able to track and forecast all the costs associated with cloud integration workloads. Data teams should be able to connect the cost of running data integration workloads to business value, and allocate processing capacity accordingly. Data integration tools should therefore support financial governance, including price-performance calculations.

7. Data engineering support

Data engineering translates data into usable forms by building and operationalizing a data pipeline across data and analytics platforms, making use of software engineering and infrastructure operations best practices. With more data infrastructure operating on the cloud, platform operations are becoming a core responsibility.

8. DataOps support

Data integration tools should support DataOps enablement. DataOps is focused on building collaborative workflows to make data delivery more predictable. Key features include the ability to:
– Deliver data pipelines through CI/CD
– Automate testing and code validation
– Integrate data integration tools with project management and version-control tooling
– Balance low-code capabilities with custom coding
– Manage non-production environments in an agile way
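
Automated testing of pipeline logic, for example, can be as simple as unit tests that run in CI on every commit. A minimal sketch, in which the transformation and its rules are hypothetical:

```python
def normalize_customer(record: dict) -> dict:
    """Example pipeline step: standardize a raw customer record."""
    return {
        "customer_id": int(record["id"]),
        "email": record["email"].strip().lower(),
        "country": record.get("country", "UNKNOWN").upper(),
    }

# A test like this runs in CI/CD before the pipeline is deployed,
# catching breaking changes to the transformation logic early.
def test_normalize_customer():
    raw = {"id": "42", "email": " Ada@Example.COM ", "country": "us"}
    out = normalize_customer(raw)
    assert out == {"customer_id": 42, "email": "ada@example.com", "country": "US"}

test_normalize_customer()
print("all pipeline tests passed")
```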

9. Hybrid and inter-cloud data management support

Cloud architectures for data management (iPaaS) span hybrid, multi-cloud, and inter-cloud deployments. There are advantages and disadvantages to managing data across diverse and distributed environments. Data location affects:
– Application latency SLAs
– Data sovereignty
– Financial models
– High availability and disaster recovery strategies
– Performance

According to Gartner, almost half of all data management implementations use both on-prem and cloud environments. Support for hybrid data management and integration has therefore become critical – and today’s data integration tools are expected to dynamically construct, or reconstruct, integration infrastructure across a hybrid data management environment.

10. Prevention of application/CSP/database lock-in

There’s a definite need for independent data integration tools that do not rely on the storage of data in a vendor repository or cloud ecosystem. Embedded data integration capabilities, delivered by vendors, could lead to potential lock-in issues, costs for leaving the service, and data silos. Although some native CSP data integration tools allow for two-way integration (to and from their own cloud data stores), any organization using more than one cloud service provider should acquire independent data integration tools.


Chapter 02

How Data Integration Works

Simply put, data integration routes data from source to target systems. Sometimes, the data is physically relocated to the target system (e.g., inflight integration of multiple data streams). At other times, source datasets are copied onto a big data store (e.g., consolidation of transactional data for analysis in a data lake). 

System architects and software developers design programs that automate and manage data integration. While some types of data integration are fairly straightforward, like replicating data from one system to another, other types require the harmonization of different database schemas found in different source systems. 

A common harmonization approach is to create a mediated schema: the smaller, local source schemas are merged into a larger, global schema, and each source's data elements are mapped to it. This can be done in a target system, such as a data lake, or in a virtualized environment that delivers a unified view of the data from all the different systems, without physically loading it into a new data store. 
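
As an illustration of the mediated-schema approach, the sketch below maps two hypothetical source schemas onto one global schema (all field and system names are invented for the example):

```python
# Two source systems describe the same entity with different local schemas.
crm_row = {"cust_id": 7, "full_name": "Ada Lovelace", "mail": "ada@example.com"}
billing_row = {"customer_no": 7, "name": "Ada Lovelace", "email_addr": "ada@example.com"}

# The mediated (global) schema, plus a mapping from each local schema to it.
MAPPINGS = {
    "crm":     {"cust_id": "customer_id", "full_name": "name", "mail": "email"},
    "billing": {"customer_no": "customer_id", "name": "name", "email_addr": "email"},
}

def to_mediated(source: str, row: dict) -> dict:
    """Translate a local source row into the mediated schema."""
    return {global_field: row[local_field]
            for local_field, global_field in MAPPINGS[source].items()}

# Both sources now yield the same unified view of the entity.
print(to_mediated("crm", crm_row) == to_mediated("billing", billing_row))  # True
```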

Chapter 03

Types of Data Integration

The most common type of data integration is Extract, Transform and Load (ETL), where data is: 

  • Extracted from its source systems, 

  • Transformed (consolidated and filtered) for use in analytics workloads, and then  

  • Loaded into a data warehouse.  

ETL is a batch process that typically involves massive amounts of bulk data. 

An alternative to ETL is the Extract, Load and Transform (ELT) type of data integration, in which the second and third steps of the ETL process are inverted. With ELT, raw data is: 

  • Extracted from its source systems, 

  • Loaded into a target system, and then  

  • Transformed, as required, for individual analytics applications.  

ELT is often favored by data scientists, who typically prepare data themselves, in order to access the data they need – for machine learning apps, predictive modeling, and other variations of advanced analytics – when they need it. 
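
The difference between the two styles can be sketched in a few lines of Python (the source rows and in-memory "warehouse" stand-ins are hypothetical):

```python
# Raw rows as extracted from a source system.
source = [{"amount": "19.99", "region": " emea "}, {"amount": "5.00", "region": "amer"}]

def transform(rows):
    """Consolidate and filter: parse amounts, normalize regions."""
    return [{"amount": float(r["amount"]), "region": r["region"].strip().upper()}
            for r in rows]

# ETL: transform in flight, then load curated data into the warehouse.
warehouse_etl = transform(source)

# ELT: load the raw data first; each analytics app transforms on demand.
warehouse_raw = list(source)
analysis_view = transform(warehouse_raw)

# Either way, the analytics-ready result is the same; what differs is
# where (and when) the transformation work happens.
assert warehouse_etl == analysis_view
```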

A good answer to the ETL vs ELT question is entity-based ETL (eETL), available on real-time data integration platforms, like Data Product Platform. 

Other types of data integration include: 

  • Change Data Capture (CDC), which syncs data updates between source and target systems in real time; 

  • Data streaming, which integrates real-time data streams, and stores the newly combined datasets into databases, for operational and analytical use cases; 

  • Data replication, which copies data from one system to another, for use in operational, backup, and Disaster Recovery (DR) workloads; and 

  • Data virtualization, which gives business users and data analysts an integrated view of different datasets without having to rely on IT to load the data into a data warehouse, operational database, or any other target system.  

    Data virtualization tools are growing in popularity because they can augment existing analytics architectures for particular applications, or be used in conjunction with a data warehouse that may include a mix of different platforms. 
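
To illustrate the CDC style listed above: instead of re-copying the whole source, the target replays only the changes captured since the last sync. A toy sketch, with an invented change-log format:

```python
# Change-log entries as a source system might emit them (hypothetical format).
change_log = [
    {"op": "insert", "key": 1, "value": {"name": "Ada"}},
    {"op": "insert", "key": 2, "value": {"name": "Grace"}},
    {"op": "update", "key": 1, "value": {"name": "Ada L."}},
    {"op": "delete", "key": 2, "value": None},
]

def apply_changes(target: dict, changes) -> dict:
    """Replay captured changes onto the target, keeping it in sync
    with the source without a full reload."""
    for c in changes:
        if c["op"] == "delete":
            target.pop(c["key"], None)
        else:  # insert or update
            target[c["key"]] = c["value"]
    return target

target = apply_changes({}, change_log)
print(target)  # {1: {'name': 'Ada L.'}}
```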

Chapter 04

Data Integration Use Cases

The most common data integration use cases include: 

  • Building data stores
    Large enterprises use data integration to build data stores (data fabric vs data lake vs database) to conduct analysis, generate reports, run queries, and provision data in a consistent manner. 

  • Leveraging big data
    Big data enterprises, like Amazon, Facebook, and Google, generate valuable business information from billions of users each day, making data integration tools a necessity for today's businesses. 

  • Generating operational intelligence
    By delivering a unified view of data from multiple sources, data integration becomes the foundation for your operational intelligence platform. With more information at their disposal, analysts can make more accurate predictions, at any scale. 

Chapter 05

Data Integration Benefits

There are several benefits associated with data integration, including: 

  • Better collaboration
    While different business domains demand access to company data for shared and individual projects, IT demands a secure way to provide self-service data access. Since domain experts generate data that the rest of the company needs, data integration improves collaboration. 

  • Fewer errors
    To manually gather data, data teams have to know where to look, and what tools to use, to assure the completeness and accuracy of their datasets. Without data sync, reporting must be redone periodically to account for any changes. But with automatic updates, reports can be generated in real time, whenever they’re needed. 

  • Greater efficiency
    By automating unified views of company data, data teams no longer need to manually create connections whenever they have to build an application, or generate a report. The time saved on hand-coded data integration can be used more efficiently on analysis and execution, for greater productivity and competitiveness. 

  • Cleaner, fresher data
    Data integration tools can improve the value of a company’s data over time. With data integrated into a centralized repository, data teams can spot quality issues, and fix them quickly and easily – resulting in higher quality data, and, ultimately, better data analysis. 

Chapter 06

Data Integration Challenges

The 5 most common challenges data teams face when implementing data integration include:

  1. Attention to detail
    Before using their data integration tools, data teams need to be fully aware of:
    – What data needs to be collected and analyzed
    – Where that data is stored
    – Who the stakeholders are (data engineers, data architects, data scientists, data consumers in business and IT)
    – Which systems are using the data
    – How the analysis is conducted
    – When the data needs to be updated
    – What reports have to be generated 

  2. Data integration with legacy systems 

    Data stored in legacy systems often lacks the times and dates of activities, which modern systems typically include. 

  3. Data integration with newer systems 

    Newer systems generate structured and unstructured data, in batch or real time, from a wide variety of sources, including the cloud, IoT machines, sensors, and video. Integrating this diverse array of information may be crucial for the business, but also extremely challenging for IT – in terms of accommodating scale (massive volumes), performance (sub-second response times), and all the different formats and technologies out there today. 

  4. Data integration with external sources 

    Data extracted from external sources may not come with the same level of detail as internal sources, making it difficult to analyze with the same level of precision. Data sharing may also be an issue, due to contractual obligations.  

  5. Data compliance 

    Data teams would be well advised to adhere to data integration best practices, in terms of complying with the quality demands of the organization, and privacy laws enacted by the regulatory authorities.

Chapter 07

Data Integration Requirements

When choosing your data integration tools, make sure they’re: 

  • Cloud-compatible
    As organizations migrate to hybrid cloud environments, ensure your data integration tools can run in single-cloud, multi-cloud, or hybrid-cloud environments. 

  • Cost-effective
    Make sure your data integration tools come with automatic updates, and parameter changes (increasing the volume of the data, or number of connectors), free of charge. 

  • Easy to use
    Today’s data integration tools should be as intuitive as possible, with a friendly GUI for simpler visualization of your data pipelines. 

  • Equipped with lots of connectors
    There are many applications and systems on the market today, so the more connectors your data integration tools have, the better off you’ll be (in terms of saving time and resources). 

  • Open source
    An open source architecture delivers greater flexibility, and helps avoid vendor lock-in. 

Chapter 08

Data Integration via Data Products

A data product approach to data integration processes and delivers data by business entities. When data integration tools are based on data products, they allow data engineers to build and manage reusable data pipelines, for both operational and analytical workloads. 

Entity-based tools auto-discover and model the data product schema, to define all the fields and tables for a specific business entity (e.g., customer, location, device, etc.) from all existing systems – regardless of the underlying technologies or formats.  

A data product ingests the data for each instance of a business entity into its own individually encrypted Micro-Database™.  

It then applies data masking tools and data orchestration tools – inflight – to make the individual entity's data accessible to any authorized data consumer, in any delivery style, with full compliance. 
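
Inflight masking of this kind might look like the following sketch; the masking rules and field names are hypothetical, not the platform's actual logic:

```python
def mask_email(email: str) -> str:
    """Keep the domain for analytics; hide most of the local part."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def mask_entity(entity: dict) -> dict:
    """Apply masking in flight, before the entity's data
    reaches the data consumer."""
    masked = dict(entity)
    masked["email"] = mask_email(entity["email"])
    masked["ssn"] = "***-**-" + entity["ssn"][-4:]
    return masked

customer = {"id": 7, "email": "ada@example.com", "ssn": "123-45-6789"}
print(mask_entity(customer))
# {'id': 7, 'email': 'a***@example.com', 'ssn': '***-**-6789'}
```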

Chapter 09

Data Product Platform with Data Integration Built In

Data Product Platform integrates and delivers data from any data source (on-prem or cloud) to any data consumer, using any data delivery method: bulk ETL, data streaming, data virtualization, log-based data changes (CDC), message-based data integration, reverse ETL, and APIs.   

The platform has a no-code data transformation tool for defining, testing, and debugging data logic and flows, designed for every need:  

  • Basic, including data type conversions, numerical calculations, and string manipulations.  

  • Intermediate, such as Find/Replace functionality, access to aggregated data, data matching, reference files, and summarizations. 

  • Advanced, like cleansing, combining (data and content sources), correlating, enriching, parsing, processing unstructured data, mining text, and validating data. 
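
The "basic" tier above corresponds to simple, field-level operations. A few illustrative examples in plain Python (not the platform's own transformation language):

```python
# Data type conversion: a string field parsed into a number.
amount = float("1234.50")

# Numerical calculation: apply a (hypothetical) 17% VAT rate.
amount_with_vat = round(amount * 1.17, 2)

# String manipulations: trim, normalize case, and split a name field.
name = "  lovelace, ada  ".strip().title()   # "Lovelace, Ada"
last, first = [p.strip() for p in name.split(",")]

print(amount_with_vat, first, last)
```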

Unlike conventional ETL and ELT approaches, the platform makes use of an in-memory database to transform data in real time. Data transformation is applied at the level of the individual business entity, delivering millisecond response times at massive data pipeline scale.

Chapter 10

Recognition for Data Product Platform


According to the 2022 Gartner Magic Quadrant for Data Integration Tools report, Data Product Platform:

  • Aligns with common business outcomes and use cases – like customer 360, data masking, test data management, data migration, legacy application modernization – and operational intelligence for real-time insights (such as churn prediction, credit scoring or fraud detection).
  • Supports a distributed and scalable architecture, where the dataset for each business entity is managed in its own Micro-Database™. This allows the platform to concurrently manage billions of Micro-Databases with near-real-time data movement, including the ability to apply in-flight data transformations.
  • Offers dynamic data virtualization, in which some data can be accessed from the source, via virtualization, and other data can be accessed from a Micro-Database, which is asynchronously synced with the sources. This ability to decide which data will be stored physically, and which will be virtualized, provides advantages in terms of performance optimization and load management on source systems.

Chapter 11


In this paper, we define data integration, discuss its trends and drivers, inner workings, types, use cases, benefits, challenges, and requirements – pretty much the same material covered by everybody else. 

But we go beyond the standard descriptions, by introducing an entirely new concept: integrating data via business entities (in the form of data products). This approach has tremendous advantages, especially in terms of data quality, scale, and speed. 

Finally, we reveal a commercially available Data Product Platform, already in use by some of the world’s largest enterprises, where data integration is a core component. The platform debuted in the 2022 Gartner Magic Quadrant for Data Integration Tools report, with excellent reviews.
