
Data integration vs. ETL in the age of big data

By Andy Hayler

Getting a consistent view of business performance across a large enterprise is a thorny problem. Often, global corporations lack a single definitive source of data related to customers or products. And that makes it difficult to answer even the simplest questions. Data integration could be the answer.

Data integration provides a unified view of data that resides in multiple sources across an organization. Extract, transform and load (ETL) technology was an early attempt at data integration.

With ETL, data is extracted from multiple source transaction systems, transformed and loaded into a single place, such as a corporate data warehouse. The extract and load parts are relatively mechanical, but the transform portion isn't as easy. For this to work, you need to define business rules that explain which transformations are valid.
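To make that concrete, here is a minimal ETL sketch in Python against an in-memory database. The table names, sample rows and currency-conversion rule are all illustrative assumptions, not details from any real system.

    import sqlite3

    # Minimal ETL sketch against an in-memory database. The table names
    # (orders_eu, fact_sales) and the currency rule are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders_eu (customer TEXT, amount REAL);
        CREATE TABLE fact_sales (customer TEXT, amount_usd REAL);
        INSERT INTO orders_eu VALUES ('Acme GmbH', 100.0), ('Beta SA', 250.0);
    """)

    # Extract: pull raw rows from the source transaction system.
    rows = conn.execute("SELECT customer, amount FROM orders_eu").fetchall()

    # Transform: apply a business rule -- convert euros to dollars.
    EUR_TO_USD = 1.10  # assumed rate, for illustration only
    rows = [(customer, round(amount * EUR_TO_USD, 2)) for customer, amount in rows]

    # Load: append the conformed rows into the warehouse fact table.
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()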

The main distinction between ETL and data integration is that data integration is broader in scope. It can include data quality and the process of defining master reference data, such as corporatewide definitions of customers, products, suppliers and other key information that gives context to business transactions.

Data classification and consistency

Let's look at one example. A large operating company might need several levels of classifications for products and customers to segment marketing campaigns. A smaller subsidiary of the same company could do this with a simple hierarchy of products and customers. In this example, the broader organization may classify a can of cola as a carbonated drink, which is a beverage, which is part of food and drink sales. However, the smaller subsidiary may lump the same cola can into food and drink sales without the intermediate classifications. This is why there needs to be consistency of classification -- or at least an understanding of what the differences are -- to get a global view of overall companywide sales.
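To sketch the problem in code, the Python snippet below represents the two classification schemes from the cola example. The hierarchies and product code are invented for illustration; note that sales can only be compared at the deepest level both schemes share.

    # Two classification schemes for the same product, per the cola example.
    # Dictionaries are illustrative stand-ins for real product hierarchies.
    group_hierarchy = {
        "cola_330ml": ["carbonated drinks", "beverages", "food and drink"],
    }
    subsidiary_hierarchy = {
        "cola_330ml": ["food and drink"],  # no intermediate levels
    }

    def top_level(hierarchy, product):
        # The broadest category is the only level both schemes share.
        return hierarchy[product][-1]

    # Companywide sales can only be compared at that common level.
    assert top_level(group_hierarchy, "cola_330ml") == \
           top_level(subsidiary_hierarchy, "cola_330ml")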

Unfortunately, the simple act of knowing who you're doing business with isn't always that simple. For example, Shell U.K. is a subsidiary of the oil giant Royal Dutch Shell. Companies like Aera Energy and Bonny Gas Transport are also Shell entities, some co-owned with other investors. So, business transactions with those companies need to roll up into a global view of Shell as a customer, even though the relationship isn't obvious from the company names.

A vice president at a famous investment bank once told me they had no idea how much business they did on a global basis with, for example, Deutsche Bank, let alone whether that business was profitable, because the answers to such questions were buried in the systems of the bank's various global divisions.
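A common remedy is a master mapping of legal entities to their ultimate parent so that transactions can be rolled up. Here is a minimal sketch of the idea; the entity names follow the Shell example above, but the mapping and figures are invented.

    from collections import defaultdict

    # Hand-maintained mapping of legal entities to their ultimate parent.
    # Entity names follow the Shell example; the figures are invented.
    parent_of = {
        "Shell U.K.": "Royal Dutch Shell",
        "Aera Energy": "Royal Dutch Shell",
        "Bonny Gas Transport": "Royal Dutch Shell",
    }

    transactions = [
        ("Shell U.K.", 1_200_000),
        ("Aera Energy", 800_000),
        ("Bonny Gas Transport", 500_000),
    ]

    totals = defaultdict(float)
    for entity, amount in transactions:
        # Fall back to the entity itself when no parent mapping exists.
        totals[parent_of.get(entity, entity)] += amount

    print(dict(totals))  # {'Royal Dutch Shell': 2500000.0}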

Data quality issues

As noted, getting the ETL transformation step right means defining business rules that lay out which transformations are valid -- for example, how to aggregate sales transactions, or how to map a database field where "male" is used to another where "m" denotes a male customer. Technologies were developed to help with this process.
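Such a rule can be expressed declaratively. Below is a minimal sketch, assuming one source system stores gender as "male"/"female" while the warehouse expects "m"/"f"; the mapping and field values are illustrative.

    # Declarative mapping rule: conform a source system's gender codes
    # to the warehouse convention. Field values are illustrative.
    GENDER_MAP = {"male": "m", "female": "f"}

    def conform_gender(value):
        # Pass unknown values through unchanged so they can be flagged
        # for review rather than silently dropped.
        return GENDER_MAP.get(value.strip().lower(), value)

    assert conform_gender("Male") == "m"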

It turns out that achieving integrated data takes more than deciding between ETL and broader data integration. Consider data quality. What if there are duplicates in customer or product files? For one project I worked on, nearly 80% of the apparent customer records were duplicates. This meant the company had just one-fifth the number of business customers it thought it had.

In materials master files, duplicate rates of 20% to 30% are the norm. Such anomalies should be eliminated when the data is aggregated for a corporate overview.
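Duplicate detection typically relies on fuzzy matching of normalized names. Here is a minimal sketch using only the Python standard library; real master data management tools apply far richer matching rules, and the similarity threshold here is an assumption.

    from difflib import SequenceMatcher

    customers = ["Acme Ltd", "ACME Limited", "Acme Ltd.", "Globex Corp"]

    def normalise(name):
        # Strip punctuation, case and common legal suffixes before comparing.
        name = name.lower().replace(".", "").replace(",", "")
        for suffix in (" limited", " ltd", " corp", " inc"):
            name = name.removesuffix(suffix)
        return name.strip()

    def likely_duplicates(names, threshold=0.85):
        pairs = []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold:
                    pairs.append((a, b))
        return pairs

    # The three Acme variants match one another; Globex does not.
    print(likely_duplicates(customers))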

Ever-increasing volumes of data

Even though data integration has its advantages for large corporations, it's not without its challenges. The amount of unstructured data that corporations produce continues to grow.

And, because data is held in different formats -- sensor data, web logs, call records, documents, images and video -- ETL tools can be ineffective, because they weren't designed with these formats in mind. They also struggle with the sheer volumes of big data. Technologies such as Apache Kafka attempt to address this by streaming data in real time, overcoming limitations of earlier message bus approaches to real-time data integration.
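As a rough illustration of the streaming approach, this sketch publishes a single event with the kafka-python client. The broker address and topic name are assumptions, not details from the article.

    import json
    from kafka import KafkaProducer

    # Publish each event as it arrives, rather than batching it into a
    # periodic ETL load. Broker address and topic name are assumptions.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("sensor-events", {"sensor_id": 42, "temperature_c": 21.5})
    producer.flush()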

The world of data integration has evolved since the early days of ETL. But it needs to continue its evolutionary path to keep pace with the changing needs of organizations and the big data revolution. 

11 Oct 2019
