
What makes up an analytics pipeline?

Analytics pipelines were traditionally hidden away, but they're changing as more organizations focus on agility for their data. Learn what makes up a successful analytics pipeline.

By Lisa Morgan and Ben Cole, Executive Editor

Published: 21 Nov 2022

In today's data-driven economy, companies can't afford to have data-related issues, but many still do. Despite the exploding volume of data organizations continue to amass, they're still having trouble accessing and using that data.

To accelerate the speed and accuracy of data analytics insights, data engineers are constructing data analytics pipelines -- or data pipelines -- to operationalize data.

What is a data analytics pipeline?

An analytics pipeline streamlines data flow to improve the speed and quality of insights. As with a continuous integration/continuous delivery (CI/CD) pipeline used by a DevOps team, the speed advantage of an analytics pipeline hinges on automating tasks.

"If the owner of a finance group asks me for a cash flow report, I may have to extract the data manually [and] update that record myself," said Dan Maycock, principal of engineering and analysis at hop farm Loftus Labs. "When I'm manually extracting data every time it's requested, it doesn't happen as frequently. If I have a pipeline, that's happening automatically."

According to Pieter VanIperen, managing partner at PWV Consultants, a process modernization consultancy, other areas that need at least some automation in the analytics pipeline include data governance, data quality, data usability and categorization, depending on how advanced the pipeline is.

Having more than one analytics pipeline is common, as each may serve a different purpose. Colleen Tartow, director of engineering at Starburst Data, a distributed SQL query engine platform provider, said data engineering is critical to pipeline function because pipelines are often complex and vary in maturity.

"You could have a straightforward cloud-native pipeline using a modern data stack, or you could have a data center-based infrastructure that requires constant management alongside the actual data pipeline itself," she said.

Maycock uses one pipeline to transport data from its original source to a central repository and another pipeline to transport data from the central repository to a map, BI tool or data model.
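The two-stage arrangement Maycock describes can be illustrated with a minimal sketch. It assumes an in-memory SQLite database stands in for the central repository and that a simple daily cash flow total stands in for what a BI tool would consume. The table name, column names and sample rows are hypothetical and are not taken from any system mentioned in this article.

```python
import sqlite3


def load_to_repository(source_rows, repo):
    """Stage 1: copy raw records from a source system into the central repository."""
    repo.execute(
        "CREATE TABLE IF NOT EXISTS transactions (day TEXT, amount REAL)"
    )
    repo.executemany("INSERT INTO transactions VALUES (?, ?)", source_rows)
    repo.commit()


def build_cash_flow_report(repo):
    """Stage 2: aggregate repository data into the shape a BI tool would read."""
    cursor = repo.execute(
        "SELECT day, SUM(amount) FROM transactions GROUP BY day ORDER BY day"
    )
    return cursor.fetchall()


# Hypothetical source extract: (day, amount) pairs, where negative means cash out.
source_rows = [
    ("2022-11-01", 1200.0),
    ("2022-11-01", -450.0),
    ("2022-11-02", 300.0),
]

repo = sqlite3.connect(":memory:")  # assumption: stands in for a real warehouse
load_to_repository(source_rows, repo)
report = build_cash_flow_report(repo)
print(report)
```

Keeping the two stages as separate functions means either one can be scheduled, monitored or replaced without touching the other, which is one reason organizations often end up with more than one pipeline.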

"In the early 2000s when I started, you were pretty much on your own building and maintaining [pipelines], but that isn't the case anymore," he said.

Chart showing the 5 analytics modes

Other benefits of an analytics pipeline

Analytics pipelines can help organizations achieve higher levels of agility and resiliency, especially when they're built iteratively.

"The idea is that you're iterating on your designs through the canvas on which the pipeline is built. The benefit is higher productivity," said Arvind Prabhakar, CTO of StreamSets, a DataOps platform provider.

Analytics pipelines, like CI/CD pipelines, also provide visibility across the engineering and operations functions, which enables continuous feedback loops, faster iteration and quicker issue resolution. According to Prabhakar, the previous generation of platforms and tooling treated data operations as hidden workloads.

"In this new world of DataOps where every end point, every pipeline is [potentially] the weakest link, you need the ability to constantly monitor and manage because the pipelines themselves are a reflection of how your data architecture is evolving," Prabhakar said.

And cross-functional visibility into the analytics pipeline can help enable process improvements. Data observability makes sure business needs and processes are modeled in the analytics pipeline as well, Prabhakar said.

"These pipelines are not just artifacts of the design choices that data engineers made," he said. "They actually reflect business processes that are engrained in the fabric of the enterprise's data architecture."

Analytics pipeline scalability

Scalability is essential so the data analytics pipeline can adapt to growing data volumes. It is also important to consider how the pipeline will integrate with the existing analytics capabilities in the data architecture.

When building a scalable data analytics pipeline, consider both input data and output data. Knowing the context and volume of the input data helps determine the storage format and the technology used to store it. For output data, consider the end users. Data analysts rely heavily on this information, so the output data must be accessible and transparent for them.

Also consider how much data the analytics pipeline can ingest. Infrastructure must be able to handle a sudden change in data volume, for example, due to business growth. One option is to set up the pipeline in the cloud to allow for further flexibility and, ultimately, scalability.
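One common way to keep ingestion stable when volume changes suddenly is to process records in fixed-size batches, so memory use depends on the batch size and not on the total volume. The sketch below is only an illustration of that idea. The function names `batches` and `ingest`, the default batch size and the sample volumes are all assumptions, and the write to storage is left as a comment.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(records, batch_size=1000):
    """Process records batch by batch and return how many were ingested."""
    total = 0
    for batch in batches(records, batch_size):
        # A real pipeline would write the batch to storage here.
        total += len(batch)
    return total


# Hypothetical volumes: a normal day and a day with a 100x spike.
normal_day = ingest(range(2_500))
spike_day = ingest(range(250_000))
print(normal_day, spike_day)
```

Because `batches` pulls from an iterator, the same code path handles both volumes without loading the full input at once. Cloud deployments extend the same principle by adding workers when the number of batches grows.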

Challenges with creating an analytics pipeline

The point of an analytics pipeline is to expedite the delivery of data, but a common obstacle is the data itself.

"I might have built a pipeline, but I really don't have any more information because the data warehouse or the data lake I built is so poorly governed that it's a swamp," VanIperen said.

He said poor governance can quickly make data unusable. It's essential to understand which data sources matter and tweak them so they can be useful, he said.

The diversity of data sources can also be problematic.

"Every software platform can have its own API and their own data model [because] there's not necessarily a role in software development specifying how data is presented to a data pipeline or an ETL platform," Maycock said. "Being able to connect to and extract data, depending on how foreign that platform is, can be somewhat difficult, as well as being able to access the information in a consistent way."

Another issue organizations face is that no one is responsible for understanding the full inventory of what data is available in-house and from third-party sources. Some argue that's a telltale sign of needing a chief data officer or at least someone responsible for understanding and operationalizing data.

"Ten years ago, the data engineer was expected to know everything, and they were given a big docket which contained all the specifications of the data infrastructures," Prabhakar said. "Now, the data engineer has no clue of where the data is coming from, who owns it [or] where it originated, let alone the schema, structure and semantics."

Also 10 years ago, data engineers and operations personnel often worked in silos. That should no longer be the case, because disconnects between groups create friction that slows value delivery. Cross-functional disconnect can also hurt business operations. For example, if the analytics pipeline starts losing 10% of its data, the downstream analytics results become dubious.

"When you talk about continuous operations, the goal of the pipeline is to establish a tight feedback loop between the data engineers and the operators," Prabhakar said. "You want the pipelines to automatically start raising a flag that something has changed."
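As a rough illustration of a pipeline that raises a flag on its own, the sketch below compares each run against what was expected and reports schema changes and row loss. The function name `check_run`, its parameters and the 5% loss threshold are assumptions chosen for the example, and a production setup would typically send these flags to an alerting system.

```python
def check_run(expected_columns, observed_columns, rows_in, rows_out, max_loss=0.05):
    """Return a list of readable flags; an empty list means nothing was detected."""
    flags = []

    # Schema drift: columns that disappeared or appeared since the last known schema.
    missing = set(expected_columns) - set(observed_columns)
    added = set(observed_columns) - set(expected_columns)
    if missing:
        flags.append(f"missing columns: {sorted(missing)}")
    if added:
        flags.append(f"unexpected columns: {sorted(added)}")

    # Row loss: share of input rows that never reached the output.
    if rows_in > 0:
        loss = (rows_in - rows_out) / rows_in
        if loss > max_loss:
            flags.append(f"row loss {loss:.0%} exceeds {max_loss:.0%}")

    return flags


# Hypothetical run: a new column appeared upstream and 10% of rows were lost.
flags = check_run(
    expected_columns=["day", "amount"],
    observed_columns=["day", "amount", "currency"],
    rows_in=1000,
    rows_out=900,
)
for flag in flags:
    print("FLAG:", flag)
```

Running a check like this after every load gives data engineers and operators the same signal at the same time, which is the tight feedback loop Prabhakar describes.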

Analytics pipelines are essential for any insight-driven organization. When designed and implemented well, they can help a company meet its strategic goals sooner.
