Applying agile methods to data warehouse projects

Agile development processes can take a lot of the pain out of building data warehouses and enable project teams to deliver functionality, and business value, on a rolling basis.

Rapidly gaining in popularity, the Agile approach to data warehousing solves many of the thorny problems typically...

associated with data warehouse development—most notably high costs, low user adoption, ever-changing business requirements and the inability to rapidly adapt as business conditions change. The Agile approach can be used to develop any analytical database, so let’s begin with two familiar definitions:

A Data Warehouse (DW) is simply a database that contains integrated and homogenized information from one or more sources brought together to support analysis and reporting. These sources can be your internal online transactional processing (OLTP) systems such as finance, accounting, sales, marketing, payroll, supply chain, etc., or external sources such as supplier files, purchased marketing lists, Facebook, Twitter or census data, etc. 

In addition to the data warehouse, you may also be using additional types of databases for analysis and reporting. The most common types include data marts and operational data stores (ODSs). 

Business Intelligence (BI) refers to the different mechanisms used to leverage and interact with the data stored in the databases. Types of BI applications include query, reporting, analysis, dashboards, statistics, predictive models and key performance indicators (KPIs), etc.  

These two concepts of BI and data warehousing are depicted in Figure 1. This article focuses on applying Agile methods to the creation of the databases. 

In order to simplify the discussion, I will use the generic term analytical database to refer to all types of data stores—including data warehouse, data mart, operational data store, etc. 

Why Agile? 

There are several key reasons why Agile methods are well suited to building analytical databases. Among them are business-driven vs. data-driven development, and reduced risk and complexity. 

Agile Methods Take a Business-Driven Approach. Simply stated, creating analytical databases is complex, time-consuming and oftentimes overly expensive, especially when traditional data-driven methods are used. A fundamental truth of BI and data warehousing is that data integration and homogenization account for 70% to 80% of the project budget and an even higher percentage of the risk. 

Let’s take a simple example. Assume your organization has four internal OLTP systems and one external data source. On average, each system has 30 database tables, and each table contains 30 columns. This means that: 

(4 OLTP systems + 1 external system) 
x 30 tables x 30 columns 
= 4,500 data elements 

In a data-driven approach, it’s not uncommon to want to integrate and homogenize most—if not all—of the data before the first query or report can be written. This means that 70% to 80% of the project budget will be expended before any business value can be realized. Similarly, integrating thousands of fields can take upwards of 12 months to complete. What this means is that your goal should be to minimize the amount of effort associated with data integration and homogenization. 

In an Agile delivery model, only data needed to answer specific business questions or to solve specific business problems is sourced. (These need statements are captured in a series of business “stories.”) So instead of first trying to “boil the ocean” via a massive data integration effort, Agile practitioners work with the business community to define the hundred or so data elements that drive performance. This means that the business will be in a position to receive value much more quickly—in weeks or months rather than quarters or years. 

Agile Methods Reduce Risk and Produce Systems with High Adoption Rates. Organizations that apply traditional waterfall methods to BI/ DW projects accrue unnecessary risk and may find out what they’ve created does not satisfy the business’s needs. Waterfall methods mean it’s all or nothing. In other words, design cannot begin until all of the requirements are defined, and coding cannot begin until design is complete. This means that the project takes on ever-increasing levels of risk and that business value is delivered at the end of the project (see Figure 2, page 10). 

It’s not unusual to find that once an analytical database has been deployed using a data-driven and waterfall approach, it suffers from low adoption and usage. The primary culprit seems to be that business needs and priorities will have shifted between the time the requirements were originally defined and when the analytical database was deployed. 

For the most part, the delivery team’s efforts are seen as a black hole. What ends up being delivered is based on assumptions and interpretations of the requirements and may not contain the information the business really needs. Reality hits when that first report is written, oftentimes making the data integration effort completely moot. 

When Agile methods are applied, value can be shown on a recurring basis. The key tasks of database design—data quality remediation, and data integration and homogenization— are broken into short, timeboxed and scope-boxed delivery cycles, or “sprints,” that generally last two to four weeks each. These data-focused tasks are paired with prototyping in the BI layer, allowing the business to interact with the data multiple times, helping to assure that the analytical database truly contains useful information (see Figure 3, page 11). 

The use of delivery sprints keeps business value at the forefront and drives project risk down to a mini-mum. At the end of each two- to four-week sprint cycle, the delivery team is required to demonstrate what they have produced, making their work much more visible to the business and allowing for midcourse corrections because another tenet of Agile is to “fail quickly.” 

Business Value Delivered Quickly

Agile methods can easily be applied to the building of analytical databases— data warehouses, data marts, operational data stores, etc. When the work associated with designing the databases and integrating the data is driven from the business side, project costs can be held to a minimum and business value can be delivered more quickly. By organizing the work into short sprint cycles, everyone can be assured that the expensive and risky part of the project— data integration and homogenization— will surely meet the business’s ever-increasing appetite for impactful and actionable information rather than suffer from misinterpreted requirements and low business adoption. 

About the author:
Jim Gallo is a senior business intelligence and data warehousing consultant with Information Control Corp. Gallo has led BI projects for Fortune 1000 companies, federal and state governments, and international clients. In addition, he has published numerous articles on the practical realities of BI and data warehousing and is a regular conference speaker. He can be contacted at [email protected]

Dig Deeper on Big data analytics