The terms "burning fire," "advance planning," "all together" and "invited guests" are all pleonasms. A pleonasm is an excessive use of words or redundant phrasing. We can add another one that has evolved over the past few years: "logical data warehouse." It qualifies as a pleonasm because present-day data warehouses can be developed as logical data warehouses as a matter of course. Let me explain what I mean.
"Logical data warehouse" was introduced as a term by Gartner. Since then, it has been used by many others, including me. The idea is that a data warehouse doesn't have to be one physical database. It can be a heterogeneous set of data sources that each contain a fragment of the data end users need for business intelligence, reporting and analytics applications, while presenting itself to those users as a single data source. So, the logical data warehouse is a system architecture that behaves as if all the data were stored in one big database.
That is different from how the data warehouse was introduced in the 1990s. For example, Barry Devlin's excellent book Data Warehouse: From Architecture to Implementation includes the following definition:
A data warehouse is simply a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.
Clearly, this definition indicates that the data warehouse is one store (read: database). Ralph Kimball's definition also hints at that:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
Several alternative definitions -- for example, Wikipedia's definition -- also state or imply that a data warehouse is a single database.
Historical data historically kept together
Over the last 20 years, most data warehouses were implemented according to these definitions -- with all the enterprise data for reporting and analytics stored in one physical database. Companies went that route primarily because of the hardware and software technology that was available to them. It was necessary to store all the data in one place in order to develop a fast and efficient data warehouse. Designers did their best with the limitations of contemporary technologies.
But should we still put all our data eggs in one basket? Given the increasing popularity of big data, the growing share of data stored in the cloud and the growing need to analyze unstructured data, the one-database standard is becoming more and more impractical.
In addition, hardware and software technology has progressed. Technologically, there is no longer any need to store everything in one database to present an integrated view to users. For example, mature data virtualization servers now offer advanced data federation and on-demand data integration capabilities that can make multiple data sources look like one big data warehouse, regardless of whether they're SQL databases, Hadoop clusters, NoSQL systems, cloud-based applications or Web services.
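To make the federation idea concrete, here is a minimal sketch in Python. It is not how any particular data virtualization server works; it only assumes that SQLite's ATTACH DATABASE can stand in for one. The two in-memory databases, the table names and the customer_revenue view are all hypothetical, chosen for illustration. The point is that the report query sees one integrated view and never needs to know the data lives in two separate stores.

```python
import sqlite3


def build_federated_view() -> sqlite3.Connection:
    """Create two separate stores and one integrated view over them."""
    conn = sqlite3.connect(":memory:")

    # Each ATTACH creates an independent in-memory database. They stand in
    # for separate data sources (assumption: real sources would be remote).
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS orders_db")

    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders_db.orders (customer_id INTEGER, amount REAL)")

    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO orders_db.orders VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 75.0)],
    )

    # In SQLite, only a TEMP view may reference tables in several attached
    # databases. This view plays the role of the integrated warehouse view.
    conn.execute(
        """
        CREATE TEMP VIEW customer_revenue AS
        SELECT c.name AS customer, SUM(o.amount) AS revenue
        FROM crm.customers AS c
        JOIN orders_db.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        """
    )
    conn.commit()
    return conn


def revenue_report(conn: sqlite3.Connection) -> list:
    """Query the integrated view as if it were a single database table."""
    return conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for customer, revenue in revenue_report(build_federated_view()):
        print(customer, revenue)
```

A real data virtualization server would also handle query pushdown, caching and heterogeneous source types, none of which this sketch attempts.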
I think we can all agree that a data warehouse should supply users with the right data at the right time and at the right quality level to support effective business decision making. It's of secondary importance where all that data is stored, as long as the data is easily accessible and an integrated view of it is presented. That makes the data warehouse a logical concept, not a physical one.
Name game on data warehouses
But what does that mean for defining what a data warehouse is? Let's look at Bill Inmon's popular definition:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Note that Inmon refers to a collection of data. He doesn't say it has to be stored in one database. It's very easy to modify his definition a little to explicitly express that a data warehouse is a logical concept without fundamentally changing its meaning:
A data warehouse is a system that presents a subject-oriented, integrated, time-variant and reproducible collection of data in support of management's decision-making process.
In this altered definition, a data warehouse may be implemented with one database, but it could also be implemented with many data sources. We can make similar minor changes to most other definitions of a data warehouse to get a comparable effect. If you agree with this, then the term "logical data warehouse" becomes a pleonasm, and "data warehouse" alone is sufficient to convey the inherent logical concept.
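The altered definition can be read as an interface with interchangeable implementations. The sketch below is one hypothetical way to express that in Python; the class names, the rows method and the "sales" subject are assumptions made for illustration and do not come from any product or standard. A consumer such as total_sales depends only on the interface, so it cannot tell whether one store or several sources sit behind it.

```python
from typing import Dict, Iterable, List, Protocol

Row = Dict[str, object]
Source = Dict[str, List[Row]]


class DataWarehouse(Protocol):
    """The logical concept: something that presents integrated data."""

    def rows(self, subject: str) -> List[Row]:
        ...


class SingleStoreWarehouse:
    """Classic implementation: all data lives in one store."""

    def __init__(self, store: Source) -> None:
        self._store = store

    def rows(self, subject: str) -> List[Row]:
        return list(self._store.get(subject, []))


class MultiSourceWarehouse:
    """Alternative implementation: data is combined from many sources on demand."""

    def __init__(self, sources: Iterable[Source]) -> None:
        self._sources = list(sources)

    def rows(self, subject: str) -> List[Row]:
        combined: List[Row] = []
        for source in self._sources:
            combined.extend(source.get(subject, []))
        return combined


def total_sales(warehouse: DataWarehouse) -> float:
    """A consumer that only knows the interface, not the storage layout."""
    return sum(float(row["amount"]) for row in warehouse.rows("sales"))
```

Both classes satisfy the same protocol, which mirrors the argument that where the data is stored is an implementation detail.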
In the world of technology, it's not uncommon to update the definitions of concepts. For example, take the word "camera." There was a short period of time in which we used the phrase "digital camera," but nowadays almost every camera is a digital camera. The same goes for television sets: We currently use the term "smart TV," but before long that will become a pleonasm as well. As technology evolves, definitions and terms must adapt accordingly.
Therefore, my recommendation is to stick to plain "data warehouse." Forget about the outdated idea that the data warehouse must be one big database. What's important is that all the data is presented as one big database. And the technology now exists to combine various data sources and make them look like one.
About the author:
Rick van der Lans is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy; email him at [email protected].