What an exciting time to be a business intelligence or data warehouse implementer -- so much new and groundbreaking technology from which to choose! But with technological innovation comes the inevitable disruption to existing architectures, techniques and traditions. Data warehouse and BI environments aren't immune to that.
A next-generation data warehouse and BI architecture is emerging from all the tumult. This new architecture includes the need for:
- Advanced analytics capabilities, such as statistical and predictive analyses, real-time analysis on real-time data, and sophisticated data visualization.
- Enhanced management of new and unusual data sources (the overly hyped big data) through new concepts such as data refineries (aka data lakes or hubs) and the use of data virtualization or data blending tools to augment standard extract, transform and load approaches to data integration.
- New deployment options such as the cloud, mobile devices, and integrated hardware and software appliances.
Added to the need for new technologies is the increasing pressure on enterprises to generate more immediate business insights while, at the same time, reducing the overall cost of these expanding environments.
It's no wonder that many technologists are confused about how and where these new capabilities fit into their existing BI and data warehouse environments. Does the enterprise data warehouse (EDW) still have a role? Where does that Hadoop "thingie" fit in? How can they satisfy the enterprise's increasing need for real-time analytics? To answer these questions, we present what we call the eXtended Data Warehouse architecture (XDW).
Data warehouse still a BI workhorse
Let's address the first question: Does the EDW still have a place in the new BI architecture? The answer is a resounding yes -- at least for the foreseeable future. Its role is shifting somewhat: it is becoming the source of established, standardized production reports, comparisons and analytics. But the data warehouse is still the best source of integrated, high-quality data for critical or sensitive BI analyses that help meet financial, compliance or regulatory requirements, and for standard BI dashboard components such as key performance indicators and other business metrics used by operations, marketing, sales and other departments. Nothing beats this workhorse for those important BI deliverables.
However, the traditional data warehouse architecture shown in Figure 1 can't do everything that's needed these days. The XDW recognizes that the EDW has limitations -- especially when dealing with new types of data, experimental or investigative analyses, or real-time data analysis.
Now let's answer the second question: Where do Hadoop and other new technologies fit in? The innovations in both relational and non-relational data management platforms demand that we move outside of the traditional EDW and add new components to our BI architecture.
New tools provide the means to investigate data
Figure 2 shows the three main components that we've identified for extending the EDW environment to support next-generation BI capabilities. The first is the investigative computing platform. That's where the innovations in relational software and Hadoop technology really shine. The platform is used for exploring data and developing new analyses and analytical models -- potential applications include data mining; cause-and-effect analysis; what-if exploration; pattern analysis; and general, unplanned investigations of data.
Some organizations may use the investigative computing platform only as a simple sandbox for experimentation; others may create a full analytics platform or use it as an extension of the data refinery (described below). This new component lets companies analyze and experiment with large volumes of data freely and quickly. The output from these activities could then be used by the EDW, a real-time analysis engine in the operational environment or standalone line-of-business applications.
Raw data requires some refining
The second new component is the data refinery. Its purpose is to ingest raw, detailed data in batch mode and/or in real time from new and varied sources of big data -- sensors, social media, radio frequency identification (RFID) tags, etc. -- and load it into a managed relational or non-relational data store. Just like an oil refinery turns crude oil into petroleum products, the data refinery distills raw data into useful and usable information that it then distributes to the investigative computing platform or the EDW. The refinery often requires more flexible data governance policies on data security, privacy, quality, archiving and destruction than those found in the data integration platform within a traditional EDW architecture.
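To make the refinery's ingest, distill and distribute steps concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the XDW definition: the record fields (`sensor_id`, `value`, `source`), the function names, and especially the routing rule, which sends records from a set of trusted sources to the EDW and everything else to the investigative computing platform.

```python
from typing import Any, Dict, List, Optional, Set

Record = Dict[str, Any]


def refine(raw: Record) -> Optional[Record]:
    """Distill one raw reading into a usable record, or return None if unusable.

    The field names are hypothetical; a real refinery would apply its own
    cleansing and governance rules here.
    """
    if "sensor_id" not in raw or "value" not in raw:
        return None
    try:
        value = float(raw["value"])
    except (TypeError, ValueError):
        return None
    return {
        "sensor_id": str(raw["sensor_id"]).strip(),
        "value": value,
        "source": raw.get("source", "unknown"),
    }


def route(record: Record, trusted_sources: Set[str]) -> str:
    """Choose a destination (assumed rule: trusted sources feed the EDW)."""
    return "edw" if record["source"] in trusted_sources else "investigative"


def run_refinery(raw_batch: List[Record], trusted_sources: Set[str]) -> Dict[str, List[Record]]:
    """Ingest a batch, refine each record and distribute it to a destination."""
    out: Dict[str, List[Record]] = {"edw": [], "investigative": [], "rejected": []}
    for raw in raw_batch:
        record = refine(raw)
        if record is None:
            out["rejected"].append(raw)  # keep the raw input for later inspection
        else:
            out[route(record, trusted_sources)].append(record)
    return out
```

In practice the destinations would be data stores rather than in-memory lists, and the routing decision would follow the organization's governance policies instead of a simple source check.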
The final extension to that architecture answers our third question, about how to provide real-time analytics capabilities. This component consists of a real-time analysis platform found within the operational environment. Its purpose is to support the development and/or deployment of real-time analytical applications, such as Web event analysis, traffic flow optimization and risk analysis. Because the analytical models and rules embedded in the real-time analysis platform may originate in any of three places -- the EDW, the investigative computing component or the real-time platform itself -- there must be tight integration and freely flowing data between all these components.
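As a rough illustration of that integration, the Python sketch below applies a rule that was developed elsewhere (for example, on the investigative computing platform) to events as they arrive in the operational environment. The `RiskRule` fields, the event shape (`account`, `amount`) and the thresholds are all hypothetical, chosen only to show the hand-off from model development to real-time deployment.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class RiskRule:
    """Hypothetical rule produced offline and deployed to the real-time platform."""
    max_amount: float
    max_events_per_account: int


def score_events(events: Iterable[Dict[str, Any]], rule: RiskRule) -> Iterator[Dict[str, Any]]:
    """Apply the rule to each event as it arrives and yield only flagged events."""
    seen: Dict[str, int] = {}  # running event count per account
    for event in events:
        account = event["account"]
        seen[account] = seen.get(account, 0) + 1

        reasons: List[str] = []
        if event["amount"] > rule.max_amount:
            reasons.append("amount")
        if seen[account] > rule.max_events_per_account:
            reasons.append("frequency")

        if reasons:
            yield {**event, "reasons": reasons}
```

Because `score_events` is a generator, it can consume an unbounded stream one event at a time. Swapping in a new `RiskRule` changes the behavior without touching the scoring loop, which is the kind of separation the architecture relies on.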
Figure 3 puts it all together in the new XDW architecture. Existing and new data management, BI and analytics technologies can reside side by side, supporting their designated purposes. Each component in the expanded BI architecture is optimized to suit its particular capabilities and functions. Going forward, it's unlikely that this overall architecture will change much. The requirements for production-ready, investigative and real-time analytics capabilities should remain relatively constant. However, the technologies used to provide those capabilities likely will change as new functionality becomes available and improvements are made in the technological underpinnings.
About the authors
A thought leader, visionary and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence and the architectures that support those initiatives. She is also the founder of the Boulder BI Brain Trust, a consortium of independent analysts and consultants.
Colin White is founder of BI Research and president of DataBase Associates Inc. He is well known for his knowledge of data management, information integration and business intelligence technologies and how they can be used to build smart and agile businesses.