This is the first of a two-part series that will examine trends and market drivers in data warehousing and business intelligence for the second half of 2009 and, just as important, what IT directors, managers and executives should do about them.
Column-oriented analytic databases are making waves. In brief, the column-oriented database decomposes rows into individual data elements and, for each column, stores a single occurrence of each distinct value no matter how many rows share it. Since much commercial business data contains redundancies, this method of decomposition intrinsically shrinks the amount of storage, even prior to the application of compression algorithms. However, the interface to the data is still standard SQL, and the database model is declarative, so the user does not have to manage pointers or explicit navigation. This has given impetus to products such as Sybase IQ, Sand Nucleus, Vertica and Alterian, as well as add-ons like BMMSoft, which layers metadata and functionality on top of Sybase IQ. Relative newcomers such as ParAccel and InfoBright are redefining what is possible with column-oriented parallel processing and data mining of metadata, respectively.
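To make the decomposition concrete, here is a minimal sketch in Python. It is an illustration only and does not reflect how any of the products named above implement storage internally. The function names (to_columns, get_row) and the sample sales rows are hypothetical. Each column keeps one copy of every distinct value plus a small integer code per row, which is why redundant business data shrinks before any compression is applied.

```python
from typing import Any


def to_columns(rows: list[dict[str, Any]]) -> dict[str, dict[str, list]]:
    """Decompose rows into per-column stores (hypothetical helper).

    Each column holds `values` (every distinct value, stored once) and
    `codes` (one integer per row pointing into `values`).
    """
    columns: dict[str, dict[str, list]] = {}
    for name in rows[0]:
        values: list = []            # distinct values, one occurrence each
        index: dict[Any, int] = {}   # value -> position in `values`
        codes: list[int] = []        # per-row reference to a distinct value
        for row in rows:
            v = row[name]
            if v not in index:
                index[v] = len(values)
                values.append(v)
            codes.append(index[v])
        columns[name] = {"values": values, "codes": codes}
    return columns


def get_row(columns: dict[str, dict[str, list]], i: int) -> dict[str, Any]:
    """Reassemble row i from the column stores."""
    return {name: col["values"][col["codes"][i]] for name, col in columns.items()}


# Made-up sample data with the kind of redundancy typical of business records.
rows = [
    {"region": "East", "product": "Widget", "qty": 10},
    {"region": "East", "product": "Gadget", "qty": 10},
    {"region": "West", "product": "Widget", "qty": 5},
    {"region": "East", "product": "Widget", "qty": 10},
]
columns = to_columns(rows)
print(columns["region"])   # "East" is stored once although three rows use it
print(get_row(columns, 2))
```

The user never sees the codes; in a real product the SQL layer reassembles rows on demand, much as get_row does here.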
Data warehousing appliances
The data warehousing appliance market is still going strong. Offerings from Dataupia, Greenplum and Netezza continue to disrupt short lists, getting consideration alongside standard relational databases. Keeping up with the new appliance-like products and reference architectures is now a full-time job. At last count, Oracle is participating in two: partnering with HP (and other hardware providers) in its Exadata accelerator (distinct from an "appliance," see recommendations below); and the Oracle Optimized Data Warehouse for Sun, which was in place even before Oracle's acquisition of Sun. This acquisition will give Oracle a complete technology stack onto which to "appliance-ize" its flagship relational database.
Cloud computing and BI
Cloud computing has matured quickly with the new-age players such as Amazon, Google and Salesforce.com marching on the more-established IT vendors. Not to be outdone, Kognitio is promoting data warehousing as a service (DaaS) through U.K.-based service provider 2e2. The battle between terminologies is ongoing. Greenplum is touting its Enterprise Data Cloud (EDC) warehousing platform at Fox Interactive Media (FIM), where a highly distributed data warehouse undergirds click-stream processing and the analytics of social networking (i.e., MySpace). For its part, Aster Data Systems has branded "front-line data warehousing" using Google's MapReduce functionality and got Gartner to discuss it, thereby providing credibility and buzz.
The data cloud
In an entirely separate domain, the data cloud will be significant in healthcare and the public sphere. The Nationwide Health Information Network (NHIN) lies on the critical path to health information exchange among payers, providers and administrators. This is an important use case in which the data is highly distributed and computationally intensive to process, and in which the exchange is a prerequisite for an electronic medical record (EMR) and for healthcare analytics in support of pay-for-performance. However, the cloud infrastructure will ultimately be as important within the multi-divisional enterprise as it is between enterprises.
Data warehouse tips and recommendations
Do not confuse an appliance with an accelerator. An appliance replaces a data mart or data warehouse; an accelerator is installed in addition to the existing system, which does not go away. The latter are getting traction at the back end of SAP Business Warehouse systems for caching frequently executed resource-intensive queries. Oracle RAC also seems to benefit from the accelerator trend.
Column-oriented databases can nearly pay for themselves. Depending on the scenario, organizations can save significantly on storage technology because the column-oriented database intrinsically shrinks the amount of space required to persist the data. This occurs even prior to the application of formal compression algorithms. Caution is required, however, since your mileage may vary. Also, few if any enterprises have made them the central hub in a hub-and-spoke architecture. Still, the order-of-magnitude gains in price/performance demonstrated by some vendors (e.g., ParAccel 6/21/2009 benchmark [www.tpc.org]) earn them a place on the short list along with appliances and standard relational data warehouses.
If you are considering a data warehouse appliance, perform a readiness assessment. If your firm has limited (or no) experience with a given technology, do your homework. Tell the truth about how capable the firm is of bringing new technology on board. New technology often generates new roles and responsibilities. Innovations in performance may enable new possibilities in business processing. The integration of technology, people and processes requires planning, or the risk of acquiring and installing "shelf ware" becomes a reality. This applies to integrating any new technology.
The hub-and-spoke architecture continues to be the most flexible. The general rule of data integration is to minimize costs by minimizing the number of system interfaces to support and maintain. Point-to-point is the most inefficient. In this scenario, column-oriented analytic databases and appliances will be the nodes at the ends of the spokes, not the central hub, which is where the enterprise data warehouse on the standard relational database will continue to be critical.
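The arithmetic behind that rule can be sketched in a few lines of Python. The assumptions are ours, not the article's: in the point-to-point worst case every system exchanges data with every other system, each pair counts as one interface, and in hub-and-spoke each system needs only one interface, to the hub. The function names are hypothetical.

```python
def point_to_point_interfaces(n_systems: int) -> int:
    """Worst case: every pair of systems has its own interface."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_interfaces(n_systems: int) -> int:
    """Each system connects once, to the central hub."""
    return n_systems


for n in (5, 10, 20):
    print(n, point_to_point_interfaces(n), hub_and_spoke_interfaces(n))
# 5 systems:  10 interfaces vs 5
# 10 systems: 45 interfaces vs 10
# 20 systems: 190 interfaces vs 20
```

Point-to-point grows with the square of the number of systems, while hub-and-spoke grows linearly, which is why the gap in maintenance cost widens as marts, appliances and column stores are added.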
Manage column-oriented databases and data warehouse appliances by the same process used for functional data marts. Although analytic databases (such as ParAccel) are offering high-availability functions as analytic applications become ever more mission-critical, the service-level agreement is still less rigorous than that of run-your-business transactional systems.
Divide and conquer. For handling large data volumes, the proven path to scalability lies through parallel processing: multiple servers, each with its own storage, connected by a high-performance network and presented to the end user as a single system image. Other innovations are coming around data caching in large in-memory address spaces. They will reduce or eliminate disk I/O, relying on asynchronous but transactionally rigorous write operations to the storage area network.
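The divide-and-conquer pattern can be illustrated with a small Python sketch. It is a toy under stated assumptions: worker threads stand in for separate servers, in-memory lists stand in for each server's own storage, and the function names (partition, local_sum, parallel_sum) are hypothetical. No vendor's actual architecture is implied.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Hash-partition (key, amount) rows so each node owns its own slice."""
    nodes = [[] for _ in range(n_nodes)]
    for key, amount in rows:
        nodes[hash(key) % n_nodes].append((key, amount))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node, against its own data only."""
    totals = Counter()
    for key, amount in node_rows:
        totals[key] += amount
    return totals


def parallel_sum(rows, n_nodes=4):
    """Coordinator: fan the query out, then merge the partial results."""
    nodes = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)   # Counter.update adds the counts together
    return dict(merged)


# Made-up sample data: total sales by region.
sales = [("East", 10), ("West", 5), ("East", 7), ("North", 3), ("West", 1)]
print(parallel_sum(sales))
```

The caller sees one function and one answer, which is the "single system image"; the partitioning and merging stay hidden behind it.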
Cloud computing and the associated "data cloud" fit distributed data and distributed enterprises. Startups and those given to prototyping will also benefit from the cloud model. Cloud computing takes Software as a Service (SaaS), grid computing and virtualization to a new level, abstracting the entire data center behind an interface that enables the retail purchase of computing resources. For many scenarios, however, the cloud model is not appropriate. Ultimately, the data cloud will be as significant within the multi-divisional, distributed enterprise as it is between enterprises.
About the Author
Lou Agosta, Ph.D., is an independent industry analyst specializing in data warehousing, data quality, data mining, and business intelligence. Keyword: data. His book, The Essential Guide to Data Warehousing, is published by Prentice Hall PTR. Lou can be reached at LAgosta@acm.org.