This article originally appeared on the BeyeNETWORK.
Richard Hackathorn and I are currently researching the role of data warehouse appliances, and in our discussions with vendors, the nature of application workloads has been a recurring theme.
In any given application workload, transactions query data, modify it (insert, update, delete), or do both. Application workloads broadly break into two types – business transaction workloads (order entry, shipping, billing, and so forth) that support the daily running of business operations, and business intelligence workloads that monitor, report on, and analyze those operations.
In general, business transaction workloads consist of simple, short-running transactions that query and modify small amounts of data. From a performance perspective, the focus is on workload management and on processing multiple transactions in parallel, rather than on the quality of the product’s relational optimizer.
Some business transaction workloads, however, are more complex. Manufacturing applications are a good example. Bill of material processing and requirements generation can involve very complex workloads; and in these cases, indexing and the database product’s relational optimizer become more important.
At the other end of the application spectrum from business transaction processing is business intelligence, which may involve highly complex queries and very large amounts of data. Here, performance is focused on parallel query processing and on data partitioning. Also, the relational database optimizer plays a much more important role here.
To achieve good business intelligence performance, database administrators typically create more indexes on the data compared with business transaction workloads. These indexes give the optimizer additional access paths to the data, reducing the need to sequentially scan data. The downside of these indexes is that they dramatically increase the size of the database and degrade the performance of data warehouse loading and updating because the indexes have to be maintained. Indexing also makes the job of the database administrator more complex and time-consuming.
A data warehouse appliance overcomes the performance overheads of handling complex queries by moving processing and record selection as much as possible to the back-end disk subsystem. Appliances are highly tuned for sequential processing and often eliminate the need for indexes and sophisticated relational optimization.
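The idea of pushing record selection down to the storage layer can be sketched in a few lines. This is an illustrative sketch only, not any vendor's implementation; all names and data here are hypothetical. The point is that the filter runs during the sequential scan itself, so only qualifying rows ever reach the query processor and no index is consulted.

```python
# Illustrative sketch: an appliance-style scan applies the selection predicate
# at the storage layer, so only qualifying rows travel upstream.
# All names and data are hypothetical.

def storage_scan(rows, predicate):
    """Sequentially scan rows at the storage layer, yielding only matches."""
    for row in rows:
        if predicate(row):
            yield row

# Example: select call detail records longer than 60 seconds, with no index.
call_records = [
    {"caller": "A", "duration": 45},
    {"caller": "B", "duration": 120},
    {"caller": "C", "duration": 90},
]

long_calls = list(storage_scan(call_records, lambda r: r["duration"] > 60))
print(long_calls)  # only the records for callers B and C
```

Because every row is read sequentially anyway, the cost of the scan is the same regardless of how selective the predicate is – which is why this approach favors the large-scan, low-concurrency workloads described above.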
For processing complex queries against many terabytes of data, appliances can outperform more generalized hardware by huge margins. There is no question that data warehouse appliances are well designed for large data marts that involve a small number of users issuing very complex queries that scan and analyze large amounts of data. Call detail record and point-of-sale databases are a good fit here.
So far, I have discussed the two extremes of application processing – high volume business transaction processing and complex query processing against large data warehouses.
Things become more complicated when organizations implement mixed workloads. There are two types of mixed workloads. The first type is a business intelligence query workload involving both complex and simple queries. This is often the workload that occurs when the number of users (and thus query mix) increases and as the processing moves from accessing a data mart to an enterprise data warehouse.
Mixed query workloads usually require more indexes and better relational optimization to handle simple queries efficiently. These workloads also need more intelligent workload management because you don’t want longer running complex queries to swamp machine resources and lock out the handling of simple queries.
When selecting a database system for use in a mixed query environment, it is important to carefully evaluate the workload management capabilities of the product. Three key factors here are:
- A facility to separate queries into different processing streams and to control the resource priorities of each stream;
- A facility to automatically cancel or lower the priority of queries that are consuming too many resources;
- A capability to predict query performance and control the priority of query processing based on that prediction.
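The three capabilities above can be sketched together in a minimal classifier, assuming the optimizer supplies a cost estimate per query. The class, thresholds, and stream names here are hypothetical illustrations, not features of any particular product.

```python
# A minimal sketch of the workload-management ideas above, assuming the
# optimizer provides an estimated cost per query. Names are hypothetical.

from dataclasses import dataclass

@dataclass
class Query:
    sql: str
    estimated_cost: float  # optimizer cost units (an assumed metric)

class WorkloadManager:
    def __init__(self, complex_threshold=100.0, cancel_threshold=10_000.0):
        self.complex_threshold = complex_threshold
        self.cancel_threshold = cancel_threshold

    def classify(self, query):
        """Route a query to a processing stream based on predicted cost."""
        if query.estimated_cost > self.cancel_threshold:
            return "rejected"        # runaway query: cancel or defer it
        if query.estimated_cost > self.complex_threshold:
            return "complex-stream"  # lower priority, fewer concurrent slots
        return "simple-stream"       # high priority, many concurrent slots

wm = WorkloadManager()
print(wm.classify(Query("simple lookup", 5.0)))        # simple-stream
print(wm.classify(Query("large join/scan", 2_500.0)))  # complex-stream
print(wm.classify(Query("runaway query", 50_000.0)))   # rejected
```

Keeping simple queries in their own high-priority stream is what prevents the long-running complex queries from swamping machine resources and locking out the short ones.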
While there is no question that appliance vendors are adding these capabilities, it will be important to evaluate their use in your environment against live data. In many cases, a proof of concept will be required.
A second type of mixed workload involves modifying data warehouse data in parallel with business intelligence query processing. This type of workload may arise when trickle-feeding updates to a data warehouse to overcome reduced batch windows or when supporting low-latency data for operational business intelligence.
Mixed query and data modification workloads are the most difficult to deal with because the database system has to be good at both business intelligence and the equivalent of business transaction processing. Partitioning data by time and moving new data into a separate partition can alleviate some performance problems, but the overall quality of the product’s workload management, locking protocols, index management and relational optimization play a big part in achieving good performance in such an environment.
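The time-partitioning tactic above can be sketched as follows: trickle-fed rows land in the partition for the current period, while queries read closed partitions, so loads and queries rarely touch the same data. This is a simplified illustration under assumed names, not a description of any product's partitioning scheme.

```python
# A sketch of partitioning warehouse data by time, as described above:
# new rows are routed to the partition for their period, while queries
# can scan closed partitions undisturbed. Names are hypothetical.

from collections import defaultdict
from datetime import date

partitions = defaultdict(list)  # partition key (year, month) -> rows

def load_row(row):
    """Route an incoming row to the partition for its event date."""
    d = row["event_date"]
    partitions[(d.year, d.month)].append(row)

def query_partition(year, month):
    """Read a single partition; closed months see no concurrent loads."""
    return partitions[(year, month)]

load_row({"event_date": date(2007, 5, 31), "amount": 10})
load_row({"event_date": date(2007, 6, 1), "amount": 20})  # new month, new partition

print(len(query_partition(2007, 5)))  # 1 row in the closed May partition
```

The benefit is reduced lock contention and cheaper index maintenance on the hot partition, but as noted above it only alleviates the problem; the product's locking and workload management still determine overall performance.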
Even mixed query and data modification workloads can vary. In some cases, the volume of data modifications far exceeds the number of queries, whereas in other situations, the reverse is true. The balance between data modification and data querying will again affect the choice of product and hardware used.
In summary, you can see that if your workload is at the extremes of the application spectrum, choosing hardware and software for handling the workload is simpler. Things become more difficult when deploying a mixed query workload and especially when using workloads involving mixed query and data modification operations. Unfortunately, there is rarely one system that fits all cases, and compromises will have to be made based on business priorities. In some situations, multiple systems may have to be used.