Clustering, SMP and MPP in Very Large Data Warehouses (VLDWs)

Dan E. Linstedt

This article originally appeared on the BeyeNETWORK.

Oracle touts RAC as a solution for multiterabyte data warehouses and DSS (decision support system) implementations. It’s not the Oracle technology that is in question; it’s the notion of clustering machines to achieve a very high scale of data warehousing support. The acronym “RAC” stands for real application clusters, and clustering is a form of linking hardware together such that all the “nodes” (machines) in the same cluster look and act like one big machine. Unfortunately, this is where organizations are misled into thinking they can scale clusters by buying many smaller machines, rather than investing up front in the big-iron boxes needed to handle the horsepower of very large data warehouses.

Anything at or above 45 terabytes in a single data warehouse is deemed a very large data warehouse (VLDW), as long as the loads are large and the data is accessed. Forty-five terabytes of storage whose data is not accessed is not considered a VLDW. To its credit, Oracle has established VLDW systems in place, but most of these run on single large hardware instances (even in RAC installations), SMPs (symmetric multiprocessors) or LPAR (logical partitioning) / VPAR (virtual partitioning) on a mainframe.

A Basic Look at Competing Architectures
Before diving into specifics, let’s define some basic terms and architectural differences:
 

  • SMP (symmetric multiprocessing): single-node machines; scales up by adding CPUs; high-speed bus; fast bandwidth; no need to “share disk.”

Figure 1: SMP Processing

  • Clustered: requires shared disk across all nodes; memory and processing must be synchronized across all nodes; a high-speed interconnect is required.

Figure 2: Clustered Processing

  • MPP (massively parallel processing): usually independent SMP components; capable of scaling out and up (within a single SMP node); doesn’t share disk; splits processing into parallel components across the architecture.

Figure 3: Massively Parallel Processing
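
To make the shared-nothing idea concrete, here is a minimal Python sketch of hash-based data placement, the mechanism MPP engines commonly use to split processing. The node count, key format and hash function are illustrative assumptions, not any particular vendor’s implementation:

    import hashlib

    NODES = 4  # illustrative node count

    def node_for(key, nodes=NODES):
        # Hash the distribution key so rows spread evenly; each node
        # owns a disjoint slice of the data on its own private disk.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % nodes

    rows = [("cust-%04d" % i, i * 10.0) for i in range(1000)]

    slices = {n: [] for n in range(NODES)}
    for key, amount in rows:
        slices[node_for(key)].append((key, amount))

    # A scan or aggregate now runs on every slice in parallel, with no
    # shared disk and no cache to keep synchronized between nodes.
    for n, part in sorted(slices.items()):
        print("node %d holds %d rows" % (n, len(part)))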

The Pros and Cons
Based on this author’s experience, these are the trade-offs that are bound to appear. All of these pros and cons pertain to solutions of 15TB to 45TB or more; usually, below 45TB, not many of these issues become deal breakers.

SMP Processing Pros and Cons

Pros:

  • Large scalability
  • Extremely fast access
  • Logical partitioning
  • Single unit
  • Single point of management

Cons:

  • Upper size (processing and data) limit due to hardware bus size.
  • All CPUs and all RAM must be of the same make, speed and size.
  • The number of I/O channels must be kept in concert with the number of CPUs in order to avoid bottlenecking.
  • The amount of RAM should be at least 1.3x the number of CPUs once 32 CPUs are reached (costly).
  • Expensive solution to purchase outright.
  • Doesn’t appear possible to scale to MPP today (without special hardware).
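
Two of those cons are really sizing arithmetic. The Python sketch below simply encodes the rules of thumb above; note that reading “1.3x the number of CPUs” as gigabytes of RAM per CPU, and the four-CPUs-per-channel ratio, are my assumptions for illustration:

    def min_ram_gb(cpus):
        # Rule of thumb from the article: at least 1.3x the CPU count
        # in RAM once 32 CPUs are reached (read here as GB per CPU).
        if cpus < 32:
            raise ValueError("rule of thumb applies from 32 CPUs up")
        return 1.3 * cpus

    def min_io_channels(cpus, cpus_per_channel=4):
        # "I/O channels kept in concert with CPUs": assume one
        # dedicated channel per four CPUs (illustrative ratio).
        return -(-cpus // cpus_per_channel)  # ceiling division

    for cpus in (32, 48, 64):
        print("%d CPUs -> >= %.0fGB RAM, >= %d I/O channels"
              % (cpus, min_ram_gb(cpus), min_io_channels(cpus)))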

Clustering Pros and Cons

Pros:

  • Mid-tier scalability
  • Large clustered SMP machines work similarly to large MPP nodes
  • Single point of management
  • Cheaper than single SMP or MPP as an entry point

Cons:

  • Hot disk spots with large batch loads.
  • Cannot effectively control “slices” of data via partitioning, making it a challenge to “balance” data sets across bandwidth.
  • Hot disk spots with large data queries (a query that aggregates many hundreds of thousands of rows has to synchronize all of that data in RAM before aggregating it).
  • Once a mixed workload is put on a single cluster and large data sets are being written and “collected,” the engine spends more time synchronizing across the network than it does answering the needs of either the load or the query.
  • Increased volumes mean increased network traffic, which in turn means increased CPU utilization. It also means a law of diminishing returns: the higher the volume (both query and load), the closer the system runs to the “top of the curve” of performance and responsiveness.
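
That last con, diminishing returns, can be illustrated with a toy throughput model in Python. The per-node rate and per-pair synchronization cost below are invented coefficients chosen only to show the shape of the curve the article describes, not measurements of any product:

    def cluster_throughput(nodes, per_node=100.0, sync_per_pair=4.0):
        # Each node adds raw capacity, but every pair of nodes must
        # also synchronize shared state across the interconnect, so
        # overhead grows with the number of node pairs.
        pairs = nodes * (nodes - 1) / 2
        return max(0.0, nodes * per_node - sync_per_pair * pairs)

    for n in (2, 4, 8, 16, 32, 64):
        print("%2d nodes -> throughput %6.0f" % (n, cluster_throughput(n)))
    # Output climbs, flattens, then falls: past a point, each added
    # node spends more capacity synchronizing than doing useful work.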

MPP Pros and Cons

Pros:

  • Scale-out solution
  • Doesn’t require big-iron nodes
  • Single point of management
  • Everything runs in parallel, so supporting a mixed workload is easy

Cons:

  • To add another node, it is necessary to repartition and redistribute data sets.
  • Costly to engineer; the resulting high up-front costs can be an inhibitor to entry.
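
The first con is easy to demonstrate: with naive hash-modulo placement, going from N to N+1 nodes changes the home node of most rows, which is why adding a node forces a repartition and redistribution. A small sketch (key names and node counts are illustrative):

    import hashlib

    def home(key, nodes):
        # Same hash-modulo placement an MPP engine might use.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % nodes

    keys = ["order-%d" % i for i in range(100000)]
    moved = sum(1 for k in keys if home(k, 8) != home(k, 9))
    print("%.0f%% of rows change nodes going from 8 to 9"
          % (100.0 * moved / len(keys)))
    # Roughly nodes/(nodes + 1) of the data moves with modulo
    # placement; consistent hashing reduces the churn, but some
    # redistribution is unavoidable in a shared-nothing design.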

Blurry Lines Between the Technologies
If you are thinking about VLDWs, then you must first be thinking “big iron.” It doesn’t matter whether you are thinking SMP, cluster or MPP (SMP nodes hooked together). Big iron (lots of CPU, lots of RAM and high-speed disk I/O) is just the ticket to blur the lines between architectures. Machines of different sizes often cause problems when faced with data sets of different sizes, meaning actively used data sets (loaded, queried or both).

When we look at a VLDW, let’s consider today’s machine sizes and speeds (as of the writing of this article, which of course will be surpassed shortly thereafter). MPP nodes have much smaller per-node requirements. The SMP node sizes below apply to “big iron” running a single SMP instance, or running as part of a clustered environment.

SMP or Node Machine Size

Small:
  • 16 or fewer CPUs, non-dual-core
  • 18GB RAM or less
  • 3 or fewer dedicated I/O fiber channels
  • 8GB or less “RAM cache” on disk units
  • Under 300MB per second throughput on each I/O channel

Medium:
  • 16 to 24 CPUs, non-dual-core
  • 18 to 48GB RAM
  • 5 to 8 dedicated I/O fiber channels
  • 12 to 24GB RAM cache on disk units
  • 400MB to 2GB per second throughput on each I/O channel

Large:
  • 24 to 64 CPUs, dual-core or not
  • 48 to 128GB RAM (or more)
  • 12 to 64 dedicated I/O fiber channels
  • 24 to 168GB RAM cache on disk units
  • 2GB per second or better throughput on each I/O channel

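These configurations translate directly into scan-time arithmetic. The Python sketch below assumes the channels deliver their full rated throughput in aggregate, which real systems rarely do, so treat the results as best-case bounds:

    def full_scan_hours(data_tb, channels, mb_per_sec):
        # Hours to stream the whole database once through all
        # dedicated I/O channels, assuming perfect parallelism.
        total_mb = data_tb * 1024 * 1024
        return total_mb / (channels * mb_per_sec) / 3600.0

    # "Small" box: 3 channels at just under 300MB/s each.
    print("small: %5.1f hours to scan 45TB" % full_scan_hours(45, 3, 300))
    # "Large" box: 64 channels at 2GB/s (2048MB/s) each.
    print("large: %5.1f hours to scan 45TB" % full_scan_hours(45, 64, 2048))

The two orders of magnitude between those answers are a big part of why the “large” class keeps appearing in the VLDW reports discussed below.
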
When we look at Winter Corporation’s reports on VLDW, they all specify “large”-class machines used for 15 to 48TB+ of information. In fact, in Winter Corporation’s “Sun Enterprise Systems: A Scalable Platform for Very Large Data Warehousing with Oracle” (page 9, paragraph 3), the Sun Fire SMP described has 72 dual-core CPUs, 576GB of RAM and 72 hot-swap I/O channels. This is a very large SMP machine on which to run Oracle, but it’s not just Oracle: any relational database engine needs this kind of power to handle big data problems.

Keep in mind that a VLDW isn’t measured by “dead data simply sitting in a table somewhere.” A VLDW is measured by several aspects, including: the amount of data being loaded over the course of any given hour (or batch cycle), the amount of data being queried or utilized from within the database, and the workload mix (for example: How fast can queries return while large amounts of data are being loaded simultaneously?).
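
That activity-based definition is easy to state as a predicate. In the Python sketch below, the 45TB figure comes from this article; the activity floors standing in for “large loads” and “data is accessed” are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class WarehouseProfile:
        size_tb: float
        load_gb_per_hour: float    # data arriving per batch hour
        query_gb_per_hour: float   # data users actually read

    def is_vldw(w):
        # 45TB threshold per the article; the activity floors are
        # assumed stand-ins for "large loads" and "data is accessed."
        busy = w.load_gb_per_hour >= 100 and w.query_gb_per_hour >= 100
        return w.size_tb >= 45 and busy

    print(is_vldw(WarehouseProfile(61, 500, 2000)))  # True: big and busy
    print(is_vldw(WarehouseProfile(61, 0, 0)))       # False: dead data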

Big machines have lots of horsepower, and because of this, synchronization across two large machines in an Oracle RAC environment is not the same as synchronization across four or six small machines made up to look like two big machines. Oracle RAC can scale, there’s no doubt about it, but scaling it within large SMPs is just like putting Oracle on single SMP big-iron boxes: the operator gets the chance to partition, parallelize and synchronize much as in an MPP environment. In other words, the sum of the parts will not necessarily equal the whole (when comparing small- or medium-sized boxes to one or two large environments).

What Oracle says:

“Oracle’s key innovation is a technology called cache fusion, originally developed for Oracle9i Real Application Clusters. Cache fusion enables nodes on a cluster to synchronize their memory caches efficiently using a high-speed cluster interconnect so that disk I/O is minimized. The key, though, is that cache fusion enables shared access to all the data on disk by all the nodes on the cluster. Data does not need to be partitioned among the nodes.” (Source: Oracle Real Application Clusters 10g: The Fourth Generation by Angelo Pruscino, Oracle, and Gordon Smith, Oracle)

In other words, cache fusion is a huge success with smaller numbers of large nodes and can be deadly with larger numbers of small nodes, due to the increase in network traffic required to keep all the caches synchronized. If you’re not lucky enough to afford large nodes, then you must take into account the next bottleneck: I/O across shared disk. Even the experts agree that RAC with traditional disk is not for the fainthearted. The book Oracle RAC & Grid Tuning with Solid State Disk: Expert Secrets for High Performance Clustered Grid Computing by Mike Ault and Donald K. Burleson was the subject of a blog review, and both the book and the review provide more information on using SSD (solid state disk) to overcome the performance issues of the RAC solution.
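
Why node size matters so much to cache synchronization can be sketched with a simple probability argument: hold total CPU capacity constant and vary how it is packaged. The 64-CPU total and the uniform-access assumption in this Python sketch are illustrative, not drawn from Oracle’s documentation:

    def off_node_share(nodes):
        # If the reader of a block and its current holder sit on
        # random nodes, the transfer must cross the interconnect
        # (instead of staying in local cache) 1 - 1/nodes of the time.
        return 1.0 - 1.0 / nodes

    # The same 64 CPUs packaged as fewer/larger or more/smaller nodes:
    for nodes in (2, 4, 8, 16):
        print("%2d nodes x %2d CPUs -> %3.0f%% of cache transfers cross the wire"
              % (nodes, 64 // nodes, 100 * off_node_share(nodes)))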

Solid state disk (RAM-type disk) is usually too expensive a solution, and it frequently doesn’t scale well compared with investing in a large scale SMP node for RAC. RAC works best when synchronization doesn’t have to happen across “many” nodes, or is taken care of within a single large SMP node. Sun has had tremendous success here (as have the HP Superdome and the mainframe with LPARs), as noted in the previously mentioned Winter Corporation report:

“Key to Sun’s capability … is its large processor count symmetric multiprocessor (SMP) architecture. … Sun’s SMP architecture relies on … high bandwidth connectivity … with reasonable latency in the few hundreds of nanoseconds.”  (Winter Corporation, “Sun Enterprise Systems: A Scalable Platform for Very Large Data Warehousing with Oracle”, Page 9, paragraph 4).

Before we comment any further on SMP clustering, let’s see what Teradata says about MPP versus SMP:

“Why does MPP scale differently than the alternatives? 
Every other architecture sacrifices scalability for ease of programming. Each additional CPU that has to share memory increases the potential contention for access to a memory location. The hardware has to keep track of which CPU is using which memory locations so that errors and old data do not creep in. The operating system (OS) also has more sharing and resource contention to manage, thus consuming more system resources as it handles more CPUs concurrently.

These liabilities cause both SMP and NUMA to each have a diminishing-returns curve: Each CPU added to the configuration delivers less computing power than the previous one until additional CPUs do not provide any more power to the application. This diminishing-returns curve violates both perfect scale-up and speed-up. Neither can ever be delivered by an SMP or NUMA architecture. 

The MPP architecture eliminates the diminishing-returns curve (with software that knows how to take advantage of it) and is the only platform physically capable of delivering perfect scale-up and speed-up. Each additional node adds system resources that are not shared with those already in the configuration. None of the new resources are used to manage sharing, so they all are applied directly to the application.”
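
The diminishing-returns curve Teradata describes is essentially Amdahl’s law with a contention term added, which is easy to plot numerically. In the Python sketch below, the serial fraction and per-CPU contention coefficient are assumptions picked only to reproduce the shape being described:

    def smp_speedup(cpus, serial=0.05, contention=0.002):
        # Amdahl-style model: a fixed serial fraction, a parallel
        # fraction that divides across CPUs, and a contention penalty
        # that grows with every CPU sharing memory.
        return 1.0 / (serial + (1.0 - serial) / cpus + contention * cpus)

    for c in (1, 8, 16, 32, 64, 128):
        print("%3d CPUs -> speedup %5.1f" % (c, smp_speedup(c)))
    # The curve rises, peaks and then bends back down: added CPUs
    # eventually return nothing, exactly the behavior described above.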

Now that’s not to say that MPP is for everybody. MPP has its own issues of balancing data across each of the nodes, and part of the up-front cost includes PhD-level engineering for parallel algorithms, high-speed interconnect hardware and high-speed disk access.

My Point
It takes more than just a pretty RDBMS software engine to “finish the job nicely.” Richard Winter drives this point home when discussing Sun’s total package and what it means to Oracle RAC:

“The best cluster performance can be obtained by using the scalable coherent interface (SCI) interconnect that provides 200MB per second bandwidth and exploits remote shared memory to insure low-latency (4 milliseconds).”

In other words, making RAC truly work for any VLDW environment requires hardware, incredibly fast and incredibly large hardware, to make clustering a reality. From an MPP standpoint, it’s pay me now or pay me later: because of the law of diminishing returns within both clustered and big-iron SMP architectures, the volume of data can (and will) eventually overwhelm the size of the machine, and scaling out (adding nodes) in any of these solutions (except MPP) will cause bottlenecks that only get worse as additional nodes are added.

Where has Oracle RAC Succeeded?
There is a case study cited in “Emerging Trends in Business Intelligence” (see slide #15) by Robert Stackowiak, Senior Director of Business Intelligence for Oracle. In this case, the facts are plainly visible:

  • 61TB Query DW on Oracle at Amazon.com

  • 100,000 queries/week (mostly complex ad hoc)

  • Amazon runs 2 identical 61TB+ query DWs loaded concurrently. Configuration for each is:
    • 16 node RAC/Linux cluster

    • Oracle10gR1 RAC using ASM on Red Hat Enterprise Linux 3

    • 16 HP DL580s, each w/ 4 3-GHz CPUs

    • 71 HP StorageWorks MSA1000

    • 8 32-port Brocade switches

    • 1 Gigabit interconnect

  • DW Metrics

    • Each holds 51TB raw data (growing at 2x per year)

    • Each is 61TB total database size w/ only 2TB indexes

    • 71TB total disk for each

Notice the fast hyper-threaded CPUs, the extremely large disk array and the super-fast interconnect (across a large number of Brocade switches). Amazon has mitigated any “network bottleneck” due to synchronization with the fast network and the high number of parallel interconnects, and has mitigated any disk bottleneck with a very large storage array and a large RAM contingency. Of course, the Oracle installation has been tuned extensively; installing and configuring Oracle RAC to perform takes a lot of knowledge and highly trained professionals. Amazon’s RAC environment provides failover across an identical network.

What About Scalability at a Later Date?
Keep in mind that “adding nodes” to a clustered environment usually requires the exact same hardware (in order to avoid the “slowest hardware dictates overall performance” problem) and the exact same software release. To scale up a clustered environment, all hardware nodes need to be scaled at the same time. Most MPP architectures, by contrast, can be scaled across multiple machines of varying size and shape, depending on the vendor’s OS and RDBMS engines.

Conclusions
Oracle RAC is not a bad solution; however, it appears that to meet true VLDW needs, it must be housed on an inordinately large set of SMP platforms. The cost of two such “large” platforms, along with the per-CPU licensing that Oracle charges, may be a prohibitive barrier to entry. Scaling smaller clustered SMP units seems doable (as the Amazon case points out); however, upgrading all units at the same time to go beyond the 70TB mark may prove to be another cost nightmare, not to mention that the migration path to get there may become a technical mud-bog.

Oracle (non-RAC) continues to make huge strides on big-iron SMP machines, as do Oracle’s competitors on MPP-based hardware (IBM, Teradata, Netezza, DATAllegro, etc.). Your corporation’s choice of “architecture and cost” should be weighed carefully against how “large” your environment will grow in the next 3 to 5 years. Then, the vendor (no matter which one) should sign a performance-based SLA where they agree to match the performance required from the hardware or architecture platform selected.

In case you’re wondering, my recommendation has been to select an MPP-based platform for limitless scale out because it allows (if necessary, but not desired) multi-versioned, multi-configured SMP hardware underneath.

  • Dan Linstedt 

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon’s blog at http://www.b-eye-network.com/blogs/linstedt/.
