This article originally appeared on the BeyeNETWORK.
When building a new data warehouse, most companies concentrate on building a star schema as perfectly as possible. Once the data warehouse has been designed
Requires Free Membership to View
No one can realize exactly what this new solution costs in load time until the database has been designed, the application has been built and the data has been loaded. Even assuming the data loads within the established load window, what happens as the amount of data being received increases? Another possible scenario is that the loads are fine until management decides they want near-real-time reporting. Now what do you do?
Here are some tips and tricks to speed up loading after the database has already been designed and put into production:
Change the Code
When going back over the ETL code, you will often discover it could be modified to run more effectively. This is the first step to take, before trying any of the more esoteric solutions. You should determine which pieces of the code are running slowly, and if these procedures could be improved by simple code modification. Do not just look at the SQL. Consider temporary tables, staging tables, external sorts and even different selection criteria when pulling the data from the source tables.
Drop and Rebuild Indexes
At times, it is actually faster to drop and rebuild indexes than it is to try to load a table with
many indexes. You may also be able to rebuild the indexes at a more opportune time in the load
process.
Parallelization
Analyze which load procedures are dependent on others, then create a spreadsheet with the load
times for each procedure and their dependencies. The idea here is to balance the load procedures to
run those that can be in parallel. Be careful not to compromise the speed of the load for each
process. For instance, one particular procedure may take one hour to run but uses five processes.
In contrast, another procedure may take one hour to run and only use one process. This second
example is more likely to run efficiently in parallel with another procedure. Again, if a procedure
is only using one process, closely examine it to see if the procedure itself can be
parallelized.
Break Up the Loads
If the load extends past the load window, you might be able to load some of the data earlier in the
day, perhaps to some of the dimension tables. If the loads to these tables are quick, it may not
affect querying enough for the user community to notice. This will free up the load window for the
CPU intensive procedures. If any tables are only used infrequently, they may also be candidates for
early or late loading.
Stage Tables
If you are loading directly into your production tables, you may want to consider loading into
staging tables instead. Loading into the staging tables can be done while queries are being
performed against the production tables. By performing the extraction and transformation processing
between the operational systems and the staging tables, the loads to the actual production tables
should be very quick, as they will be simple inserts, updates, and deletes.
Changed Data Capture
While all of these suggestions can be used for batch processing, they have limited value for
near real-time reporting needs. If this is required, there are a few ways to fit your loads in
between the queries. You may be able to add columns to the source tables that signify when a record
has been modified. You may also be able to create triggers that send the data to staging tables on
the source system which are then queried by a job on the target system.
If you are unable to modify the source systems, however, you may need to consider one of the tools that will perform changed data capture. These tools actually read the log files or change tables that the databases use to process changes to tables. By accessing these files, the tool doesn’t significantly affect the source system, none of the tables need to be modified and no triggers need to be built. The entire process is non-intrusive. You must create tables on the target system to handle the captured data, but you will be getting the data as it is created on the source system. This is as close to real-time as you can get without querying the source databases.
Partition Fact Tables
If your fact tables are not already partitioned, you may want to consider doing this. Loading into
a partition is faster than loading into the entire table. If your indexes are also partitioned, you
will only be loading to one partition in an index too. Laying out partitions across disks is an art
that can devour many resource hours. I recommend building a script that automatically creates
partitions, and assigns them based on a particular algorithm.
Move the Tables
Poor database layout is something that can adversely affect loads. If you are trying to load to
several tables all on the same disk, you will have I/O contention. If your indexes are on the same
disk as your tables, you will encounter I/O contention again. Make sure your database is laid out
correctly, with the tables that will be loaded in parallel located on different disks, and the
indexes on other disks.
Last Resort
You may need to add processors, memory and disks if you have tuned your load process as far as
possible and it still overwhelms the system. System diagnostic tools, such as iostat or third party
software, can help you determine the level of I/O and memory contention. Some of these tools
include graphics that greatly help in deciphering the data. If you add hardware, you may need to
balance the load again. If you’ve added disks, you may need to restructure your database layout to
take full advantage of the additional space. When it is used properly, the additional space should
reduce I/O contention. Added processors sometimes help with I/O contention, and added memory may
reduce memory contention. It may allow you to do more processing in memory, and perhaps increase
the size of your cache.
While you can tweak your load process with the various database tuning parameters, this should also
be a last resort. Remember when tuning the database, you are tuning it for the queries, not just
the load process. What might speed up the load may slow down the queries. In any case, adding
processors, memory or disks, as well as changing the database tuning parameters or modifying the
init file, should only be done as a last resort. In most cases, the previous steps will be enough
to significantly improve the speed of your loads.
Conclusion
When faced with a load process overflowing the load window, it is tempting to simply throw more
memory or disks at it. But step back and analyze the process first. It is oftentimes helpful to go
through these outlined steps and see how far you can tune your load process. You might be
surprised. Of course, if you suddenly find you have to switch to near real-time loads, it is
already obvious that your current load process won’t work! However, a step-by-step analysis will
still help determine which direction you should take. Although the load process is usually the last
thing people consider in software development projects, it is nonetheless one of the most
important. Take the time to analyze it.
The load balancing process is an iterative process, which means you must monitor it and return
through the steps above. It may only be necessary every few months, or you may need to check it
monthly, depending on your data load. The process will most likely degrade over time, as more data
gets loaded. You may eventually find that you’ve reached the “Last Resort” stage and need to buy
new hardware. By monitoring the load process, however, you will know long before it becomes an
emergency. Then you can include it in your plan for the year.
Business Intelligence Strategies for the CIO
Join the conversationComment
Share
Comments
Results
Contribute to the conversation