This article originally appeared on the BeyeNETWORK.
When building a new data warehouse, most companies concentrate on building a star schema as perfectly as possible. Once the data warehouse has been designed and built, you must usually build indexes to improve those queries that didn’t appreciate the beauty of the star schema. The Extract, Transform & Load (ETL) team focuses on getting the data out of the source systems and into the correct format for the data warehouse. It takes a great deal of trial and error to get the transforms right and validate all of the data.
No one knows exactly what the new solution costs in load time until the database has been designed, the application has been built and the data has been loaded. Even assuming the data loads within the established load window, what happens as the amount of data being received increases? Another possible scenario is that the loads are fine until management decides they want near-real-time reporting. Now what do you do?
Here are some tips and tricks to speed up loading after the database has already been designed and put into production:
Change the Code
When going back over the ETL code, you will often discover it could be modified to run more efficiently. This is the first step to take, before trying any of the more esoteric solutions. Determine which pieces of the code are running slowly and whether those procedures could be improved by simple code modification. Do not just look at the SQL. Consider temporary tables, staging tables, external sorts and even different selection criteria when pulling the data from the source tables.
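Before changing anything, you need to know where the time goes. A minimal sketch of per-step timing is shown below; the step names and bodies are placeholders for your own procedures, and in practice each `with` block would wrap a real extract or transform call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name, timings):
    """Record the wall-clock time of one ETL step."""
    start = time.perf_counter()
    yield
    timings[step_name] = time.perf_counter() - start

timings = {}
# Hypothetical steps -- substitute your own extract and transform procedures.
with timed("extract_orders", timings):
    rows = list(range(100_000))       # stand-in for pulling source rows
with timed("transform_orders", timings):
    rows = [r * 2 for r in rows]      # stand-in for a transform

# Report the slowest steps first, so tuning effort goes where it pays off.
for step, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{step}: {secs:.3f}s")
```

Once the slowest steps are obvious, the rest of the tips in this article tell you what to try on them.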
Drop and Rebuild Indexes
At times, it is actually faster to drop and rebuild indexes than it is to try to load a table with many indexes. You may also be able to rebuild the indexes at a more opportune time in the load process.
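The drop-and-rebuild pattern can be sketched as follows, using an in-memory SQLite database as a stand-in for the warehouse (the table and index names are made up). The index is built once over the final data instead of being maintained row by row during the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your warehouse database
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_amount ON fact_sales (amount)")

rows = [(i, float(i)) for i in range(50_000)]

# Drop the index, bulk-load, then rebuild it afterwards.
conn.execute("DROP INDEX idx_sales_amount")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_sales_amount ON fact_sales (amount)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)
```

On a real warehouse the same idea applies per load job: the rebuild can also be deferred to a quieter point in the load window.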
Balance the Loads
Analyze which load procedures are dependent on others, then create a spreadsheet with the load times for each procedure and their dependencies. The idea is to balance the load so that procedures that can run in parallel do so, without compromising the speed of each individual process. For instance, one procedure may take one hour to run but use five processes, while another may take one hour and use only one process. The second is the better candidate to run in parallel with another procedure. And if a procedure is only using one process, examine it closely to see whether the procedure itself can be parallelized.
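The dependency bookkeeping from that spreadsheet can be turned directly into a scheduler. The sketch below uses hypothetical procedure names and a thread pool: on each pass, every procedure whose dependencies have all finished runs in parallel with the others that are ready.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical load procedures and their dependencies -- replace with your own.
dependencies = {
    "load_dim_customer": [],
    "load_dim_product": [],
    "load_fact_sales": ["load_dim_customer", "load_dim_product"],
}

def run_procedure(name):
    time.sleep(0.01)            # stand-in for the real load work
    return name

completed = []
remaining = dict(dependencies)
with ThreadPoolExecutor(max_workers=4) as pool:
    while remaining:
        # Every procedure whose dependencies are all done can run now.
        ready = [p for p, deps in remaining.items()
                 if all(d in completed for d in deps)]
        for name in pool.map(run_procedure, ready):
            completed.append(name)
        for p in ready:
            del remaining[p]

print(completed)  # both dimension loads finish before the fact load starts
```

The same structure works whether `run_procedure` calls a stored procedure, a shell script or an ETL tool's job API.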
Break Up the Loads
If the load extends past the load window, you might be able to load some of the data earlier in the day, perhaps to some of the dimension tables. If the loads to these tables are quick, it may not affect querying enough for the user community to notice. This will free up the load window for the CPU intensive procedures. If any tables are only used infrequently, they may also be candidates for early or late loading.
If you are loading directly into your production tables, you may want to consider loading into staging tables instead. Loading into the staging tables can be done while queries are being performed against the production tables. By performing the extraction and transformation processing between the operational systems and the staging tables, the loads to the actual production tables should be very quick, as they will be simple inserts, updates, and deletes.
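A minimal sketch of the staging-table pattern, again using SQLite as a stand-in (the table names are invented): the slow extraction and transformation work lands in `stg_product` while users query `dim_product`, and the production load collapses to two fast set-based statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE stg_product (product_id INTEGER, name TEXT)")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget')")

# Extraction and transformation write here while production stays queryable.
staged = [(1, "widget v2"), (2, "gadget")]
conn.executemany("INSERT INTO stg_product VALUES (?, ?)", staged)

# The production load is now just a delete of superseded rows plus an insert.
conn.execute("""
    DELETE FROM dim_product
    WHERE product_id IN (SELECT product_id FROM stg_product)
""")
conn.execute("INSERT INTO dim_product SELECT product_id, name FROM stg_product")
conn.execute("DELETE FROM stg_product")
conn.commit()

products = conn.execute(
    "SELECT * FROM dim_product ORDER BY product_id").fetchall()
print(products)
```

Most databases offer a single-statement merge or upsert for the last step; the delete-then-insert form is used here only because it is portable.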
Changed Data Capture
While all of these suggestions can be used for batch processing, they have limited value for near real-time reporting needs. If this is required, there are a few ways to fit your loads in between the queries. You may be able to add columns to the source tables that signify when a record has been modified. You may also be able to create triggers that send the data to staging tables on the source system which are then queried by a job on the target system.
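The trigger approach can be sketched like this, with SQLite standing in for the source system and invented table names. Each insert or update on `orders` is copied into a staging table, which the target system's job then drains on its own schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source system
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("""CREATE TABLE stg_order_changes (
                    order_id INTEGER, amount REAL, change_type TEXT)""")

# Triggers copy every change into the staging table as it happens.
conn.execute("""
    CREATE TRIGGER trg_orders_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO stg_order_changes VALUES (NEW.order_id, NEW.amount, 'I');
    END
""")
conn.execute("""
    CREATE TRIGGER trg_orders_update AFTER UPDATE ON orders
    BEGIN
        INSERT INTO stg_order_changes VALUES (NEW.order_id, NEW.amount, 'U');
    END
""")

conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.execute("UPDATE orders SET amount = 12.5 WHERE order_id = 1")
changes = conn.execute("SELECT * FROM stg_order_changes").fetchall()
print(changes)
```

Note the cost this sketch makes visible: every write to `orders` now also writes to the staging table, which is exactly the overhead the log-reading tools described next avoid.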
If you are unable to modify the source systems, however, you may need to consider one of the tools that will perform changed data capture. These tools actually read the log files or change tables that the databases use to process changes to tables. By accessing these files, the tool doesn’t significantly affect the source system, none of the tables need to be modified and no triggers need to be built. The entire process is non-intrusive. You must create tables on the target system to handle the captured data, but you will be getting the data as it is created on the source system. This is as close to real-time as you can get without querying the source databases.
Partition Fact Tables
If your fact tables are not already partitioned, you may want to consider doing this. Loading into a partition is faster than loading into the entire table. If your indexes are also partitioned, you will only be loading to one partition in an index too. Laying out partitions across disks is an art that can devour many resource hours. I recommend building a script that automatically creates partitions, and assigns them based on a particular algorithm.
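Such a script can be as simple as a loop that derives partition names and boundaries from a date algorithm. The sketch below emits one monthly partition per statement; the DDL follows PostgreSQL's declarative range-partitioning syntax and the table name is made up, so adjust both for your own database.

```python
from datetime import date

def monthly_partition_ddl(table, year):
    """Generate one range partition per month, named by algorithm
    rather than by hand (PostgreSQL-style syntax assumed)."""
    statements = []
    for month in range(1, 13):
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        statements.append(
            f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start}') TO ('{end}');"
        )
    return statements

ddl = monthly_partition_ddl("fact_sales", 2024)
print(ddl[0])
```

Running the generator ahead of each load year keeps partition creation out of the load window entirely.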
Move the Tables
Poor database layout is something that can adversely affect loads. If you are trying to load to several tables all on the same disk, you will have I/O contention. If your indexes are on the same disk as your tables, you will encounter I/O contention again. Make sure your database is laid out correctly, with the tables that will be loaded in parallel located on different disks, and the indexes on other disks.
Last Resort
You may need to add processors, memory and disks if you have tuned your load process as far as possible and it still overwhelms the system. System diagnostic tools, such as iostat or third-party software, can help you determine the level of I/O and memory contention. Some of these tools include graphics that greatly help in deciphering the data. If you add hardware, you may need to balance the load again. If you've added disks, you may need to restructure your database layout to take full advantage of the additional space; used properly, it should reduce I/O contention. Added processors sometimes help with I/O contention, and added memory may reduce memory contention by letting you do more processing in memory and perhaps increase the size of your cache.
You can also tweak your load process with the various database tuning parameters, but remember that when tuning the database you are tuning it for the queries, not just the load process: what speeds up the load may slow down the queries. In any case, adding processors, memory or disks, as well as changing the database tuning parameters or modifying the init file, should only be done as a last resort. In most cases, the previous steps will be enough to significantly improve the speed of your loads.
When faced with a load process overflowing the load window, it is tempting to simply throw more memory or disks at it. But step back and analyze the process first. It is often helpful to go through these outlined steps and see how far you can tune your load process. You might be surprised. Of course, if you suddenly find you have to switch to near real-time loads, it is already obvious that your current load process won't work! However, a step-by-step analysis will still help determine which direction you should take. Although the load process is usually the last thing people consider in software development projects, it is nonetheless one of the most important. Take the time to analyze it.
Load balancing is an iterative process: you must monitor it and return through the steps above. This may only be necessary every few months, or you may need to check it monthly, depending on your data load. The process will most likely degrade over time as more data gets loaded. You may eventually find that you've reached the "Last Resort" stage and need to buy new hardware. By monitoring the load process, however, you will know long before it becomes an emergency. Then you can include it in your plan for the year.