This article originally appeared on the BeyeNETWORK.
Happy New Year! During January, many people advocate making resolutions for improvement: Feng
Instead, I will advocate doing less, storing more and essentially adding "pounds" to the telecom business intelligence and data warehousing data store, if not for the entire organization. I want to advocate the selected admission of "uncleansed" and "untranslated" data into the data warehouse. In essence, this will remove part of the "T" from ETL.
Over the past few months, I have found some great examples where too much of a good thing, such as data cleansing or data transformation, can actually decrease the value of data in a business intelligence organization. In these situations the raw data was actually better than the translated data. This was because the people looking at it were actually looking for bad data, not "good" data.
For example, telecom detail records (CDRs, IPDRs, etc) are often cleansed and translated to correct data errors during the ETL process into a data warehouse. This process will correct errors in data, and remove calls that have no accounting value. Two good examples of removed calls are duplicate detail records and zero length detail records. The resulting cleansed detail records theoretically reflect what a telecom service provider's customers meant to do with their services based on a certain set of criteria.
This is great if you want to minimize the number of detail records in your business intelligence data store, or if you want to provide information about the intended use of services. But if you are looking for information outside of those parameters, you do not have the necessary information due to the translation process.
Quick--How do astronomers see black holes with a telescope? Ahhh, it is a trick question. They do not actually "see" black holes. Astronomers see an absence of what they expect to see when a black hole is between them and the expected object. If you were to put an ETL process around that type of data, one might fill in the missing data with the expected data. In this case, you would miss the existence of the black hole by translating the data.
For telecom service providers, the ETL process on detail records can mask the existence of similar events. There can be value in duplicate records, zero length records and other types of un-intended usage events. However, it is prohibitive to go back and reload/reprocess billions of records just to find those "black holes."
What is my solution to this problem?
One solution would be to load all of the raw detail, including duplicated, corrupted and "valueless" records, as well as the translated detail records. However, this more than doubles the amount of detail record information in a business intelligence environment. When you are talking about billions of daily detail records, they can add up very quickly. This is great news for data storage vendors, but not if you are responsible for maintaining that much data.
Another solution would be to continually modify the ETL process to ensure that--the proper data is present in the data warehouse. However, this reminds me of one of my favorite Curly Haired Boss quotes from Dilbert:
I want to be notified of all unplanned outages in advance
It is hard to determine where the next black hole or interesting observation regarding the detail record data will come from, and then the data in the data warehouse is not as valuable as it could be. Also, your ETL development team will probably stop taking your phone calls and emails.
What is my solution to this problem?
I would advocate loading all the detail record information unless it is totally useless. An arbitrary number would be any detail record with over 25% valid, non-corrupted data. This would be loaded into the business intelligence environment with minimal adjustment or translation to the data. This provides a middle ground so you aren't more than doubling detail record retention, and you would still have most of the raw records for the detection of both current and future black holes.
Incidentally, the storage vendors are still getting the better end of this deal. Similarly, the ETL team will probably still stop taking your phone calls/
So for 2006, let's remove some of the "T" from our ETL; and let's add some unwanted pounds to our business intelligence and data warehousing environment.
Quick Notes and Observations
2006 is already becoming an interesting year for telecom service providers and their business intelligence organizations. To all who say there is "gold in them dar hills" when it comes to data mining the usage records of customers, briefly look at the current situation with the National Security Agency and their call monitoring of and to US citizens.
While I am not taking a position on the NSA's activities, the amount of media coverage and strong public opinion in both directions show the potentially slippery slope when mining data from usage records. This is particularly true when you are a publicly traded company with millions of privacy sensitive customers.
Telecom service provider business intelligence organizations should take a step back and question the risk/reward proposition on data mining activities. But don't give away the keys of the kingdom to the lawyers in the risk management group. Be mindful of the risks, but also be bold to find the business intelligence gold in those hills of data.
John has more than 10 years of information technology and consulting experience in positions including business intelligence subject-matter expert, technical architect and systems integrator. Over the past eight years, he has gained a wealth of business and information technology consulting experience in the telecommunications industry. John specializes in business intelligence/data warehousing and systems integration solutions. John may be contacted by email at John.Myers@BlueBuffaloGroup.com.