LAS VEGAS -- The road to better business intelligence (BI) doesn't have to begin with a major data warehousing initiative, according to attendees at this week's Data Warehousing Institute (TDWI) conference and one industry expert.
At the very least, said Joshua Greenbaum, principal analyst with Enterprise Applications Consulting in Daly City, Calif., companies that must build a data warehouse -- a historical data repository for the purpose of analysis and reporting -- don't need to pour as much money into it as they may think.
"One of the fallacies has been that BI has to start with a very expensive and complex data warehouse," Greenbaum said in a phone interview. "There has been a lot of money spent on data warehouses and not necessarily much value gotten out of them."
Before jumping into a data warehousing project, Greenbaum said companies should first see how far they can get with simple analyses of transactional data -- analyses that do not require large quantities of historical data.
"Try as hard as you can not to build a data warehouse and see how far you can get doing really smart analysis without just piling data on top of data," Greenbaum said. "There are data warehouses out there that are multi-terabytes in size, and I assure you they're not necessarily providing good analysis."
"If you're, say, Edge Dynamics, and you want to check order status, you're not going to be building a big data warehouse to do that," he said. "You're going to be building a very small operational data store that is going to let you understand the difference between a good order and a bad order. But that's not a big data warehouse."
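The kind of lightweight check Greenbaum describes can run directly against a transactional table, with no warehouse in sight. The sketch below is illustrative only -- the table, columns, and "bad order" rules are hypothetical, not anything Edge Dynamics or Greenbaum specified:

```python
import sqlite3

# Hypothetical transactional "orders" table -- no data warehouse involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        quantity   INTEGER,
        unit_price REAL,
        status     TEXT
    );
    INSERT INTO orders VALUES
        (1, 'Acme Corp', 100, 9.50, 'shipped'),
        (2, 'Acme Corp', -5,  9.50, 'open'),   -- negative quantity: suspect
        (3, 'Globex',    20,  NULL, 'open');   -- missing price: suspect
""")

# A simple "good order vs. bad order" rule, run straight on live data:
bad_orders = conn.execute("""
    SELECT order_id, customer
    FROM orders
    WHERE quantity <= 0 OR unit_price IS NULL
    ORDER BY order_id
""").fetchall()

print(bad_orders)   # -> [(2, 'Acme Corp'), (3, 'Globex')]
```

A query like this answers the operational question -- which open orders need attention right now -- without accumulating any history at all, which is Greenbaum's point.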
The 'logical' way to get started on data warehousing
Companies that do have the need for a full-fledged data warehouse should remember to follow a "natural progression" that begins with logical modeling, said Anthony Politano of Business Edge Solutions, one of the two main keynote speakers at the TDWI conference here.
In many respects, Politano said, building a data warehouse for reporting is just like building, for example, a standard DB2 database for OLTP, which, he added, is something that companies have historically done well.
"You still need to have a logical model and a physical model and it bugs the heck out of me that so many of the projects I walk into are in trouble [because they don't have] a logical data model," Politano said. "They started right away by going to the DBAs and saying, 'build me a star schema.'"
Politano said the key to a successful data warehouse is to think before doing.
"Model the heck out of this stuff using tools like [Entity Relationship for Windows] and go through a natural progression of a three-phase data model," he explained. "Get a relational data model, convert it to a technically specific data model, and then convert it over to your star schemas and/or snowflakes and/or aggregation tables that are required."
"Don't start with star schemas," Politano added emphatically. "I'm tired of going in and cleaning up the mess for people."
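Politano's progression can be made concrete with a toy example. The DDL below is a hypothetical sketch of the end of that pipeline: a normalized relational model comes first, and only afterward is it converted into a star schema (one fact table surrounded by denormalized dimension tables). The table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Phases 1-2 (logical model, then physical model): a normalized
# relational design is worked out first.
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        product_id  INTEGER REFERENCES product(product_id),
        sale_date   TEXT,
        amount      REAL
    );
""")

# Phase 3: only now is the design converted to a star schema --
# a central fact table keyed to denormalized dimension tables.
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY,
                               sale_date TEXT, month TEXT, year INTEGER);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        amount       REAL
    );
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Starting from the star schema alone, as Politano complains, skips the first two phases and leaves no record of the business entities and relationships the dimensional design is supposed to serve.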
Business and IT alignment essential
Greenbaum added that it's important for companies to figure out precisely what they're trying to analyze before dumping all sorts of unnecessary data into a data warehouse. He said this process involves close interaction between business and IT and will help both sides determine exactly what is needed in a data warehouse and what isn't.
"Put the horse before the cart," Greenbaum said. "Know what you want before you build the super-sanitary landfill or boil-the-ocean data warehouse, because that is going to be a resource-sucking monster."
Raymond Karrenbauer, chief IT architect for ING Worldwide and a featured keynote speaker at the TDWI conference, agreed that business and IT have to decide exactly how to define data to ensure accurate reporting.
Business and IT must decide at which point a sale is categorized as a sale, or exactly when a particular month of marketing initiatives comes to a close. A company that counts IBM as a client, for example, must also decide whether that client will be known in its system as IBM or International Business Machines, because if the names don't sync up, the historical data and analyses related to that client are likely to be flawed.
"You've got to get everyone agreeing to the definitions early, because even if you've captured the data right and you've done the right technical quality transformation processes, inevitably the user base that is reading this thing is going to come back and produce [incorrect] reports," Karrenbauer said.
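One common way to enforce the kind of agreed-upon definition Karrenbauer describes is a canonical-name lookup applied before data is aggregated. This is a minimal sketch, assuming a hypothetical alias table that business and IT have signed off on; the names and figures are invented:

```python
# Hypothetical alias table agreed on by business and IT up front, so
# "IBM" and "International Business Machines" roll up to one client.
CANONICAL = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "i.b.m.": "IBM",
}

def canonical_client(raw_name: str) -> str:
    """Return the agreed canonical name, falling back to the cleaned input."""
    key = raw_name.strip().lower()
    return CANONICAL.get(key, raw_name.strip())

# Without canonicalization these would report as three separate clients.
sales = [("International Business Machines", 1200.0),
         ("IBM", 800.0),
         ("Initech", 300.0)]

totals = {}
for name, amount in sales:
    client = canonical_client(name)
    totals[client] = totals.get(client, 0.0) + amount

print(totals)   # -> {'IBM': 2000.0, 'Initech': 300.0}
```

The mechanics are trivial; Karrenbauer's point is that the mapping itself must be agreed early, or every downstream report silently splits one client's history in two.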
Beware of frequent patching
The rate at which database and other IT vendors release patches for their systems has increased greatly in recent years, and this can potentially lead to data warehousing problems, Karrenbauer warned.
Whereas major patch releases used to come out with new versions of software, today vendors are issuing patches as often as every two weeks, Karrenbauer said. As a result, his company is starting to see mismatches between client software and server software.
"There has got to be a better mechanism or more automated testing brought into the environment so that a lot of time isn't wasted every time there is a new fix, patch or some kind of issue," he said.
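One form the automated testing Karrenbauer calls for could take is a compatibility gate that refuses to roll a patch until the resulting client/server pairing has been tested together. The version strings and tested-combination set below are hypothetical, purely to illustrate the idea:

```python
# Hypothetical record of client/server version pairs that passed testing.
# A server patched two weeks after the client creates a new pair that
# must be validated before rollout.
TESTED_COMBINATIONS = {
    ("client-2.4.1", "server-2.4.1"),
    ("client-2.4.1", "server-2.4.2"),   # server-side patch, retested
}

def patch_is_safe(client_ver: str, server_ver: str) -> bool:
    """True only if this exact client/server pairing has been tested."""
    return (client_ver, server_ver) in TESTED_COMBINATIONS

print(patch_is_safe("client-2.4.1", "server-2.4.2"))   # -> True
print(patch_is_safe("client-2.4.1", "server-2.5.0"))   # -> False
```

A check this simple, run automatically on every patch, is exactly the kind of mechanism that would catch the client/server mismatches ING is starting to see before they reach production.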
Potential operational data store problems
Firms that opt to follow Greenbaum's suggestion and build a small operational data store for simple analyses and reporting can learn a great deal from TDWI conference attendee William Foster, a senior business data strategist with IBM who works on the company's internal information management projects, specifically those related to customer sales and order status.
Foster said he's been working on IBM's operational data store initiative for about five years, and while the system has made great progress, it is still far from perfect, thanks mainly to what he calls the prevalence of "non-normalized" data.
"We're still discovering non-normalized information, meaning that we thought we knew what data was when it was coming in, but it apparently wasn't researched well enough," Foster said. "We didn't understand what this particular value meant or what that particular field was, but rather than leave it out, we just brought it in and called it something and hoped that we would figure it out later."
As a result, IBM is getting lots of nulls in the database because disparate sources of information were not reconciled in all cases.
The lesson learned?
"Those nulls much of the time represented an opportunity where we could have found a home and normalized the field consistently across the whole enterprise where it would have made more sense," Foster said.