Data preparation is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of business intelligence (BI) and business analytics (BA) programs. The components of data preparation include data discovery, profiling, cleansing, validation and transformation; it often also involves pulling together data from different internal systems and external sources.
Data preparation work is done by information technology (IT) and BI teams as they integrate data sets for loading into a data warehouse, NoSQL database or Hadoop data lake repository. In addition, data analysts can use self-service data preparation tools to collect and prepare data for analysis themselves.
One of the primary purposes of data preparation is ensuring that information being readied for analysis is accurate and consistent, so the results of BI and analytics applications will be valid. Data is often created with missing values, inaccuracies or other errors. Additionally, data sets stored in separate files or databases often have different formats that need to be reconciled. The process of correcting inaccuracies and joining data sets constitutes a big part of the data preparation process.
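As a rough illustration of reconciling formats before a join, the sketch below normalizes records from two hypothetical sources and then joins them on a shared key. The field names (`customer_id`, `CustomerID`, `signup_date`, `balance`), the sample rows and the helper functions are assumptions made for this example, not part of any particular data preparation product.

```python
from datetime import datetime

# Hypothetical records from two separate systems with differing formats.
crm_rows = [
    {"customer_id": "C001", "signup": "2021-03-15"},
    {"customer_id": "C002", "signup": "2021-04-02"},
]
billing_rows = [
    {"CustomerID": "c001 ", "signup_date": "03/15/2021", "balance": "120.50"},
    {"CustomerID": "C003", "signup_date": "05/20/2021", "balance": "75.00"},
]


def normalize_billing(row):
    """Map a billing row onto the CRM layout: same key name, ID casing and ISO dates."""
    return {
        "customer_id": row["CustomerID"].strip().upper(),
        "signup": datetime.strptime(row["signup_date"], "%m/%d/%Y").date().isoformat(),
        "balance": float(row["balance"]),
    }


def join_on_customer(crm, billing):
    """Inner-join the two sources on customer_id after normalizing the billing rows."""
    billing_by_id = {r["customer_id"]: r for r in map(normalize_billing, billing)}
    joined = []
    for row in crm:
        match = billing_by_id.get(row["customer_id"])
        if match is not None:
            # Keep the CRM fields and attach the reconciled balance.
            joined.append({**row, "balance": match["balance"]})
    return joined


joined_rows = join_on_customer(crm_rows, billing_rows)
```

Only customer C001 appears in both sources, so the join yields a single row. Without the normalization step, the stray whitespace and lowercase ID in the billing record would cause that match to be missed.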
In big data applications, data preparation is largely an automated task, since it could take years of work by IT staffers or data analysts to manually correct every field in every file that's due to be used in an analysis. Algorithms can speed things up by examining data fields and automatically filling in blank values or renaming certain fields to ensure consistency when data files are being joined.
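The kind of automated fix described above might look something like the following minimal sketch. It renames source-specific fields to a shared name and fills blank numeric values with the median of the known values. The alias table, the field names and the choice of the median as the fill value are all assumptions for illustration; real tools may use very different rules.

```python
from statistics import median

# Hypothetical mapping from source-specific field names to a shared schema.
FIELD_ALIASES = {"cust_id": "customer_id", "CustomerID": "customer_id", "amt": "amount"}


def rename_fields(row):
    """Rename known aliases; leave unrecognized field names unchanged."""
    return {FIELD_ALIASES.get(key, key): value for key, value in row.items()}


def fill_blanks(rows, field):
    """Replace None or empty values in `field` with the median of the known values.

    Assumes at least one row has a known value; median() raises otherwise.
    """
    known = [r[field] for r in rows if r.get(field) not in (None, "")]
    fallback = median(known)
    return [
        {**r, field: fallback if r.get(field) in (None, "") else r[field]}
        for r in rows
    ]


raw = [
    {"cust_id": "A", "amt": 10.0},
    {"cust_id": "B", "amt": None},
    {"cust_id": "C", "amt": 30.0},
]
prepared = fill_blanks([rename_fields(r) for r in raw], "amount")
```

Here the blank amount for customer B is filled with 20.0, the median of the two known values, and every row ends up with the same `customer_id` and `amount` field names, ready to be joined with other files.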
After data has been validated and reconciled, data preparation software runs the files through a workflow that applies specific operations to them. For example, this step may involve creating a new field in the data file that aggregates counts from preexisting fields, or applying a statistical formula -- such as a linear or logistic regression model -- to the data. After going through the workflow, the data is output into a finalized file that can be loaded into a database or other data store, where it is available to be analyzed.
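A workflow of this sort can be sketched in a few steps: derive a new field that sums existing count fields, fit a simple linear regression, and write the result out in a loadable format. The example below is a simplified sketch using only the Python standard library; the field names, sample data and helper functions are hypothetical, and the output is written to an in-memory CSV buffer rather than a real file.

```python
import csv
import io


def add_total(rows, count_fields, total_field="total_count"):
    """Create a new field that aggregates counts from existing fields."""
    return [{**r, total_field: sum(r[f] for f in count_fields)} for r in rows]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly order counts from two sales channels.
rows = [
    {"week": 1, "web_orders": 3, "store_orders": 2},
    {"week": 2, "web_orders": 4, "store_orders": 3},
    {"week": 3, "web_orders": 6, "store_orders": 3},
]

# Step 1: aggregate the per-channel counts into one derived field.
with_totals = add_total(rows, ["web_orders", "store_orders"])

# Step 2: fit a linear trend of total orders against the week number.
slope, intercept = fit_line(
    [r["week"] for r in with_totals],
    [r["total_count"] for r in with_totals],
)
final_rows = [{**r, "fitted": slope * r["week"] + intercept} for r in with_totals]

# Step 3: write the finalized data as CSV (an in-memory buffer stands in for a file).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(final_rows[0].keys()))
writer.writeheader()
writer.writerows(final_rows)
```

In practice the buffer would be replaced by a file or a database load step, and the regression would more likely come from a statistics library than from a hand-written fit.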
Even though data preparation has become highly automated, it can still take up significant amounts of time -- especially as the volume of data used in analyses continues to grow. Data scientists often complain that they spend a majority of their time locating and cleansing data rather than actually analyzing it. Partly for that reason, there has been an increase in the number of software vendors attempting to tackle the data preparation problem, and many organizations are putting more resources toward automating the process of preparing data.