Data exploration is the first step in data analysis. It typically involves summarizing the main characteristics of a data set, such as its size, its accuracy and any initial patterns in the data. Data analysts commonly carry it out with visual analytics tools, but it can also be done in more advanced statistical software, such as R.
The role of data exploration
Before an organization can analyze data collected from multiple sources and stored in data warehouses, it must know how many cases a data set contains, which variables it includes, how many values are missing and what general hypotheses the data is likely to support. An initial exploration of the data set helps answer these questions by familiarizing analysts with the data they are working with.
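As a minimal sketch of that first pass, the following Python script counts the cases, lists the variables and tallies the missing values per variable. The sample data, the column names and the `profile` helper are all hypothetical and exist only for illustration. An empty field is assumed to mean a missing value.

```python
import csv
import io

# Hypothetical sample data; an empty field stands for a missing value.
SAMPLE_CSV = """customer_id,age,region,spend
1,34,north,120.5
2,,south,80.0
3,45,,
4,29,north,42.25
"""


def profile(csv_text):
    """Return the case count, variable names and missing-value count per variable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    # A value counts as missing when the field is absent, empty or whitespace only.
    missing = {
        col: sum(1 for row in rows if (row.get(col) or "").strip() == "")
        for col in columns
    }
    return {"cases": len(rows), "variables": columns, "missing": missing}


summary = profile(SAMPLE_CSV)
print(summary)
```

A real data set would usually be read from a file or a warehouse query rather than an inline string, and it may encode missing values differently (for example as "NA" or a sentinel number), so the missing-value test would need to be adapted.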
Once data exploration has uncovered the relationships between the different variables, organizations can continue the data mining process by creating and deploying data models to take action on the insights gained.
Data exploration methods
Companies can conduct data exploration via a combination of automated and manual methods.
Analysts commonly use automated tools such as data visualization software for data exploration because these tools allow users to quickly and simply view most of the relevant features of a data set. From this step, users can identify variables that are likely to have interesting observations.
By displaying data graphically -- for example, through scatter plots, density plots or bar charts -- users can see whether two or more variables correlate and decide whether they are good candidates for further analysis, which may include:
- Univariate analysis: The analysis of one variable.
- Bivariate analysis: The analysis of two variables to determine their relationship.
- Multivariate analysis: The analysis of more than two variables at once, including multiple outcome variables.
- Principal components analysis: The analysis and conversion of possibly correlated variables into a smaller number of uncorrelated variables.
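The first two kinds of analysis in the list above can be sketched with Python's standard library alone. The variables `ad_spend` and `revenue`, their values and the helper names `describe` and `pearson` are hypothetical. In practice an analyst would more likely use a statistics package or a visualization tool for this.

```python
import math
import statistics

# Hypothetical observations for two numeric variables.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [25.0, 41.0, 62.0, 79.0, 103.0]


def describe(values):
    """Univariate summary: count, mean, median, sample standard deviation, range."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Bivariate summary: Pearson correlation coefficient between xs and ys."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


spend_summary = describe(ad_spend)
r = pearson(ad_spend, revenue)
print(spend_summary)
print(f"correlation between ad_spend and revenue: {r:.3f}")
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth examining further, while a value near 0 suggests little linear association. It does not by itself show that one variable causes the other.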
Manual data exploration methods may include filtering and drilling down into data in Excel spreadsheets or writing scripts to analyze raw data sets.
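A script-based version of filtering and drilling down might look like the sketch below. The `orders` records, their field names and the helpers `filter_rows` and `total_by` are hypothetical. The example totals one measure by region, then drills into a single region by product.

```python
from collections import defaultdict

# Hypothetical raw records, as they might be read from a spreadsheet export.
orders = [
    {"region": "north", "product": "widget", "amount": 120.0},
    {"region": "north", "product": "gadget", "amount": 80.0},
    {"region": "south", "product": "widget", "amount": 200.0},
    {"region": "north", "product": "widget", "amount": 30.0},
    {"region": "south", "product": "gadget", "amount": 15.0},
]


def filter_rows(rows, **criteria):
    """Keep only the rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


def total_by(rows, key, value="amount"):
    """Sum the `value` field for each distinct value of the `key` field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Top-level view: total amount per region.
by_region = total_by(orders, "region")
# Drill down: within the north region only, total amount per product.
north_by_product = total_by(filter_rows(orders, region="north"), "product")

print(by_region)
print(north_by_product)
```

This mirrors what a spreadsheet filter followed by a pivot table would do, with the advantage that the steps are recorded and repeatable.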
After data exploration is complete, analysts can move on to the data discovery phase to answer specific questions about a business issue. The data discovery process involves using business intelligence tools to examine trends, sequences and events, and creating visualizations to present to business leaders.
Data exploration tools and vendors
Analysts can explore data using features in business intelligence tools and data visualization software, such as Microsoft Power BI, Qlik and Tableau, as well as on data platforms such as MapR. Data profiling and preparation software from vendors including Trifacta and Paxata can help organizations blend disparate data sources to enable faster data exploration by analysts. There are also free, open source data exploration tools, such as MIT's DIVE, which include visualization features and regression capabilities.