A buyer's guide to selecting the right big data analytics software
A collection of articles that takes you from defining technology needs to purchasing options
The analytics process, including the deployment and use of big data analytics tools, can help companies improve operational efficiency, drive new revenue and gain competitive advantages over business rivals. But there are different types of analytics applications to consider. For example, descriptive analytics focuses on describing something that has already happened, as well as suggesting its root causes. Descriptive analytics, which remains the lion's share of the analysis performed, typically hinges on basic querying, reporting and visualization of historical data.
Alternatively, more complex predictive and prescriptive modeling can help companies anticipate business opportunities and make decisions that affect profits in areas such as targeting marketing campaigns, reducing customer churn and avoiding equipment failures. With predictive analytics, historical data sets are mined for patterns indicative of future situations and behaviors, while prescriptive analytics subsumes the results of predictive analytics to suggest actions that will best take advantage of the predicted scenarios.
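As a rough illustration of how the three types of analytics differ, the following Python sketch works through a small, made-up monthly sales series. The figures, the straight-line trend model, the `recommend_order` helper and its 10% safety margin are all assumptions chosen for illustration; commercial tools apply far richer models to far larger data sets.

```python
from statistics import mean

# Hypothetical monthly unit sales (illustrative data only).
sales = [100, 110, 120, 130, 140, 150]
months = list(range(1, len(sales) + 1))

# Descriptive: summarize what has already happened.
average_sales = mean(sales)


# Predictive: fit a least-squares trend line to the history.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(months, sales)
next_month = len(sales) + 1
forecast = slope * next_month + intercept


# Prescriptive: turn the forecast into a suggested action, here an order
# quantity padded by an assumed safety margin.
def recommend_order(predicted_demand, on_hand, safety_margin=0.10):
    target = predicted_demand * (1 + safety_margin)
    return max(0, round(target - on_hand))


print(f"average monthly sales so far: {average_sales}")
print(f"forecast for month {next_month}: {forecast}")
print(f"suggested order quantity: {recommend_order(forecast, on_hand=50)}")
```

The descriptive step only reports on the past, the predictive step extrapolates from it, and the prescriptive step consumes the prediction to propose a decision.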
In many environments, the processing and data storage demands of advanced analytics applications have limited their adoption -- but those barriers are beginning to fall. The growing availability of big data platforms and big data analytics tools has enabled environments in which predictive and prescriptive analytics applications can scale to handle massive data volumes originating from a wide variety of sources.
What does big data analytics mean?
In essence, big data analytics tools are software products that support predictive and prescriptive analytics applications running on big data computing platforms -- typically, parallel processing systems based on clusters of commodity servers, scalable distributed storage and technologies such as Hadoop and NoSQL databases. The tools are designed to enable users to rapidly analyze large amounts of data, often within a real-time window.
In addition, big data analytics tools provide the framework for using data mining techniques to analyze data, discover patterns, propose analytical models to recognize and react to identified patterns, and then enhance the performance of business processes by embedding the analytical models within the corresponding operational applications. For example, massive amounts of shipping delivery data, streaming traffic data, streaming weather data and historical vendor performance data can be analyzed to devise a model for optimal selection of shipping subcontractors within geographic regions to limit the risks of late delivery or damaged goods.
Big data analytics tools can ingest a wide variety of data types: structured data with defined and consistent fields, such as transaction data stored in relational databases; semi-structured data, such as Web server or mobile application log files; and unstructured data, encompassing things like text files, documents, emails, text messages and social media posts.
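To make the three categories concrete, here is a minimal Python sketch that reads one sample of each. The order record, the log line (written in the common Web server log layout) and the social media post are invented for illustration, and the parsing pattern is an assumption about that one layout rather than a general-purpose parser.

```python
import csv
import io
import re
from collections import Counter

# Structured: fixed, named fields, as in an export from a relational table.
structured = io.StringIO("order_id,customer,amount\n1001,acme,250.00\n")
orders = list(csv.DictReader(structured))

# Semi-structured: a Web server log line. The layout is regular enough to
# parse with a pattern, but there is no enforced schema behind it.
log_line = (
    '203.0.113.7 - - [10/Oct/2015:13:55:36 +0000] '
    '"GET /cart HTTP/1.1" 200 2326'
)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
match = LOG_PATTERN.match(log_line)
hit = match.groupdict() if match else {}

# Unstructured: free text with no fields at all. Counting words is one of
# the simplest ways to derive something analyzable from it.
post = "Late delivery again. Delivery was late and the box was damaged."
word_counts = Counter(re.findall(r"[a-z']+", post.lower()))

print(orders[0]["customer"], orders[0]["amount"])
print(hit.get("path"), hit.get("status"))
print(word_counts.most_common(3))
```

The further the data is from a fixed schema, the more work the tool has to do before any analysis can start, which is why support for all three categories is worth checking during evaluation.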
Powering analytics: Inside big data and advanced analytics tools
A Google search for big data analytics yields a long list of vendors. However, many of these vendors provide big data platforms and tools that support the analytics process without performing the analysis itself -- for example, data integration, data preparation and other types of data management software. We focus instead on tools that meet the following criteria:
- They provide the analyst with advanced analytics algorithms and models.
- They're engineered to run on big data platforms such as Hadoop or specialty high-performance analytics systems.
- They're easily adaptable to use structured and unstructured data from multiple sources.
- Their performance is capable of scaling as more data is incorporated into analytical models.
- Their analytical models can be or already are integrated with data visualization and presentation tools.
- They can easily be integrated with other technologies.
In addition, the tools must incorporate essential characteristics and include integrated algorithms and methods supporting the typical suite of data mining techniques, including (but not limited to):
- Clustering and segmentation, which divides a large collection of entities into smaller groups that exhibit some (potentially unanticipated) similarities. An example is analyzing a collection of customers to differentiate smaller segments for targeted marketing.
- Classification, which is a process of organizing data into predefined classes based on attributes that are either pre-selected by an analyst or identified as a result of a clustering model. An example is using the segmentation model to determine into which segment a new customer would be categorized.
- Regression, which is used to discover relationships between a dependent variable and one or more independent variables, and helps determine how the dependent variable's values change in relation to the independent variable values. An example is using geographic location, mean income, average summer temperature and square footage to predict the future value of a property.
- Association and item set mining, which looks for statistically relevant relationships among variables in a large data set. For example, this could help direct call-center representatives to offer specific incentives based on the caller's customer segment, duration of relationship and type of complaint.
- Similarity and correlation, which is used to inform undirected clustering algorithms. Similarity-scoring algorithms can be used to determine the similarity of entities placed in a candidate cluster.
- Neural networks, which are used in undirected analysis for machine learning based on adaptive weighting and approximation.
This is just a subset of the types of analyses used for predictive and prescriptive analytics. In addition, different vendors are likely to provide a variety of algorithms supporting each of the different methods.
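The first two techniques in the list, clustering followed by classification, can be sketched in a few lines of Python. This is a plain k-means implementation written for illustration, not the algorithm of any particular vendor; the customer figures, the choice of two segments and the seed points are all assumptions.

```python
import math

# Hypothetical customers as (annual spend in dollars, orders per year).
customers = [
    (200, 2), (250, 3), (220, 2),        # low spend, infrequent orders
    (1500, 20), (1600, 22), (1550, 19),  # high spend, frequent orders
]


def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids


# Clustering: discover two segments, seeded with two of the customers.
centroids = k_means(customers, [customers[0], customers[3]])

# Classification: place a new customer into one of the discovered segments.
new_customer = (1400, 18)
segment = nearest(new_customer, centroids)

print("segment centers:", centroids)
print("new customer falls in segment", segment)
```

Because the two attributes are on very different scales, spend dominates the distance calculation here. In practice the attributes would normally be rescaled first, and the number of segments would be chosen by evaluating several candidates rather than fixed in advance.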
The advanced analytics market
The market for advanced analytics tools has evolved over time, and the available tools vary in maturity and, consequently, in capability and ease of use. For example, some mega-vendors, such as IBM, Oracle and SAS, offer tools with relatively long histories. Other large vendors, such as Microsoft, Dell, Teradata and SAP, offer tools with a more recent history that they obtained by acquiring other companies.
A number of smaller companies provide big data analytics products, including Angoss, Predixion, Alteryx, Alpine Data Labs, Pentaho, KNIME and RapidMiner. In some cases, companies have developed their own suite of algorithms. Others have adapted the open source statistical R language and provide predictive and prescriptive modeling capabilities using R's features, or use the software from the open source Weka project.
A third category of products consists of those available as open source technologies. Examples include the previously mentioned R language, the Mahout software distribution that's part of the Hadoop stack, and Weka.
In some of these cases (particularly with the mega-vendors), the big data analytics tools are incorporated into larger big data enterprise suites. In others, the tools are sold as standalone products. In the latter case, it's the customer's job to integrate the tools with the big data platform being deployed. Most of the tools provide a visual interface to guide the analytics processes (data mining/discovery analysis, evaluation and scoring of models, integration with operational environments), and in most cases, the vendors provide guidance and services to get the customer up and running.
Who uses big data and advanced analytics tools?
While some individuals in the organization are looking to explore and devise new predictive models, others look to embed these models within their business processes, and still others will want to understand the overall impact that these tools will have on the business. In other words, organizations that are adopting big data analytics need to accommodate a variety of user types, such as:
- The data scientist, who likely performs more complex analyses involving more complex data types and is familiar with how underlying models are designed and implemented to assess inherent dependencies or biases.
- The business analyst, who is likely a more casual user looking to use the tools for proactive data discovery or visualization of existing information, as well as some predictive analytics.
- The business manager, who is looking to understand the models and conclusions.
- IT developers, who support all the prior categories of users.
All of these roles typically work together in the model development lifecycle. The data scientist subjects a swath of big data sets to the undirected analyses that the tools provide and looks for any patterns that would be of business interest. The business analyst then reviews how the models work and evaluates how each discovered model or pattern could benefit the business. Finally, the business manager and IT teams are brought in to embed or integrate the models into business processes or to devise new processes around the models.
From a market perspective, though, it's interesting to consider the types of businesses that are embracing big data analytics. Many of the early users of big data technologies were Internet companies (e.g., Google, Yahoo, Facebook, LinkedIn and Netflix) or analytics services providers. Each of these companies relied on operational and analytical applications that ingest, process and analyze fast-flowing streams of data, and then feed the results back to continuously improve performance.
As appetites for data expand among companies in more mainstream industries, big data analytics has found a place in a more general corporate population. In the past, the cost of a large-scale analytics platform would have limited adoption to only the very largest businesses. However, the availability of utility-style hosted big data platforms (such as those available via Amazon Web Services) and the ability to instantiate big data platforms such as Hadoop on-premises without a large investment have lowered the barrier to entry. In addition, open data sets and access to firehose data feeds from social media channels provide the raw material for larger-scale data analyses when blended with internal data sets.
Larger businesses may still opt for high-end big data analytics tools, but lower-cost alternatives deployed on cost-effective platforms enable small and medium-size businesses to evaluate and launch big data analytics programs and achieve the desired business improvement results.
Now that we've examined the different types of tools and their uses, the next step is to determine how these tools could benefit your company. By taking a look at the various use cases for big data analytics, you will begin to see where a general big data analytics capability can be leveraged for creating and enhancing value.