A buyer's guide to selecting the right big data analytics software
A collection of articles that takes you from defining technology needs to purchasing options
There are many vendors selling products classified as big data analytics software. However, it's challenging to differentiate these products based on functionality alone, as many of the tools share similar features and capabilities. Additionally, some of the tools exhibit extremely subtle differences. That being said, your key differentiating factors will likely focus on balancing ease of use, algorithmic sophistication and price in relation to your organization's capability and level of maturity in analytics.
In this article, we examine products from nine big data analytics software vendors: Alteryx, IBM, KNIME.com, Microsoft, Oracle, RapidMiner, SAP, SAS and Teradata. Some of them provide more than one tool (see the "Leading vendors of big data analytics software" sidebar below for more details about the specific product offerings). These vendors represent different facets of the big data analytics market. Based on the characteristics described in our earlier articles, let's compare and contrast the ways that the products are targeted to meet the business needs of user organizations.
Considerations for choosing the right big data analytics software: Analyst expertise and skills
Some of the tools are targeted to novice users, some are targeted to expert data analysts, and some are engineered to appeal to both types of users.
Products such as IBM SPSS Modeler, RapidMiner's tools, Oracle Advanced Analytics and the Automated Analytics version of SAP Predictive Analytics are generally designed to enable users who have little or no background in statistics or data analysis to analyze data, develop analytical models and design analytics workflows with little or no coding. While each vendor wraps the core analytics components with an intuitive user interface to guide the analyst's progress in data preparation, analysis and then model design and validation, the approach taken may differ, especially when comparing a standalone product (such as RapidMiner) with one that's a component of a larger suite (such as the Oracle product).
Tools such as IBM SPSS Statistics, KNIME Analytics Platform, the Expert Analytics module of SAP Predictive Analytics, Microsoft Revolution Analytics and the Teradata Aster Discovery Platform provide the more sophisticated functionality that expert users expect. Oracle R Advanced Analytics for Hadoop (ORAAH), which is one of the components in the Oracle Big Data Software Connectors Suite, provides an R interface for manipulating Hadoop Distributed File System data and writing mapper and reducer functions in R. This flexibility may be appealing to more advanced data scientists.
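For readers unfamiliar with the mapper and reducer functions mentioned above, the following is a minimal sketch of the general pattern in plain Python. It is not ORAAH's R interface and does not run on Hadoop; the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are illustrative assumptions chosen only to show how the two kinds of functions divide the work.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group mapper output by key; a real framework does this step itself."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory data set."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data analytics", "big data tools"]
    print(run_job(sample))
```

In a distributed framework, many copies of the mapper and reducer run in parallel on different slices of the data, which is why this style of programming tends to appeal to analysts who are comfortable writing code.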
Alteryx and SAS Enterprise Miner offer functionality adapted to the user's level of expertise, and essentially fall into both categories. Overall, SAS Enterprise Miner and IBM's SPSS tools stand out when it comes to supporting more advanced analytical techniques and model scoring, as well as a broader array of analysis functions including neural networks, association analysis and visualization capabilities.
Depending on the use case and application, your organization's users will be required to support different types of analytics capabilities that will use specific types of modeling (e.g., regression, clustering, segmentation, behavior modeling and decision trees). While that has resulted in broad support for the various forms of analytical modeling at a high level, some vendors have invested decades of work into tweaking different versions of their algorithms and adding more sophisticated functionality. It's important to understand which models are most relevant to your business problem and evaluate the products in terms of how they best serve your users' business needs.
The more mature and higher-end (and, accordingly, higher-priced) tools will exhibit the greatest analytical breadth. Oracle Data Miner includes an array of well-known machine learning approaches to support clustering, predictive mining and text mining. Both editions of IBM's SPSS products provide a diverse set of analytical techniques and models. And SAS Enterprise Miner supports many algorithms and techniques, including decision trees, time series, neural networks, linear and logistic regression, sequence and Web path analysis, market basket analysis and link analysis.
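To make one of the listed techniques concrete, here is a minimal sketch of the idea behind market basket analysis, written in plain Python. It does not reflect how SAS Enterprise Miner or any other product implements the technique; the function name `pair_rules`, the `min_support` threshold and the sample baskets are assumptions made for illustration. The sketch counts how often pairs of items appear together (support) and how often one item accompanies another (confidence).

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pair_rules(
    transactions: Sequence[Sequence[str]], min_support: float = 0.5
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    total = len(transactions)
    item_counts: Counter = Counter()
    pair_counts: Counter = Counter()

    for basket in transactions:
        items: List[str] = sorted(set(basket))  # ignore duplicates in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for (a, b), count in pair_counts.items():
        support = count / total  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence: of the baskets with the antecedent, the share that
        # also contain the consequent.
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for rule, (support, confidence) in sorted(pair_rules(baskets).items()):
        print(rule, round(support, 2), round(confidence, 2))
```

Commercial tools differ mainly in how far they go beyond this basic calculation, for example by handling larger item sets and very large transaction volumes efficiently.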
Leading vendors of big data analytics software
- Alteryx, which consists of a Designer module for designing analytics applications, a Server component for scaling across the organization and an Analytics Gallery for sharing applications with external partners.
- IBM, which provides SPSS Modeler, a tool targeted to users with little or no analytical background. IBM also has SPSS Statistics, which is geared toward more sophisticated analysts.
- KNIME, an open source product commercialized by software vendor KNIME.com that includes an analytics platform and a number of commercial extensions for big data, cluster operations and collaboration.
- Microsoft Revolution Analytics, which spans two products -- Revolution R Open, a free download that's an enhanced version of the R programming language, and Revolution R Enterprise, which supports the use of R in clustered environments (like Hadoop).
- Oracle Advanced Analytics, which includes Oracle Data Miner, Oracle R Advanced Analytics for Hadoop and Oracle Big Data Discovery, as well as connectors and interfaces for SQL and R.
- RapidMiner, which provides a Studio component for design, a Server component, a Hadoop connector called Radoop and a component for stream processing.
- SAP Predictive Analytics, which comprises two versions, Automated Analytics (for business users without a formal background) and Expert Analytics (targeted to professional data analysts and data scientists).
- SAS Enterprise Miner, which is intended to help users quickly develop descriptive and predictive models, including components for predictive modeling and in-database scoring.
- The Teradata Aster Discovery Platform, which is a framework offered by Teradata with its Aster database, Discovery Portfolio with built-in analytics functions, a graph processing engine, MapReduce and a version of R.
The newer generation (and, in some cases, lower-priced) products support different models, but perhaps with a narrower range of algorithmic sophistication. The model inventory in Alteryx Analytics Gallery includes such capabilities as regression analysis, decision trees, association rule analysis, classification and time series analysis. KNIME includes methods for text mining, image mining and time series analysis, and also integrates machine learning algorithms from other open source projects, including Weka, R and JFreeChart.
Another aspect of analytical diversity is integration with programming languages and statistical tools, such as R, for integrating existing libraries as well as user-defined functionality. In fact, integration with R could be considered an increasingly critical differentiator. Alteryx Designer, Microsoft's Revolution Analytics, SAS Enterprise Miner, the Teradata Aster Discovery Platform, ORAAH from Oracle and KNIME's Analytics Platform all interface with R and support integration with it.
Scope of the data to be analyzed
There are multiple facets of the scope of the data to be analyzed, including the issue of structured vs. unstructured information as well as access to conventional on-premises databases and data warehouses, cloud-based data sources, and data managed in big data platforms such as Hadoop. The products vary, however, in their degree of support for data managed within less-conventional data lakes (either managed within Hadoop, or in other NoSQL data management systems intended to provide horizontal scaling). The factors for distinguishing among the products must be based on your organization's specific requirements for accessing and processing data volumes and data variety.
Support for scalability and high performance
The need for scalable performance is driven by your organization's data volumes and appetite for analysis. Smaller organizations with less data may be able to tolerate products that don't have performance characteristics that scale with the available resources, such as the entry-level versions of the lower-end tools (RapidMiner, KNIME, Microsoft Revolution R Open, Alteryx Designer), which can run on desktop systems and don't require additional server components.
Larger organizations are more likely to have a greater inventory of data sets to analyze, as well as broader communities of users. This introduces two additional requirements -- high performance and facilitation of collaboration. Adaptability of the product to high-performance architectures is a good indication of scalability, and most of the products can be adapted to the parallelism of Hadoop or employ some other means of achieving faster computation.
All of the products have some support for Hadoop, including IBM SPSS Modeler and SPSS Statistics, RapidMiner's commercial component Radoop (which connects the Studio front end and Server analysis engine to data stored on Hadoop), Oracle's Big Data Discovery and ORAAH tools, and KNIME's Big Data Extension and Cluster Execution add-ins. The Teradata Aster Discovery Platform addresses high-performance requirements through Teradata's MPP architecture. SAP's Expert Analytics version of SAP Predictive Analytics can execute in-memory data mining for handling large-volume data analysis efficiently. Microsoft's Revolution R Enterprise leverages Revolution Analytics' ScaleR module, a comprehensive library of big data analytics algorithms that support parallelization. Scoring algorithms implemented using SAS Enterprise Miner can be deployed and executed within a Hadoop environment.
As noted, the larger the organization, the more likely there will be a need to share analyses, models and applications across different groups and among many analysts. Organizations that have many analysts distributed across the enterprise may look for more ways to share models and collaborate on interpreting the results. IBM's SPSS Modeler Gold edition provides collaboration capabilities, and RapidMiner's Server product provides support for sharing and collaboration. Alteryx Analytics Gallery provides a mechanism for sharing sophisticated analytics applications in the cloud with members of an extended organization. KNIME offers commercial extensions to support team collaboration. SAS Enterprise Miner's client-server architecture enables business users and data analysts to work collaboratively by sharing models and other work products.
Vendor size and product integration
Vendors can be compared in terms of their size. One might contrast what could be referred to as the mega-vendors, whose big data analytics tools are just one set of products among a massive portfolio of tools, with smaller vendors that focus on a small number of analytics products. If you work for a larger organization that typically negotiates site-wide, enterprise licenses for the full suite of a vendor's tools, a mega-vendor such as IBM, SAS, SAP or Oracle may be a reasonable choice.
The large vendors sell big data analytics tools that are a part of a much larger tool ecosystem. Presumably, the products from a single mega-vendor have all been at least somewhat integrated and are intended to work together. In addition, some people feel more comfortable with bigger vendors, with an expectation of stability and consistent customer service. On the other hand, you may be only able to acquire these big data analytics tools as part of a much larger software licensing arrangement.
Smaller vendors, such as KNIME, Alteryx and RapidMiner, have revenues that are generally based on licensing and support for a small number of big data analytics products. A smaller vendor may provide closer contact with their product management and innovation teams, and you may be able to influence the direction of the product roadmap or enhanced functionality. A smaller vendor might also be more flexible in terms of price and features included in the licensing arrangement. Realize, however, that working with a smaller vendor does present some risk in terms of stability, the resources available for support and the possibility that the company may be acquired, impacting the customer relationship.
Budget for licensing and maintenance
Almost all of the vendors sell different versions or editions, with a range of costs for acquisition and total cost of operation. IBM, Oracle, RapidMiner, Teradata and Microsoft sell editions at different tiers, with the license cost proportional to the features, capabilities and freedom from limitations in terms of the volumes of data to be analyzed or the number of processing nodes the product can use. KNIME and RapidMiner provide free and/or open source versions of their products, either charging for support services or for editions supporting enterprise-class applications. KNIME, RapidMiner and Alteryx have relatively low licensing costs for a smaller number of users. If you're considering SAS or SAP, you must contact them for pricing alternatives.