A buyer's guide to selecting the right big data analytics software
A collection of articles that takes you from defining technology needs to purchasing options
Many vendors sell products classified as big data analytics software. However, it's challenging to differentiate these products on functionality alone: many of the tools share similar features and capabilities, and the differences between some of them are extremely subtle.
That being said, your key differentiating factors will likely focus on balancing ease of use, algorithmic sophistication and price in relation to your organization's capability and level of maturity in analytics.
In this article, we examine products from nine big data analytics software vendors: Alteryx Inc., IBM, KNIME AG, Microsoft, Oracle, RapidMiner Inc., SAP, SAS Institute Inc. and Teradata Corp. Some of these vendors provide more than one tool. See the "Leading vendors of big data analytics software" sidebar below for more details about their specific product offerings.
These vendors represent different facets of the big data analytics market. Let's compare and contrast the ways that these products meet the business needs of user organizations.
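One way to make the balance among ease of use, algorithmic sophistication and price concrete is a weighted scoring matrix. The following is a minimal sketch, not a prescribed evaluation method: the `Candidate` class, the "Tool A" and "Tool B" names, the 1-to-5 ratings and the weights are all hypothetical placeholders, and none of them reflects an assessment of any product discussed in this article.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Candidate:
    """A product under evaluation; every rating is on a 1-5 scale (5 is best)."""
    name: str
    ease_of_use: float
    algorithmic_sophistication: float
    price: float  # 5 means most affordable


def weighted_score(candidate: Candidate, weights: Mapping[str, float]) -> float:
    """Return the weight-normalized average of the candidate's ratings."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(getattr(candidate, criterion) * weight
                       for criterion, weight in weights.items())
    return weighted_sum / total_weight


def rank(candidates: Sequence[Candidate], weights: Mapping[str, float]):
    """Sort candidates from best to worst fit for the given weights."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Hypothetical ratings, not assessments of any real product.
candidates = [
    Candidate("Tool A", ease_of_use=5, algorithmic_sophistication=2, price=4),
    Candidate("Tool B", ease_of_use=2, algorithmic_sophistication=5, price=2),
]

# A team that is new to analytics might weight ease of use and price heavily...
novice_weights = {"ease_of_use": 5, "algorithmic_sophistication": 1, "price": 4}
# ...while a mature data science group might prioritize sophistication.
expert_weights = {"ease_of_use": 1, "algorithmic_sophistication": 7, "price": 2}

for label, weights in (("novice", novice_weights), ("expert", expert_weights)):
    ordered = rank(candidates, weights)
    print(label, [(c.name, round(weighted_score(c, weights), 2)) for c in ordered])
```

With these placeholder numbers, the two weight profiles rank the same two tools in opposite order, which is the point: the ranking depends on your organization's capability and maturity as much as on the products themselves.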
Analyst expertise and skills
Some data analytics tools are targeted to novice users, some are targeted to expert data analysts and some are engineered to appeal to both types of users.
Products such as IBM SPSS Modeler, RapidMiner's tools, Oracle Advanced Analytics and the Automated Analytics version of SAP BusinessObjects Predictive Analytics are generally designed to enable users with a limited background in statistics or data analysis to analyze data, develop analytical models and design analytics workflows with little or no coding.
While each vendor wraps its core analytics components with an intuitive user interface to guide the analyst's progress in data preparation, analysis, and then model design and validation, the approach taken may differ, especially when comparing a stand-alone product, such as RapidMiner, with one that's a component of a larger suite, such as the Oracle product.
Tools such as IBM SPSS Statistics, KNIME Analytics Platform, the Expert Analytics module of SAP BusinessObjects Predictive Analytics, Microsoft R and the Teradata Aster Analytics platform provide the more sophisticated functionality that expert users expect. Oracle R Advanced Analytics for Hadoop (ORAAH), one of the components in the Oracle Big Data Software Connectors suite, provides an R interface for manipulating Hadoop Distributed File System data and writing mapper and reducer functions in R. This flexibility may be appealing to more advanced data scientists.
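For readers unfamiliar with the mapper and reducer pattern mentioned above, the following sketch shows the general idea in plain Python. It is an illustration of the programming model only: it doesn't use R, Hadoop or the ORAAH API, and the function names and the sample sales records are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def sales_mapper(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: turn one 'region,amount' line into a (key, value) pair."""
    region, amount = record.split(",")
    yield region.strip(), int(amount)


def sum_reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input lines standing in for rows of a file stored in HDFS.
records = ["east,100", "west,75", "east,50"]
totals = run_map_reduce(records, sales_mapper, sum_reducer)
print(totals)  # {'east': 150, 'west': 75}
```

On a real cluster, the framework runs many mapper and reducer instances in parallel and handles the grouping step across machines; the analyst supplies only the two functions.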
Alteryx and SAS Enterprise Miner offer functionality adapted to the user's level of expertise, and essentially fall into both categories. Alteryx has added improvements to data profiling to help data scientists better understand their data sources. Overall, SAS Enterprise Miner and IBM's SPSS tools stand out when it comes to supporting more advanced analytical techniques and model scoring, as well as a broader array of analysis functions, including neural networks, association analysis and visualization capabilities.
Depending on the use case and application, your organization's users will be required to support different types of analytics capabilities that will use specific types of modeling, such as regression, clustering, segmentation, behavior modeling and decision trees.
While most of the products broadly support these forms of analytical modeling at a high level, some vendors have invested decades of work into tweaking different versions of their algorithms and adding more sophisticated functionality. It's important to understand which models are most relevant to your business problems and to evaluate the products in terms of how well they serve your users' business needs.
The more mature and higher-end -- and, accordingly, higher-priced -- tools will exhibit the greatest analytical breadth. Oracle Data Miner includes an array of well-known machine learning approaches to support clustering, predictive mining and text mining. Both editions of IBM's SPSS product provide a diverse set of analytical techniques and models. And SAS Enterprise Miner supports many algorithms and techniques, including decision trees, time series, neural networks, linear and logistic regression, sequence and web path analysis, market basket analysis, and link analysis.
Leading vendors of big data analytics software
- Alteryx, which consists of a Designer module for designing analytics applications, a Server component for scaling across the organization and an Analytics Gallery for sharing applications with external partners.
- IBM, which provides SPSS Modeler, a tool targeted to users with little or no analytical background. IBM also offers SPSS Statistics, which is geared toward more sophisticated analysts.
- KNIME, an open source product commercialized by software vendor KNIME AG that includes an analytics platform and several commercial extensions for big data, cluster operations and collaboration.
- Microsoft R, which spans several products -- Microsoft R Open, a free download that is an enhanced version of the R programming language, and Microsoft R Client and R Server, which support the use of R in clustered environments, such as Hadoop.
- Oracle Advanced Analytics, which includes Oracle Data Miner, Oracle R Advanced Analytics for Hadoop and Oracle Big Data Discovery, as well as connectors and interfaces for SQL and R.
- RapidMiner, which provides a Studio component for design, a Server component, a Hadoop connector called Radoop and a component for stream processing.
- SAP Predictive Analytics, which comprises two versions: Automated Analytics for business users without a formal background, and Expert Analytics, which is targeted to professional data analysts and data scientists.
- SAS Enterprise Miner, which is intended to help users quickly develop descriptive and predictive models, including components for predictive modeling and in-database scoring.
- The Teradata Aster Analytics framework, which is offered by Teradata with its Aster database, is an analytics platform with built-in analytics functions, a graph processing engine, MapReduce and a version of R.
The newer generation -- and, in some cases, lower-priced -- products support different models, but perhaps with a narrower range of algorithmic sophistication.
The model inventory in Alteryx Analytics Gallery includes such capabilities as regression analysis, decision trees, association rule analysis, classification and time series analysis. KNIME includes methods for text mining, image mining and time series analysis, and also integrates with other open source projects, such as Weka for machine learning algorithms and JFreeChart for charting.
Another aspect of analytical diversity is integration with programming languages and statistical tools, such as R, for incorporating existing libraries, as well as user-defined functionality. In fact, integration with R could be considered an increasingly critical differentiator.
Alteryx Designer, Microsoft R, SAS Enterprise Miner, Teradata Aster Analytics, Oracle's ORAAH and KNIME's Analytics Platform all support integration with R. Several of the vendors, including IBM, Oracle, Microsoft, RapidMiner and SAP, provide a growing library of extensions to R and Python, enabling users to take advantage of free libraries.
Scope of the data to be analyzed
There are multiple facets of the scope of the data to be analyzed, including the issue of structured vs. unstructured information, as well as access to conventional on-premises databases and data warehouses, cloud-based data sources, and data managed in big data platforms, such as Hadoop.
Support varies, however, for data managed within less-conventional data lakes -- either managed within Hadoop or in another NoSQL data management system intended to provide horizontal scaling. The factors for distinguishing among the products must be based on your organization's specific requirements for accessing and processing data volumes and data variety.
In recognition of the growing diversity of input sources and the variety of underlying systems used to house those data sets, another set of emerging features that is being adopted by these vendors involves data accessibility. IBM, RapidMiner, Alteryx, Oracle and Microsoft have all improved their tools' data import, export and connectivity capabilities. These enhancements should enable users to access a more comprehensive list of data sources while simplifying and speeding up the process of loading data into the products.
Support for scalability and high performance
The need for scalable performance is driven by your organization's data volumes and appetite for analysis. Smaller organizations with less data may be able to tolerate products whose performance doesn't scale with the available resources. Examples include the entry-level versions of the lower-end tools, such as RapidMiner, KNIME, Microsoft R Open and Alteryx Designer, which can run on desktop systems and don't require additional server components.
Larger organizations are more likely to have a greater inventory of data sets to analyze, as well as broader communities of users. This introduces two additional requirements -- high performance and facilitation of collaboration. The adaptability of a product to high-performance architectures is a good indication of scalability, and most of the products can be adapted to the parallelism of Hadoop or employ some other means of achieving faster computation.
All of the products do have some support for Hadoop, including IBM SPSS Modeler and SPSS Statistics; RapidMiner's commercial component Radoop, which connects the Studio front end and Server analysis engine to data stored in Hadoop; Oracle's Big Data Discovery and ORAAH tools; and KNIME's Big Data Extensions and Cluster Execution add-ins.
IBM SPSS now also provides enhanced support for a number of multithreaded analytical algorithms that may speed performance. Teradata Aster Analytics addresses high-performance requirements through its Massively Parallel Processing architecture. SAP's Expert Analytics version of SAP BusinessObjects Predictive Analytics can execute in-memory data mining for handling large-volume data analysis efficiently. Microsoft R Server leverages its ScaleR module, a comprehensive library of big data analytics algorithms that support parallelization. Scoring algorithms implemented using SAS Enterprise Miner can be deployed and executed within a Hadoop environment.
In addition, integration with Apache Spark appears to be of growing importance. SPSS, KNIME, Oracle, RapidMiner and SAP all provide access to Apache Spark libraries to support analytics applications that need to scale with exploding data volumes. This enables developed applications to take advantage of a high-performance cluster platform to distribute the workflow across the cluster.
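The underlying idea of distributing work across a cluster can be sketched as partition, compute and combine. The example below is an assumption-laden illustration in plain Python, not Spark code and not any vendor's API: local worker threads stand in for cluster nodes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partition(values: Sequence[float], num_partitions: int) -> List[Sequence[float]]:
    """Split the data set into roughly equal slices, one per worker."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def partial_stats(chunk: Sequence[float]) -> Tuple[float, int]:
    """Work done independently on each partition: a partial sum and a count."""
    return sum(chunk), len(chunk)


def distributed_mean(values: Sequence[float], num_partitions: int = 4) -> float:
    """Compute a mean by combining per-partition partial results."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    chunks = [chunk for chunk in partition(values, num_partitions) if chunk]
    if not chunks:
        raise ValueError("cannot compute the mean of an empty data set")
    # Threads stand in for cluster nodes; each processes only its own partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


print(distributed_mean(list(range(1, 101))))  # 50.5
```

Because each partition yields only a small partial result, adding workers lets the same logic handle larger data volumes. Algorithms that can't be decomposed this way are harder to scale, which is one reason to check which of a product's algorithms actually support parallel execution.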
As noted, the larger the organization, the more likely there will be a need to share analyses, models and applications across different groups and among many analysts. Organizations that have many analysts distributed across the enterprise may look for increased means to share models and collaborate regarding the interpretation of results.
IBM's SPSS Modeler Gold edition provides collaboration capabilities, and RapidMiner's Server product provides support for sharing and collaboration. Alteryx Analytics Gallery provides a mechanism for sharing sophisticated analytics applications in the cloud with members of an extended organization. KNIME offers commercial extensions to support team collaboration, as well as extensions supporting operational collaboration, such as remote-scheduled execution, report generation, shared data space and a workflow repository. SAS Enterprise Miner's client-server architecture enables business users and data analysts to work collaboratively by sharing models and other work products.
Alteryx, KNIME and Teradata Aster have added capabilities to help manage analytical workflows. Also, some of the vendors have started to look at ways to enable their tools to integrate with others that may have complementary functional sweet spots. For example, Teradata Aster now has an extension to integrate with KNIME that enables users to leverage the KNIME workflow editor and incorporate Aster Analytics functions into those workflows.
Vendor size and product integration
Vendors can also be compared in terms of their size. At one end are what could be referred to as the mega-vendors, whose big data analytics tools are just one product line within a massive portfolio of tools. If you work for a larger organization that typically negotiates site-wide enterprise licenses for the full suite of a vendor's tools, a mega-vendor such as IBM, SAS, SAP or Oracle may be a reasonable choice.
The large vendors sell big data analytics tools that are a part of a much larger tool ecosystem. Presumably, the products from a mega-vendor will be at least somewhat integrated and intended to work together. In addition, some people feel more comfortable with bigger vendors, with an expectation of stability and consistent customer service. On the other hand, you may only be able to acquire these big data analytics tools as part of a much larger software licensing arrangement.
Smaller vendors, such as KNIME, Alteryx and RapidMiner, have revenues that are generally based on licensing and support for a small number of big data analytics products. A smaller vendor may provide closer contact with its product management and innovation teams, and you may be able to influence the direction of the product roadmap or enhanced functionality.
A smaller vendor might also be more flexible in terms of price and the features included in the licensing arrangement. You must realize, however, that working with a smaller vendor does present some risk in terms of stability, the resources available for support and the possibility that the company may be acquired, which can impact the customer relationship.
The larger vendors are clearly responsive to user needs for integration with other systems, although that often centers on other products within each vendor's inventory. For example, SAP Predictive Analytics has improved integration with SAP HANA and BusinessObjects Cloud. SAS Enterprise Miner has added nodes to execute code in SAS Viya, the vendor's open, cloud-ready, in-memory environment. Microsoft offers SQL Server R Services, an R installation that runs alongside SQL Server and enables users to integrate Microsoft R Server data with SQL Server and Microsoft's other business intelligence tools.
Budget for licensing and maintenance
Almost all of the vendors sell different versions or editions of their products, with a range of costs for acquisition and total cost of operation. IBM, Oracle, RapidMiner, Teradata and Microsoft sell editions at different tiers, with the license cost proportional to the features, capabilities and freedom from limitations in terms of the volumes of data to be analyzed or the number of processing nodes the product can use.
KNIME and RapidMiner provide free and open source versions of their products, either charging for support services or for editions supporting enterprise-class applications. KNIME, RapidMiner and Alteryx have relatively low licensing costs for a smaller number of users. If you're considering SAS or SAP, you must contact them for pricing alternatives.
The marketplace for big data analytics software can be a confusing place, but hopefully this article has helped you understand the benefits big data analytics software can provide your organization, and assisted you in differentiating between the specific tools examined here.