A buyer's guide to selecting the right big data analytics software
A collection of articles that takes you from defining technology needs to purchasing options
Big data analytics tools enable users to analyze a wide variety of information -- from structured transaction data to social media posts, Web server log files and other forms of unstructured and semi-structured data. Once your organization has decided to buy a big data analytics tool, the next step is to create a process for evaluating the available products and then find the one that best fits your needs and requirements.
Let's examine the must-have features and specific attributes that can be used to assess how well the various big data analytics tools available will meet your organization's needs. You can then develop a request for proposal (RFP) by mapping how those needs are satisfied using these tools.
Breadth and depth of modeling techniques. Vendors have applied different levels of effort and, correspondingly, have developed analytics capabilities with diverse levels of sophistication. Breadth refers to the range of analytical modeling approaches that an individual tool supports. Some examples include regression techniques, time series models that predict variable values based on an analysis of past trends, classification and regression trees (also known as CART), and neural networks.
The depth of the modeling techniques characterizes two aspects of the approaches employed: the algorithmic sophistication that provides greater accuracy and precision of the developed models, and the flexibility of the modeling techniques. In other words, what level of expertise in data mining and predictive analytics is necessary to understand what kinds of models can be developed, and how can they be built with a particular tool?
Less experienced data analysts may be interested in vendor products that provide a broad swath of analytics capabilities, whereas more expert analysts and statisticians may prefer tools with greater depth in specific types of analytic models.
Integration and accessibility. Big data analytics applications often rely on a growing number of internal and external data sources, containing both structured and unstructured data. This drives a need for functionality supporting data accessibility and systems integration. Features to consider in this area include the following:
- Unstructured data utilization. Verify that the product is able to ingest the different types of unstructured data (documents, emails, images, videos, presentations, streams from social media channels, etc.) and can parse and make sense of the incoming information.
- Big data accessibility. Compare how the vendors' tools connect to big data architectures, including distributed data stored in Hadoop, as well as files managed within other types of scale-out storage (for example, NoSQL databases such as MongoDB or Apache Cassandra).
- Interoperability with existing platform components. This is crucial when there's an expectation of blending more traditional data management and BI practices with analytics methodologies. For example, many analytics tools allow analytical models to be invoked through traditional SQL queries. This form of interoperability allows the results of the predictive models to drive the kind of querying and reporting with which more traditional data analysts are typically comfortable.
- Connectivity. It's important to assess connectivity, or how well the product can access other systems as well as act as a source to feed results to established platforms for reporting and analysis.
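To illustrate the SQL interoperability described above, here is a minimal sketch using Python's built-in sqlite3 module. The scoring function, its coefficients, the table and the sample rows are all hypothetical; a commercial analytics tool would expose its own models through its own mechanism.

```python
import sqlite3

# Hypothetical scoring model; the coefficients are illustrative only.
def churn_score(monthly_spend, support_tickets):
    """Return a made-up risk score clamped to the range [0, 1]."""
    raw = 0.5 - 0.004 * monthly_spend + 0.1 * support_tickets
    return max(0.0, min(1.0, raw))

conn = sqlite3.connect(":memory:")
# Expose the Python model to SQL as a two-argument scalar function.
conn.create_function("churn_score", 2, churn_score)

conn.execute(
    "CREATE TABLE customers (id INTEGER, monthly_spend REAL, support_tickets INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 100.0, 0), (2, 25.0, 4), (3, 50.0, 1)],  # made-up sample rows
)

# A conventional report query that filters and sorts on the model's output.
high_risk = conn.execute(
    "SELECT id, churn_score(monthly_spend, support_tickets) AS score "
    "FROM customers "
    "WHERE churn_score(monthly_spend, support_tickets) > 0.3 "
    "ORDER BY score DESC"
).fetchall()

print(high_risk)
```

An analyst who knows only SQL can use the model's output in a WHERE or ORDER BY clause without touching the modeling code.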
Ease of use. Some big data analytics products have been built from the ground up by vendors, while others are based on the open source statistical language R. In either scenario, this evaluation category focuses on how easy the product is to use for analyzing data, developing models and determining the efficacy and accuracy of the models. Evaluation criteria should include the following:
- Usability for business analysts. Can business analysts without a statistical background easily develop analyses and applications? Check if the product provides visual methods that facilitate development and analytics uses.
- Flexibility in deployment for different business use cases. As suggested in a previous article, the same algorithmic methods can be applied in many different business scenarios for different industries. If the kinds of analyses your organization plans to do are somewhat limited and are centered on more general use cases (such as customer lifetime value analysis, fraud analysis or customer retention), you may be able to tolerate less flexible techniques. However, should your organization desire a broader and less constrained approach to analytics, look for a greater degree of modeling flexibility.
- Model scoring. This includes additional tools that help analysts automatically compare the accuracy, efficacy and predictive value of different predictive models intended for similar business scenarios.
- Collaboration. Isolated analysis and development can lead to replicated efforts and uncoordinated results. Providing a means of integrating collaboration capabilities and sharing analytical models as part of the big data analytics platform enables analysts to work together to refine their applications and subsequently reuse the same models, thereby lowering development costs while increasing consistency.
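As a rough illustration of the model scoring criterion above, the following sketch compares two candidate models on the same holdout data. The data points and the two threshold "models" are invented for the example; real tools typically report richer measures than plain accuracy.

```python
# Hypothetical holdout set of (feature value, true label) pairs.
holdout = [
    (0.2, 0), (0.4, 0), (0.5, 1), (0.7, 1),
    (0.9, 1), (0.3, 0), (0.6, 1), (0.45, 1),
]

# Two stand-in models: simple threshold classifiers with different cutoffs.
candidates = {
    "model_a": lambda x: int(x >= 0.5),
    "model_b": lambda x: int(x >= 0.7),
}

def accuracy(model, data):
    """Fraction of holdout records the model labels correctly."""
    correct = sum(1 for x, y in data if model(x) == y)
    return correct / len(data)

# Score every candidate on the same data so the results are comparable.
scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
best = max(scores, key=scores.get)

print(scores, best)
```

The key point is that every candidate is measured against identical data with an identical metric, which is what an automated scoring feature should guarantee.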
System management capabilities of big data analytics tools
The practical aspects of integrating a new technology into the organization must also be considered. Evaluating the simplicity of administration and configuration includes understanding any system requirements and dependencies for installation, configuration and ongoing management. For example, the big data analytics tools that take advantage of the statistical models in R require that the R environment be acquired and installed at the same time the products are installed. This will also include identifying the platforms on which the product may be installed, as well as determining the platforms that can embed the developed models and applications.
Other considerations include security associated with the designation of roles and access rights for both the analytics process and the incorporation of developed models into business applications. Explore what options the products provide for authentication, authorization and access control.
Most high-end Hadoop platforms and specialty appliances are engineered to provide multiple compute nodes for parallel processing and distributed computing. If a high level of execution performance is a requirement, it's critical that the products you evaluate take advantage of massively parallel processing (MPP) system configurations.
Using an MPP platform introduces a need for the selected tool to efficiently use the platform's performance optimizations, including:
- Parallelism and data distribution. Parallel systems work best when parallel processes execute independently on data sets that are distributed in a way that minimizes network bandwidth and maximizes data locality. Review how well the product's parallelization dovetails with the data distribution strategy.
- The product's push-down capability. This enables the analysis algorithms to take advantage of the inherent capabilities of the other components of the system stack. An example would be if a database management system provides parameterized modeling utilities as part of its tool suite, and those utilities have been natively optimized to take advantage of the architecture features in the DBMS. In this case, it's wise for the analytics tool to use the native capability rather than attempt to replicate it.
- Scalability and elasticity. As data volumes expand and data management platforms are scaled out, assess whether each analytics product is designed to scale linearly with the increased processing and storage capacity.
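The parallelism and data distribution point can be sketched in a few lines. In this simplified example, threads stand in for compute nodes and a Python list stands in for a distributed data set; both are assumptions made for illustration, not a description of any product's architecture.

```python
from concurrent.futures import ThreadPoolExecutor

values = list(range(1, 101))  # stand-in data set: the integers 1..100
num_partitions = 4

# Distribute the data round-robin across the partitions ("nodes").
partitions = [values[i::num_partitions] for i in range(num_partitions)]

def partial_stats(chunk):
    """Runs independently on one partition; only a small summary is returned."""
    return sum(chunk), len(chunk)

# Each worker processes its own partition without reading the others.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partials = list(pool.map(partial_stats, partitions))

# Combine the small per-partition summaries into the global result.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count

print(mean)
```

Only a (sum, count) pair per partition crosses the "network," rather than the raw records, which is the data-locality behavior to look for in a product.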
The cost of big data analytics tools
In most cases with big data technologies, the prices of products understandably influence the buying decision. Some big data analytics tools are costly, while others are low-cost or, in some cases, free. In other cases, a vendor may offer different features, capabilities or freedom from constraints (such as limitations on analyzed data volumes), depending on the price paid.
Another consideration is the need for special services. For each of the products to be evaluated, assess whether it's necessary to engage the software vendor or external experts to help with installation and training or to provide specialty development services.
Also, be sure to consider the long-term total cost of ownership (TCO) of the tools you're evaluating. TCO calculations should include annual maintenance fees and an allocated share of the costs for the system stack supporting the product, as well as an allocation for operations and maintenance staff, data center space, cooling and other utilities.
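A simple TCO estimate along these lines might be sketched as follows. Every figure is a placeholder rather than a real price, and the cost categories are only one possible breakdown.

```python
def total_cost_of_ownership(license_fee, annual_costs, years):
    """One-time license fee plus recurring annual costs over the planning horizon."""
    return license_fee + years * sum(annual_costs.values())

# Placeholder annual figures; substitute your own allocations.
annual_costs = {
    "maintenance_fees": 20_000,
    "system_stack_allocation": 15_000,
    "operations_staff": 60_000,
    "data_center_and_utilities": 5_000,
}

tco = total_cost_of_ownership(license_fee=100_000, annual_costs=annual_costs, years=5)

print(tco)
```

Running the same calculation for each candidate product, over the same number of years, keeps a low purchase price from hiding high recurring costs.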
Developing your RFP
Narrow down your set of candidate vendors by filtering out those whose products do not address your organization's specific use cases. Examine how your organization's requirements map to the evaluation categories described above and create an RFP that, aside from the standard set of questions about integration, interoperability and corporate details, focuses on quantifying conformance with your expectations for factors such as analytical modeling, data volumes, necessary levels of expertise and data accessibility requirements.
Determine the most critical differentiating factors, such as the ability of the product to scale and perform well based on growing data volumes, its ability to consume unstructured data, and the breadth and depth of the modeling capabilities. At the same time, develop questions that reflect the needs of your user community, especially if there are analysts with different levels of expertise or there's a need for enterprise collaboration. In addition, key influencing factors for selecting a big data analytics tool include its initial price, its staffing requirements and its total cost of operations, making questions about cost and budget relevant to the evaluation.
Articulating and prioritizing the business needs and then specifying the expectations from the pool of vendor products will enable your acquisition team to map the business needs to the categories for evaluation. Configure your RFP by reviewing the list above, defining the questions that need to be asked, and specifying the acceptable responses to determine the degree to which any specific product meets your needs.
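One common way to quantify how well RFP responses meet your needs is a weighted scoring matrix. The sketch below assumes hypothetical vendors, category weights and 1-to-5 ratings; your acquisition team would set its own.

```python
# Hypothetical category weights (must sum to 1.0) reflecting business priorities.
weights = {
    "modeling": 0.3,
    "integration": 0.2,
    "ease_of_use": 0.2,
    "performance": 0.2,
    "cost": 0.1,
}

# Hypothetical ratings on a 1-5 scale derived from each vendor's RFP response.
responses = {
    "vendor_x": {"modeling": 4, "integration": 3, "ease_of_use": 5, "performance": 3, "cost": 2},
    "vendor_y": {"modeling": 3, "integration": 5, "ease_of_use": 3, "performance": 4, "cost": 4},
}

def weighted_score(ratings, weights):
    """Sum of each category rating multiplied by that category's weight."""
    return sum(weights[category] * ratings[category] for category in weights)

totals = {vendor: weighted_score(r, weights) for vendor, r in responses.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

print(totals, ranking)
```

Agreeing on the weights before the responses arrive helps keep the evaluation tied to business needs rather than to whichever product demonstrated best.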