Pentaho 7.0 update combines data preparation tool with analytics

Users increasingly want data preparation to be tightly integrated into the analytics process to shorten time to insight, and Pentaho's new 7.0 software release aims to satisfy that need.

Pentaho's software has always been geared to combining data integration and analytics, but with its upcoming 7.0 platform, the company is taking a further step down that road in an effort to accelerate and improve the data preparation process.

Due for general release in November, the upgraded analytics, integration and data preparation tool lets users visually inspect data at any point in the processing and preparation pipeline. That means data scientists, data engineers and business analysts can use charts, graphs and other visualizations to validate data on the fly and potentially address data quality issues before running an entire analytics job.

For example, they can see if combining two tables of data results in too many missing values to support accurate analytics, or if applying a regression analysis technique to a data set during an extract, transform and load (ETL) integration process produces erroneous information. Pentaho 7.0 also allows IT teams to publish predefined data sources for business users in order to boost collaboration during data preparation efforts.
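The missing-values check described above can be illustrated outside of any particular product. The following is a minimal Python sketch, not Pentaho's implementation: the sample tables, the helper names `left_join` and `missing_ratio`, and the 25% threshold are all assumptions made for illustration.

```python
# Illustrative sketch only: the table contents, helper names and threshold
# are hypothetical and are not part of Pentaho 7.0.

def left_join(left, right, key):
    """Left-join two lists of row dicts on `key`.

    Columns from `right` are set to None for rows with no match.
    """
    right_index = {row[key]: row for row in right}
    right_columns = {col for row in right for col in row if col != key}
    joined = []
    for row in left:
        match = right_index.get(row[key], {})
        merged = dict(row)
        for col in right_columns:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


def missing_ratio(rows, column):
    """Return the fraction of rows in which `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
    {"customer_id": 3, "region": "North"},
    {"customer_id": 4, "region": "South"},
]
orders = [
    {"customer_id": 1, "order_total": 120.0},
    {"customer_id": 3, "order_total": 75.5},
]

MAX_MISSING = 0.25  # assumed tolerance for missing values after the join

joined = left_join(customers, orders, "customer_id")
ratio = missing_ratio(joined, "order_total")
too_sparse = ratio > MAX_MISSING

print(f"missing order_total: {ratio:.0%}, too sparse: {too_sparse}")
```

Running a check like this right after the join, rather than at the end of the full job, is the kind of early validation the release is meant to support visually.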

The ability to visually inspect and assess data in a collaborative way as it's run through a preparation routine should reduce the amount of time that data scientists and other users need to spend getting data ready for analytics uses, according to Pentaho, a subsidiary of Hitachi Group Ltd. that's based in Orlando, Fla. In addition, data engineers will be less likely to create data pipelines that have data quality problems, the company said.

Pentaho 7.0 also includes new integration with Spark SQL, enabling ETL developers and data analysts to use the variant of standard SQL to query data in Apache Spark clusters. Several other features similarly designed to better handle big data environments are being added as well, including support for the Kafka message queueing system and the Avro and Parquet file formats.

David Menninger, a technology analyst at Bend, Ore.-based Ventana Research, said Pentaho's new functionality for blending analytics with its data preparation tool reflects an ongoing trend in data management.

Increasingly, enterprises are looking to tie together the tasks of data preparation and analytics more closely and make the combination more of a self-service process, Menninger said. "Self-service data preparation has become all the rage. Realistically, it needs to be tightly integrated with the analytics process."

For now, Menninger thinks Pentaho is ahead of the rest of the market in making that happen, but he expects other vendors to catch up with similar functionality relatively soon.

For example, Paxata -- one of several self-service data preparation tool vendors to emerge over the past few years -- is working to broaden its software with higher-level features. Planned additions include capabilities for guiding users on required data transformations and helping them to better understand data at the semantic level via machine learning technology.

The Redwood City, Calif., company took an initial step last month, releasing an update of its namesake software with a new Paxata Connect technology that can pull together data from different Hadoop clusters, NoSQL databases and other systems. Paxata plans to add more functionality on a quarterly basis, but fully building out the envisioned platform "is a multiyear journey," said Nenshad Bardoliwalla, its chief product officer.

Executive editor Craig Stedman contributed to this report.
