R, the open source predictive analytics language, has been around for almost 20 years. Its time may have finally arrived.
Originally developed by statisticians Ross Ihaka and Robert Gentleman at the University of Auckland in 1993, R is a free software language for statistical computing and graphics. It allows statisticians and others with sophisticated programming skills to perform complex data analysis and display the results in a wide variety of graphical formats.
While powerful in the hands of skilled statisticians, for most of its lifespan the open source language was not accessible to less sophisticated users, nor was it designed to handle large data volumes. And R has long been overshadowed by proprietary predictive analytics technology from SAS and SPSS, the latter now part of IBM.
A company called Revolution Analytics, headed by Norman Nie, the co-founder and first CEO of data mining specialist SPSS, is one of a handful of companies trying to bring R out of the shadows. Founded in 2007, Revolution Analytics’ core product, Revolution R Enterprise, includes a front-end graphical user interface that makes it easier for non-statisticians to work with the R language.
Instead of having to write complex R code, users can point-and-click to create sophisticated data models, according to David Smith, Revolution’s vice president of marketing.
And earlier this month, the vendor released an add-on package it calls RevoScaleR that allows for analysis of large data sets. The new framework makes it possible, for example, to run complex analysis on 10 million rows of data with six variables in less than a minute and a half, Smith said. The same job previously took 10 minutes or more, he said.
The time may be ripe for Revolution’s mission. General business intelligence (BI) reporting and ad hoc query capabilities are quickly becoming standard tools at many large organizations. Predictive analytics is emerging as the next great competitive differentiator, and R could become its leading language, according to statisticians and industry observers.
R delivers top-notch data visualizations, flexibility
Hadley Wickham has been working with R since his undergraduate days at the University of Auckland, the open source language’s birthplace. Wickham, who received his bachelor’s in statistics and computer science from Auckland in 2002, earned a Ph.D. in statistics at Iowa State University and is now himself a statistics professor at Rice University in Houston.
In 2006, Wickham created ggplot2, an open source R package for producing statistical graphics and data visualizations. His goal, he said, is to find the best ways to analyze large, complicated data sets and present the results in the most compelling and useful visualizations possible.
Wickham has worked with other predictive analytics tools and languages, but “I think R is pretty much unparalleled in that respect,” he said.
First, R produces the highest-quality visualizations of any language he’s used, Wickham said, a claim made by other statisticians as well.
But R also benefits from the open source model. Developers and statisticians are constantly experimenting with the predictive analytics language, and new R-based visualizations are developed much more rapidly than their proprietary counterparts.
“One of the real strengths of R is the community,” Wickham said. “You get access to that absolute cutting-edge stuff.”
R’s biggest differentiator, however, is its flexibility. Because of its code-based interface, statisticians and programmers have the ability to customize predictive analytics models and visualizations down to the tiniest detail, Wickham said. Most proprietary analytics tools simply don’t offer that level of granular control, instead offering users a pre-populated library of visualization templates.
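The fine-grained control Wickham describes is visible even in base R graphics, where nearly every element of a plot is set explicitly in code. A minimal sketch (the data and styling choices here are illustrative, not drawn from any of the tools mentioned):

```r
# Render to a temporary PDF so the sketch runs without a display
out <- tempfile(fileext = ".pdf")
pdf(out)

x <- seq(0, 10, by = 0.1)
plot(x, sin(x), type = "l",
     lwd  = 2,            # line width
     col  = "steelblue",  # line color
     bty  = "l",          # draw only the left and bottom axis box
     las  = 1,            # horizontal tick labels
     xlab = "x", ylab = "sin(x)",
     main = "Every plot element is controlled in code")
abline(h = 0, lty = 3, col = "grey60")  # dotted reference line

dev.off()
```

Each argument (`lwd`, `bty`, `las` and dozens of others) is a separate dial a statistician can turn, which is the level of granular control that template-driven proprietary tools rarely expose.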
Marick Sinay, a quantitative financial analyst with Bank of America, agrees. Like Wickham, he has been working with R since his college days.
“[R] just gives people flexibility,” said Sinay, who was speaking for himself and not for Bank of America. “I was able to write my own programs, write my own algorithms,” which was impossible, or at least more difficult, with proprietary predictive analytics tools like SPSS.
After stints teaching at UC Santa Barbara and consulting for the auto industry, Sinay joined Bank of America in 2009. He was happy to find that his fellow analysts at the bank were themselves using R for a variety of predictive analytics functions. Also like Wickham, Sinay touted the superiority of R’s visualization capabilities.
“R produces publication-quality [visualizations],” Sinay said. “SAS’s graphics are years behind,” he said, adding that they “look kind of tacky” in comparison.
Since joining the bank, Sinay has continued using R to forecast economic and regulatory capital requirements under Basel rules. He said he likes the transparency R provides compared with other predictive analytics tools.
In a recent operational risk management project, for example, Sinay applied Monte Carlo methods in R to a large data set. Because he wrote the code in R himself, he was able to dig into the resulting analysis and experiment with different algorithms to a degree he couldn’t with proprietary software, he said.
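As an illustration of the general approach Sinay describes (not his actual model), a frequency-severity Monte Carlo simulation of annual operational losses takes only a few lines of base R; the distributions and parameter values below are assumptions made for the sketch:

```r
set.seed(42)       # make the simulated draws reproducible
n_years <- 100000  # number of simulated years

# Assumed model: a Poisson-distributed number of loss events per year,
# each event with a lognormally distributed severity
events <- rpois(n_years, lambda = 25)
annual_loss <- vapply(events, function(k) {
  sum(rlnorm(k, meanlog = 10, sdlog = 2))
}, numeric(1))

# 99.9th-percentile annual loss, the confidence level used in
# Basel operational-risk capital calculations
var_999 <- quantile(annual_loss, probs = 0.999)
```

Because the whole simulation is plain R, each piece (the frequency model, the severity distribution, the quantile) can be inspected or swapped out directly, which is the transparency Sinay contrasts with black-box tools.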
“For me to pass it off to SAS,” Sinay said, “[means] not know[ing] exactly what it’s doing and not being able to make modifications.”
“If I’m pointing and clicking, I never really know what it’s doing,” he said.
GUIs could bring R to the masses
But point-and-click functionality is just what Nie and Revolution Analytics are trying to bring to the world of R. The company is betting that the benefits of a simple graphical user interface on top of R will offset the resulting decrease in customization and flexibility.
And it may. While Revolution R Enterprise probably won’t appeal to statisticians who take pride in writing complex code in R, Nie hopes it will bring the power of R and its unparalleled visualizations to a new class of user: the casual business user.
James Kobielus, an analyst with Forrester Research, said user-friendly predictive analytics tools could “play a pivotal role in day-to-day business operations.” Line-of-business managers, for example, could use the technology to test what-if scenarios and adjust spending plans to take advantage of the latest economic forecasts.
Predictive analytics capabilities in the hands of inventory managers could help companies better streamline their supply chains. Even call center representatives, given predictive analytics capabilities, could use the technology to find up-sell and cross-sell opportunities while they have customers engaged on the phone.
But none of this is possible without making the technology easier to use.
Bank of America’s Sinay, for one, thinks there is room enough for both GUI-enabled R-based applications like Revolution R Enterprise and the bare-bones form of the R programming language he often uses.
“I definitely think there’s going to be both,” he said, noting that R Commander, another R GUI, is gaining popularity with entry-level users. Other R GUIs include Rattle and Deducer.
Revolution’s product also makes it possible to process large data sets, something that is difficult with standard R, which must hold an entire data set in memory. Analyzing so-called “big data” with straight R is time consuming, Revolution’s Smith said.
Sinay, in fact, sometimes uses Revolution’s product when analyzing large volumes of data.
“There’s going to be a certain segment of the population that’s going to want point-and-click [predictive analytics capabilities],” Sinay said. “For a certain niche of people, I definitely see that this is going to be an avenue that they’ll pursue.”
Wickham agreed, conceding that R is “not that useful for casual users.” But for him, nothing beats R in its rawest form.
“Just using a GUI is a real step backwards,” Wickham said. Especially for companies with more complex predictive analytics needs, “you’re better off spending the time to learn the underlying capabilities of R.”