The Human Rights Data Analysis Group is a nonprofit organization that analyzes data coming out of war-torn regions around the globe to develop accurate assessments of casualties. The work is stats-heavy, so the group needed a tool that would allow it to develop in-depth models.
"Serious data analysis is not something you're going to do using a mouse and drop-down boxes," said HRDAG's director of research Megan Price. "It's the kind of thing you're going to do getting close to the data, getting close to the code and writing some of it yourself."
This is why HRDAG decided to deploy a product from Revolution Analytics that is based on the R programming language. Price said for organizations that are doing deep statistical analyses, R can be an attractive tool.
The popularity of R has grown significantly in recent years. A 2013 survey of data mining professionals conducted by Rexer Analytics indicated that the R programming language is by far the most popular statistical analysis tool, with 70% of respondents saying they use it at least occasionally. The previous year's survey showed that 47% used the tool. Programmers with R skills are commanding generous salaries. A recent survey of IT professionals conducted by Dice.com found that R programmers are the highest paid big data professionals. With average salaries coming in at $115,531, R programmers are compensated better than MapReduce, NoSQL and Cassandra programmers.
The growing popularity among programmers and organizations has attracted the attention of technology vendors, many of which now offer R-based products or support the language in their own software. Revolution Analytics, one of the primary vendors offering entirely R-based products, was named a visionary in Gartner's inaugural Advanced Analytics Magic Quadrant. StatSoft, another company offering R-centric products, received Gartner's challenger distinction. More traditional companies, including Oracle and SAS, are jumping on the bandwagon by offering support for R in their products.
But this enthusiasm doesn't mean the R programming language is without its drawbacks. David Smith, chief community officer at Revolution Analytics and author of the manual An Introduction to R, said the language was developed primarily for small-scale statistical analyses in academic settings. In today's big data world, some of its founding features have turned into bugs.
"[R] is extraordinarily powerful when it comes to exploring data, visualizing data and developing new statistical models," Smith said. "But because innovation was its design focus, R doesn't address issues like performance and scalability and integration into enterprise IT systems."
One problem with open-source R, Smith said, is that it is single-threaded. A user might run R on a machine that has a multicore processor, but R will inherently run a job on just one core, slowing down analyses. Additionally, it was designed to run on a single desktop computer, which makes it difficult to integrate into larger enterprise systems. Finally, R is inherently memory-bound. It processes data in memory, so the size of a data set it can handle is limited by the memory available on the machine. This has led to the main criticism of R, which is that it doesn't scale up to big data. Smith said Revolution's products address these problems.
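The memory-bound point can be illustrated with a minimal sketch. It is written in Python rather than R, it is not how R or Revolution's products work internally, and the data source `read_values` is a hypothetical stand-in. It only contrasts loading a whole data set into memory with processing it one value at a time.

```python
from typing import Iterable, Iterator


def read_values(n: int) -> Iterator[float]:
    """Hypothetical data source that yields n values one at a time."""
    for i in range(n):
        yield float(i)


def mean_in_memory(values: Iterable[float]) -> float:
    """Memory-bound style: materialize every value before computing.

    Assumes at least one value is supplied.
    """
    data = list(values)  # the whole data set must fit in RAM
    return sum(data) / len(data)


def mean_streaming(values: Iterable[float]) -> float:
    """Streaming style: keep only a running total and a count.

    Assumes at least one value is supplied.
    """
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


if __name__ == "__main__":
    # Both approaches give the same answer; only the memory use differs.
    print(mean_in_memory(read_values(1_000_000)))
    print(mean_streaming(read_values(1_000_000)))
```

The first function needs memory proportional to the size of the data set, while the second needs only two numbers no matter how many values arrive. A simple mean streams easily, but many statistical models do not, which is part of why the scaling criticism is hard to engineer away.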
"R is not the best for running against big data," said Mike Gualtieri, an analyst with Cambridge, Mass.-based Forrester Research. "It can choke on that."
Additionally, Gualtieri said the language is very technical, which means a steep learning curve. Organizations that are looking for more visual software might find better options, he said.
But in the right setting, the R programming language can be a powerful tool, Gualtieri said. Vendor products based on the language are often less expensive than more traditional analytics software. Additionally, there is an active community of R users who publish prewritten scripts in a library called the Comprehensive R Archive Network (CRAN). If a user wants to perform a certain kind of analysis, chances are they will find an appropriate prewritten script in CRAN, which Gualtieri said can save a programmer a lot of time.
The CRAN library is one of the features that Price, of HRDAG, likes best. She said it helps her small organization make the best use of its limited resources.
"Nine times out of 10 when we need code that does a certain kind of model or does a certain kind of analysis someone's gotten there ahead of us," she said. "Not only do we have a package that we can start from but typically we also have someone we can contact and ask questions."
As for the criticisms that R doesn't scale up and is too technical, Price said they are fair. But she said the decision of whether to use R or some other tool comes down to an organization's use case. HRDAG works on relatively small data sets. If you're going to be doing in-depth statistical analysis, you'd better get ready to flex your coding muscles one way or the other.
"In this world, if you're going to do these analyses, whether you're using R or Python, you're going to have to sharpen your programming skills," Price said.