This article originally appeared on the
John Maindonald has been around the analytics block many times. In 45 years as a statistician, John has analyzed data using slide rules, mechanical calculators, electric calculators, mainframe, mini, and desktop computers – and now his trusty notebook. He has seen statistical methods and the software that deploys them change dramatically over the years. Along the way, John has accumulated quite a bit of knowledge – and now shares his wisdom with both students and the open statistical community through his current position at the Center for Mathematics and Its Applications of the Australian National University.
I first “met” John through his role as a well-regarded documenter and training materials contributor to the R Project for Statistical Computing. Several years ago, a colleague asked me to recommend introductory documentation on R. Of course, I noted An Introduction to R by Venables and Smith, and Peter Dalgaard's Introductory Statistics with R. And I can't omit Modern Applied Statistics with S by Venables and Ripley, though it's far from introductory material. At the same time, I found John's excellent beginning document: Using R for Data Analysis and Graphics - Introduction, Examples and Commentary, and have been using it as a trusty reference ever since.
Impressed with the pertinence of John's writing for business, I then purchased the book he coauthored with John Braun: Data Analysis and Graphics Using R – An Example-Based Approach, First Edition. I loved the book's focus on applied statistical methods supported by complete R programs. Then, several months ago, I noticed the Second Edition of Data Analysis and Graphics Using R while browsing Amazon. A search through the Table of Contents convinced me that there was enough new material to justify re-purchase. (There should be a law mandating stats books be sold by subscription, with updates included in the price!) I wasn't disappointed. The rewrite is an even more comprehensive compendium of statistical goodies and accompanying R code for business intelligence (BI). Those analysts looking for pretty complete applied coverage of regression techniques and linear models, as well as multivariate analysis and tree-based classification methods, can stop here.
Though his primary subject areas have been biology, ecology, and medicine/healthcare, the graphical techniques and statistical methods John deploys are just as relevant for business inquiry. Now in semi-retirement, he is tackling several areas of major consequence for BI analysts, including the convergence of traditional statistics with data mining/machine learning, and the benefits of randomization and experimental techniques to prove cause and effect of interventions. Sounds suspiciously like a how-to-do-it version of Super Crunchers to me!
I caught up with John over the holidays between semesters of teaching, and he gave generously of his time for BeyeNETWORK.com readers. I hope you enjoy John's wisdom.
QUESTION: You are currently on the faculty of the Center for Mathematics and Its Applications of the Australian National University (ANU). Could you give our readers a bit of background on your career as a statistician and academic? How did you evolve to your applied or example-based approach to teaching statistical methods?
ANSWER: I am semi-retired, with what is these days an honorary position at ANU, primarily occupied with running short courses and writing. I relish the freedom to do what I think important and interesting.
After a first class degree in Mathematics at New Zealand’s Auckland University in 1959, it took me a while to settle on a career path. From 1963 to 1970, I lectured, successively, at Murray College (Sialkot, Pakistan), Auckland University (NZ), Manchester University (UK), and Sheffield Polytechnic (UK). At the same time, I pursued a developing interest in data analysis and in its role in scientific investigation.
Since 1971, I have worked as a mathematical and statistical problem solver and consultant. Initially, I worked mostly with biologists, then broadening out to work with industry, with medical researchers and, most recently, with climate researchers. In 1984, Wiley published my book Statistical Computation, at the time one of a small number of books in that area that were accessible to senior undergraduates.
I moved to Australia in 1996, taking up a position at the Australian National University (ANU) in 1998. I have relished the stimulus and intellectual challenge, bringing involvement with biologists, ecologists, public health researchers, molecular biologists, demographers, computer scientists, numerical analysts, machine learners, an economic historian and forensic linguists, as well as a lively group of statisticians.
QUESTION: Statistician Rudolf Beran has offered a modern definition of statistics as “the study of algorithms for data analysis.” He notes three different competing interests that characterize statistics: 1) the use of probability models (and randomization) to analyze behavior; 2) computationally effective statistical algorithms without concern for a probability model; and 3) data analyses, generally without randomization. To my probably naive thinking, these sound suspiciously like: 1) the traditional probability and statistics we learned in school; 2) machine learning for knowledge discovery; and, 3) exploratory data analysis. Historically, business intelligence focus has centered on the third – exploratory data analysis. Now, however, the first two items are much more in play in the business world – and growing. Your thoughts?
ANSWER: I take issue with Beran’s definition of statistics. Algorithms are useless on their own. In any data analysis, whether it is formal analysis or exploratory investigation, a key issue is how widely the results apply. Algorithms give no indication how or if results might generalize. That requires something like a probability model. Is the new group of customers, maybe in a different part of the country or maybe in a different country, sufficiently similar to the customers on which there are data that generalization is possible? A generalization from customers in New York to customers in Beijing may be highly hazardous. In a fast-changing industry – IT or energy industry – generalization a year or two into the future may be hazardous.
The language of statistics, or equivalent language, is essential in thinking about generalization. What is the source population from which the data come? What is the target population, to which results will be applied? Are they similar in any way? Is it possible to use a mathematical model to describe the differences?
Regarding the historical BI focus, the teaching of business students has focused on the use of Excel spreadsheets and on the graphs that one can do using Excel. That is very limiting, and often makes it difficult to do a really good job of data exploration.
QUESTION: In his well-received book Super Crunchers, Yale economist Ian Ayres notes the predictive superiority of analytics over experts in many disciplines, observing that “Unlike self-involved experts, statistical regressions don't have egos or feelings.” What are your experience and thoughts pertaining to experts versus analytics in academic and business worlds?
ANSWER: Ayres has given an excellent survey, and I strongly recommend it. Ayres does not come out quite so firmly in favor of equations as the question might imply. Think, for example, of piloting an aircraft. Most of the time, it is safest to leave control pretty much to the autopilot. But what if part of the tail has fallen off? Can the automatic guidance system detect this and still cope? Or should the pilot now exercise greater control? There must be “escape hatches” that allow the expert to take control when the expert knows something to which the equation is blind.
Each case must moreover be judged on its own merits. Sometimes the equation will do better, sometimes the expert will do better. Often, best of all may be an effective marriage between equation and expert.
There are other caveats. Regression modelers bring to their task varying levels of expertise, subject-area understanding and dedication. Experts need to regularly update their skills. Equations may, likewise, need regular updating. If you want to make anything of the regression coefficients or other parameters, there are other issues too big to capture in a sentence or two.
Note also that rigid use of an equation may make it easier for competitors to second guess business decisions. Will that matter?
QUESTION: Business intelligence (BI) can be defined as the use of data, technology, methods, and analytics to measure and improve the performance of business processes. In addition to promoting analytics, Super Crunchers points out that more and more companies are using experimental methods with randomization as a basis for BI initiatives, both to test their strategies and learn from findings. Could you comment on the benefits of randomized testing as a learning approach?
ANSWER: Randomized experiments, when it is possible to do them, are the only sure way to check the claimed efficacy of policies or decisions. The process of designing and carrying out and analyzing a randomization experiment can be highly instructive and enlightening.
A major thrust of the Ayres book is that ideas and methods that in the past were pretty much exclusively used by experimental scientists or those running drug trials are being applied by savvy business managers. Internet-based business is a fertile area for randomized experiments. What is the best strategy for keeping customers who’ve had a bad experience – a lost bag or an unpleasant flight experience? Customers that the company might otherwise lose can be randomly assigned to one of two or more damage minimization strategies. The applications are endless.
Incidentally, the use of randomized experiments long predates drug trials. R. A. Fisher pioneered their use in crop experimentation in the 1920s. Applications in industrial design and process control go back to the 1930s.
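To make the idea of random assignment concrete, here is a minimal sketch in Python. It is an illustration only, not something from the interview: the customer IDs, the two strategy names, and the function `assign_strategies` are all hypothetical. Customers are shuffled with a seeded random number generator and then dealt out to the strategies in turn, so group sizes differ by at most one.

```python
import random

def assign_strategies(customer_ids, strategies, seed=0):
    """Randomly assign each customer to one strategy, in near-equal group sizes."""
    rng = random.Random(seed)      # fixed seed so the assignment can be reproduced
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)          # random order removes any systematic pattern
    # Deal customers out to the strategies in turn, like dealing cards.
    return {cid: strategies[i % len(strategies)] for i, cid in enumerate(shuffled)}

# Hypothetical customers who have had a bad experience.
customers = [f"C{n:03d}" for n in range(1, 11)]
assignment = assign_strategies(customers, ["voucher", "apology_call"])

for cid in sorted(assignment):
    print(cid, assignment[cid])
```

Because chance alone decides who gets which strategy, a later difference in retention between the groups can be attributed to the strategies rather than to how customers were selected.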
QUESTION: Evaluating the effectiveness of a company's strategy, called performance management in the business world, is a central application of business intelligence. The steps in performance management involve:
- Articulating the business strategy as a web of related causal hypotheses of the form: If we do A, the better will be the company measure of B or, alternatively, the less we do of C, the higher the measure of D. An example might be: If we invest in a new customer-focused BI initiative, we will decrease customer defections to competitors and increase customer profitability.
- Operationalizing measures for A and B using company data and technology.
- Testing/modifying the hypotheses based on design, models, and analytics.
Is this a reasonable template to help improve business processes from a statistical perspective?
ANSWER: In principle, yes. Think carefully, though, about the implications of one or the other choice of measure. Beyond this, the issues are too big to discuss further here, and I have no specialist experience in this area.
QUESTION: You have written a lot on the convergence of statistics and data mining. Could you summarize your current thinking for us? Arnold Goodman of the University of California, Irvine Center for Statistical Consulting, says: “Knowledge Discovery rests on the three balanced legs of computer science, statistics and client knowledge … Successful knowledge discovery needs a substantial commitment to collaboration from all three.” Do you agree? Is there friction between the “finesse” of statisticians and the “muscle” of data miners?
ANSWER: I do indeed agree with Goodman’s statement. I don’t think there is much friction. Rather, those who call themselves data miners and those who call themselves statisticians tend to each go their own way. That is a pity because input of all the relevant skills will yield the best result.
There is a lot of data mining activity that can be seen as an attempt to reinvent statistics! Much as in the beginnings of the widespread teaching of statistics in the universities, most data mining has used very simplistic statistical assumptions – independent sampling, often normality, and identity of the source and target populations. Predominantly, it is statisticians who have pioneered the use of the new computer age methods – computer simulation and related computer-intensive methods that can often handle situations where the theory breaks down or needs that kind of help. Data miners and machine learners have contributed to the development of tree-based methods and neural nets. Somewhat in contrast to data mining, a large part of the machine learning community does pay serious attention both to theoretical statistical insights and to the help that can be gained from computer-intensive methods.
One computer-intensive method that data miners use a lot is cross-validation. This is a clever way of using all the data, but still maintaining the distinction between training and test data. The cross-validation accuracy estimate is valid on the assumption that the target population is identical with the source population, and sampled in the same way.
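The mechanics of cross-validation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by any particular package: the data are synthetic, and the "model" is a deliberately simple nearest-class-mean rule. The function names are hypothetical. Every observation is used for testing exactly once, and the model that scores it is always fitted without it.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_class_means(xs, ys):
    """Toy 'model': the mean of x within each class (labels 0 and 1)."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def cross_validated_accuracy(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, test on the held-out fold, and pool the results."""
    folds = k_fold_indices(len(xs), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        means = fit_class_means([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
        correct += sum(predict(means, xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)

# Synthetic data, invented for illustration: two overlapping groups.
rng = random.Random(1)
ys = [0] * 50 + [1] * 50
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

acc = cross_validated_accuracy(xs, ys)
print(f"5-fold cross-validated accuracy: {acc:.2f}")
```

The estimate describes performance on new cases drawn in the same way as these 100. It says nothing about a target population that differs from the source.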
QUESTION: You define statistics as: “The science of collecting, analyzing, and presenting data.” Data mining has been defined as: “Statistics at speed, scale, and simplicity.” Do you see data mining as extending, scaling, and automating statistical analysis?
ANSWER: Yet another definition of statistics! I don’t know whether data mining extends statistical analysis. Thoroughgoing automation is hard, which is why artificial intelligence failed to deliver on the high promises that it made in the 1960s. My take is that one has to start by building a really good statistical system, at least as good as R and probably a lot better. It really is necessary to get the details right. For example, how does one account for the fact that the demands of customers in the same country neighborhood are likely to be more similar than the demands of those in different neighborhoods? Likewise, demands in the same month or week will be more similar than demands separated by 12 months. What are the prior probabilities? Those sorts of issues, and more, have to be addressed before it is realistically possible to think of building an expert system that has wide-ranging abilities.
That said, there are certain types of classification problem where it does make sense to assume a very simple random sampling model, and a highly automated analysis is possible. These problems have been the staple of data mining.
Classification trees have proved very successful. The best software, such as Therneau and Atkinson’s rpart package for R, has a built-in use of cross-validation that gives realistic estimates of predictive accuracy, albeit for the population from which the data have been sampled.
Note incidentally that Breiman and Cutler’s random forests, implemented in R’s randomForest package, can often improve substantially on individual classification trees. I’d like to see a properly validated head-on comparison, extending over datasets with diverse characteristics, between random forests, support vector machines and neural nets.
The really significant inroads of data mining may be in working with new types of data, and/or with data that can now be collected automatically. I have in mind text and image mining, perhaps collecting data from the Internet. Huge databases are now being populated by automatic monitoring equipment. Data mining has been most successful where the key challenge is to process the data into a form where it can be analyzed by a relatively automated use of classification or regression methods.
QUESTION: Statisticians who plan investigations are typically very concerned with adequate sample size. In business, predictive modelers often enjoy the luxury of hundreds of thousands or even millions of cases to work from – but often without random sampling or assignment to groups. How does sample richness change analyses for business? What cautions would you offer to business modelers who have millions of records to analyze? Which would you prefer as an investigator looking to “prove” cause and effect: a small sample, random assignment, or a large sample, natural assignment?
ANSWER: Small sample, random assignment is often best. The assignment is then, on average, the only difference between the two or more “treatments” that can affect the result.
There are other important issues. Data are rarely a single homogeneous collection. There may be multiple cases for the one customer, and multiple customers from similar locations and times. In other words, the data have an internal structure. Whether assignment is random or natural, the analyst has to account for this structure. For random assignment, what is the unit of randomization – transaction, or customer, or…? Data accuracy may be an issue.
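The question about the unit of randomization can be illustrated with a short Python sketch. It assumes, purely for illustration, that the customer is the chosen unit. The transaction log, the treatment labels, and the function `randomize_by_customer` are hypothetical. Treatments are assigned once per customer, and every transaction inherits its customer's assignment, so repeat transactions from one customer never end up in different groups.

```python
import random

# Hypothetical transaction log: (transaction_id, customer_id).
transactions = [
    ("T1", "alice"), ("T2", "alice"), ("T3", "bob"),
    ("T4", "carol"), ("T5", "carol"), ("T6", "carol"),
    ("T7", "dave"), ("T8", "bob"),
]

def randomize_by_customer(transactions, treatments, seed=0):
    """Assign treatments at the customer level, then map them onto transactions."""
    rng = random.Random(seed)
    customers = sorted({cust for _, cust in transactions})  # one entry per customer
    rng.shuffle(customers)
    by_customer = {c: treatments[i % len(treatments)]
                   for i, c in enumerate(customers)}
    by_transaction = {tid: by_customer[cust] for tid, cust in transactions}
    return by_transaction, by_customer

by_transaction, by_customer = randomize_by_customer(transactions, ["A", "B"])
for tid, cust in transactions:
    print(tid, cust, by_transaction[tid])
```

Any analysis of the results would then need to treat the customer, not the individual transaction, as the independent unit.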
QUESTION: You are a hands-on statistician, with a focus on statistical computing and graphics. Could you tell us of the evolution of statistical computing platforms over the course of your career? Have the platforms themselves helped shape your approaches to statistical analyses?
ANSWER: The changes have been spectacular. I’ve had experience of hand arithmetic, slide rule, hand calculator, electro-mechanical calculator, a variety of electronic calculators, several different mainframe computers, mini-computers, many different desktop computers and now, of course, laptops. The advances that finally matter have been in software, with the hardware important because of the sophistication of the software that it can now support.
The effect on my approach to data analysis has been spectacular. I can now do checks that three decades ago seemed a luxury. Clever fudges that were formerly needed to handle data structure properly are now mostly unnecessary. I have access to graphical abilities that, relative to what was available even ten years ago, seem out of this world. It is a crying shame that so many analysts seem stuck with Excel’s limited and inadequate graphical tools.
QUESTION: Your books, tutorials, and training materials are held in the highest regard in the R open source community. Can you comment on the impact of R and the open source model on the statistical world over the last ten years and going forward? Will R be adapted to handle the speed and size demands of business data mining applications?
ANSWER: Thank you for the compliment. The R system is in some ways a better example than Linux of the success of open source. The overwhelming majority of the world’s leading statistical computing professionals are now involved, one way or another, with it. In its scope, power and overwhelming acceptance among statistical professionals, it has no serious competitors. It is the development environment of choice for new methodology.
It is finding its way into many different application areas – including, in particular, financial mathematics. It is fostering dialog between specialists from different areas of statistical application. The R community and R itself are proving marvelous vehicles for cross-fertilization of methods and ideas between different areas of specialization.
Speed, size and other capabilities are continually improved, thanks to the efforts of the R Core Group and others who chip in. Improvements since around 2000, when R became a serious tool for professional use, have been spectacular. With 2GB of memory, a regression with 500,000 cases and 100 variables should be possible. Note that data structure is, typically, an even more important issue for large data sets than for small data sets. Additionally, it may actually be most useful to do repeated smaller analyses with subsets of the total data. This is too large a subject to pursue further here.
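A back-of-envelope calculation shows why the 500,000-case figure is plausible. The sketch below assumes each value is stored as an 8-byte double-precision number, which is how R stores numeric data, and treats 2GB as 2 × 10^9 bytes. It counts only the raw data matrix. How many working copies a given model fit needs is not estimated here.

```python
n_cases = 500_000
n_vars = 100
bytes_per_value = 8                  # assumption: double-precision storage
memory_bytes = 2 * 10**9             # the 2GB quoted above, in decimal units

matrix_bytes = n_cases * n_vars * bytes_per_value
copies_that_fit = memory_bytes // matrix_bytes

print(f"One copy of the data matrix: {matrix_bytes / 1e6:.0f} MB")
print(f"Full copies that fit in 2 GB: {copies_that_fit}")
```

One copy takes about 400 MB, so roughly five copies fit in memory. That leaves some room for the intermediate copies a fitting routine makes, but not a great deal.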
QUESTION: Your Data Analysis and Graphics Using R book is excellent reading for those seeking to learn how to correctly apply statistical modeling techniques to business problems. Many statisticians scold the academic and business worlds for worst practices in statistics and knowledge discovery. Do you think the prevalence of bad statistical analysis is growing? If so, is it related to the ease of using (and abusing) predictive modeling/data mining software? What can/should be done?
ANSWER: Any community that is serious about standards will be monitoring the quality of the analysis and reporting of statistical results. Such monitoring is unusual. The R system removes any excuse there might have been for bad analyses. The demands for data analysis are, however, increasing much faster than the availability of well trained data analysts. The answer lies in training more and better statisticians, in extending crucial parts of that training to application area specialists, and in regular review of published statistical analyses.
- John Maindonald and John Braun. Data Analysis and Graphics Using R – An Example-Based Approach. Cambridge University Press. Second Edition. 2007.
- Peter Dalgaard. Introductory Statistics with R. Springer-Verlag. 2002.
- W.N. Venables and B.D. Ripley. Modern Applied Statistics with S. Springer. Fourth Edition. 2002.
- Ian Ayres. Super Crunchers – Why Thinking-By-Numbers is the New Way to Be Smart. Bantam Books. 2007.