John Elder, CEO of advanced analytics consultancy Elder Research Inc. in Charlottesville, Va., thinks that building predictive models is more of an art than a science. And, he says, many companies stumble on predictive analytics and data mining initiatives because their predictive modeling techniques don’t take that into account.
SearchBusinessAnalytics.com recently conducted an email interview with Elder on predictive modeling. Elder, who is also the co-author of multiple books on analytics and data mining, detailed best practices for mastering the discipline of modeling and suggested tactics for avoiding common missteps. Excerpts from the interview follow:
What’s the No. 1 mistake companies make with predictive modeling?
Elder: The No. 1 mistake is using up all the data for [testing models] instead of setting some aside and not touching it, from the very beginning, so it can be a true evaluation data set. Reality -- the cases you will see on implementation -- will be exceedingly harsh to your model; it’s best to test how well your model will really do by duplicating that experience of absolutely new data while you know the answers. Oddly, though, when I do assessments of group after group and the mistakes they’ve seen, they rank this as the lowest on their list, when it should be the highest. The implication of over-training and “gentle” testing is that you can do much worse in use than you thought you would, which can be very damaging.
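A minimal sketch of the holdout discipline Elder describes, using only the Python standard library. The synthetic data, the 20% holdout fraction, and the deliberately over-trained "memorizer" model are illustrative assumptions, not anything from the interview; the point is only that a score on data the model has already seen says little about its score on new cases.

```python
import random

def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle once and set aside a holdout set before any modeling starts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predict, rows):
    """Fraction of cases where the prediction matches the known answer."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_memorizer(rows):
    """An over-trained 'model': it memorizes every case it was shown."""
    seen = dict(rows)
    labels = [y for _, y in rows]
    fallback = max(set(labels), key=labels.count)  # most common label
    return lambda x: seen.get(x, fallback)

# Synthetic cases (an assumption for illustration): the label mostly
# follows x > 0.5, with roughly 20% of labels flipped as noise.
rng = random.Random(1)
rows = []
for i in range(500):
    x = rng.random()
    y = (x > 0.5) != (rng.random() < 0.2)
    rows.append(((i, x), y))  # the index i keeps every key unique

working, holdout = split_holdout(rows)
model = fit_memorizer(working)

working_acc = accuracy(model, working)  # "gentle" test on already-seen data
holdout_acc = accuracy(model, holdout)  # test on data the model never touched
print(f"seen data: {working_acc:.2f}, holdout: {holdout_acc:.2f}")
```

The memorizer scores perfectly on the cases it has seen and far worse on the holdout set, which is the gap between "gentle" testing and the experience of absolutely new data.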
You talk about lacking the proper data as being a huge impediment to successful predictive modeling. What are the most common problems or limitations that organizations face regarding the quality or quantity of data in building predictive models?
Elder: Ideally, you’d have plenty of examples of exactly the sort of thing you’re looking for to learn from. You could then build a model to discover what factors predictively separate the good examples from the bad. But often, the most interesting cases -- for example, fraud or other high-cost anomalies -- are exceedingly rare. You have to use something similar or create an inferred risk score from lesser actions and build up a measure of overall risk. You have to work with “found” data rather than “designed” data -- instead of planting the crops you want, you have to make the best salad you can out of the weeds you find. But data mining done right is so powerful that it can still provide very useful prioritization for the best use of that precious resource: your analysts’ time.
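One way to picture an inferred risk score built from lesser actions is a weighted sum of weak indicators that is then used to rank cases for review. The indicator names, weights, and cases below are hypothetical, chosen only to show the shape of the idea; in practice the weights would have to come from the data or from domain experts.

```python
# Hypothetical weak indicators and weights (illustrative values only).
WEIGHTS = {
    "address_mismatch": 2.0,
    "late_payment": 1.0,
    "many_small_claims": 3.0,
}

def risk_score(case):
    """Add up the weights of every indicator that is present on the case."""
    return sum(weight for flag, weight in WEIGHTS.items() if case.get(flag))

# Made-up cases: none is a confirmed fraud, but some show more warning signs.
cases = [
    {"id": "A", "late_payment": True},
    {"id": "B", "address_mismatch": True, "many_small_claims": True},
    {"id": "C"},
    {"id": "D", "address_mismatch": True, "late_payment": True},
]

# Highest inferred risk first, so analysts spend their time where it counts.
ranked = sorted(cases, key=risk_score, reverse=True)
review_queue = [case["id"] for case in ranked]
print(review_queue)
```

Even a crude score like this turns an unordered pile of cases into a prioritized queue, which is the kind of use of analysts' time the answer above points to.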
Using a broad palette of predictive modeling techniques is something you recommend. Why is that important?
Elder: You’ve heard, “To a little boy with a hammer, all the world’s a nail.” The technique shouldn’t be fixed ahead of time but should be the best response to the business need.
You also advocate “listening” to more than just the data. What do you mean by that?
Elder: The great promise of inductive modeling is learning from the data things you may not have thought of before. But also remember that the data is all that the modeling program sees of the world. If the data has a hidden bias or constraint, the machine will think that everything does. A great example is an artificial intelligence program that chugged overnight on a database being entered from the Encyclopaedia Britannica. It was looking for any hidden links or patterns that weren’t obvious from this great summary of common-sense or background knowledge. The result was, essentially, “Hey, everyone born before 1900 was famous!”
What are outliers and why are they so important to the overall quality of predictive models and the analytical outcome?
Elder: Outliers are points that are separated from the herd. They are most often mistakes but can themselves be findings. Either way, they can mess up the patterns of the rest of the data by dominating the conversation. They need to be identified, then dealt with individually. Often, the biggest breakthrough in data mining comes from the unglamorous steps of data cleansing, verification and transformation and less from running the fancy algorithms.
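As a rough illustration of identifying outliers before deciding what to do with them, the sketch below flags values that fall far outside the interquartile range. The 1.5 multiplier is a common rule of thumb rather than anything Elder prescribes, and the transaction amounts are invented.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented amounts: eight ordinary values and one point far from the herd.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 5000]

outliers = flag_outliers(amounts)
kept = [v for v in amounts if v not in outliers]

mean_with = statistics.mean(amounts)  # dominated by the single extreme point
mean_without = statistics.mean(kept)  # reflects the rest of the data
print(outliers, mean_with, mean_without)
```

The code only flags the point; whether it is a data-entry mistake to correct or a finding to investigate is still a case-by-case judgment.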
You say not to read too much into models -- that it can do more harm than good. Can you elaborate on what you mean by that and how to work around it?
Elder: Models can be useful without being explanatory -- and without being easy to interpret. Interpretability is nice but is a weak protection against building a bad model. Our natural impulse is to come up with a story about why the model liked certain variables, rather than to try to break the model we just spent a long time building. But many useful sources of information are usually spread throughout several variables. For example, information about account quality or behavioral risk or reliability often shows up in many correlated ways, so the exact variables the model chooses aren’t as important as they look. And the most accurate models are ensembles: building multiple, competing models and combining their estimates gets the best results.
If “easy to interpret” is made a high goal of the model, then you’ll be throwing away accuracy -- and that is what’s most important, in my opinion.
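The ensemble idea can be sketched in a few lines: several competing models each make an estimate, and the combined estimate is their average. The three toy models and the four test cases below are made up so that their errors partly offset one another. Averaging is not guaranteed to beat the best single model, but it tends to help when the models err in different directions.

```python
def ensemble_predict(models, x):
    """Combine competing models by averaging their estimates."""
    estimates = [model(x) for model in models]
    return sum(estimates) / len(estimates)

def mean_abs_error(predict, cases):
    """Average absolute gap between predictions and known answers."""
    return sum(abs(predict(x) - actual) for x, actual in cases) / len(cases)

# Made-up evaluation cases where the true relationship is actual = 10 * x.
cases = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]

# Three hypothetical models, each wrong in its own way.
models = {
    "runs_high": lambda x: 10 * x + 3,
    "runs_low": lambda x: 10 * x - 4,
    "too_flat": lambda x: 9 * x + 2,
}

errors = {name: mean_abs_error(model, cases) for name, model in models.items()}
errors["ensemble"] = mean_abs_error(
    lambda x: ensemble_predict(list(models.values()), x), cases
)

for name, err in errors.items():
    print(f"{name}: {err:.2f}")
```

In this toy setup the averaged estimate has a lower error than any of the individual models, even though none of the three is easy to "read" as an explanation of the data.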
Beth Stackpole is a freelance writer who has been covering the intersection of technology and business for 25-plus years for a variety of trade and business publications and websites.