In 2006, Netflix released millions of film ratings generated by thousands of its users and challenged the public to create a more accurate predictive model for its site. As an incentive, the DVD rental company dangled a $1 million carrot for the first team that could improve the accuracy of its Cinematch software by at least 10%. The prize money may seem steep, but the contest attracted 50,000 entrants from around the world over a three-year period, including statisticians and computer scientists.
Competitions that use crowd-sourcing -- or what’s known as prize economics -- are not new. The X Prize Foundation has been hosting competitions on exploration, education and life sciences innovation since the mid-1990s. Just a few years ago, the federal government launched a website called Challenge.gov, which lists contests on everything from solar power to nutrition, and welcomes participants to test their skills.
It’s not even a new concept for analytics: Colleges and universities have participated in data-mining competitions for years through events like the KDD Cup, which dates back to 1997. Today, though, data-mining competitions are no longer confined to an academic setting and could become a model for companies seeking a low-cost way to bring analytics into the business.
Crowd-sourcing competitions lead to bigger talent pool
In May, a contest called “Mapping Dark Matter,” supported by NASA and the Royal Astronomical Society, saw Martin O’Leary take the front-runner position just 10 days into an 86-day competition. O’Leary, though, isn’t a statistician or a data scientist. Instead, he’s a glaciologist and uses satellite images to try to pinpoint the edges of glaciers.
O’Leary wound up placing fourth in the competition, but his run at the analytics problem is a significant example of how competitions like this one open up the talent pool.
“The advantage is that you pull in the best and brightest for people who have the skill set along with others who may have the capability but wouldn’t fit the typical mold of a data scientist you would hire in your company,” said Rita Sallam, an analyst with Stamford, Conn.-based Gartner Inc.
“Mapping Dark Matter” appeared on the government-sponsored website Challenge.gov, but it also appeared on another site -- Kaggle.
Kaggle Inc., which launched last year and recently completed an $11 million round of investment funding, is a platform for data-mining problems sponsored by organizations. Scientists, analysts, students and data hobbyists troll through the dilemmas and try to solve them, spurred on by competition, bragging rights and financial reward.
To help the competitive element along, Kaggle created a leaderboard, which gets updated in real time so participants can see how they stack up against the competition.
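The mechanics of such a leaderboard can be sketched in a few lines of Python. This is only an illustration, not Kaggle's actual implementation: the `Leaderboard` class, its method names, the team names and the scores are all hypothetical, and the sketch assumes the contest is scored by an error metric where lower is better.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Leaderboard:
    """Toy leaderboard (hypothetical): keeps each team's best, i.e. lowest, error."""

    best: dict[str, float] = field(default_factory=dict)

    def submit(self, team: str, error: float) -> None:
        # Record the submission only if it beats the team's previous best.
        if team not in self.best or error < self.best[team]:
            self.best[team] = error

    def standings(self) -> list[tuple[int, str, float]]:
        # Sort by error (ascending), then by name so ties come out in a stable order.
        ranked = sorted(self.best.items(), key=lambda kv: (kv[1], kv[0]))
        return [(pos, team, err) for pos, (team, err) in enumerate(ranked, start=1)]

    def rank_of(self, team: str) -> int:
        # Look up a single team's current position on the board.
        for pos, name, _ in self.standings():
            if name == team:
                return pos
        raise KeyError(team)


# Made-up scores: every new submission immediately reshuffles the standings.
board = Leaderboard()
board.submit("sponsor_benchmark", 0.95)
board.submit("team_a", 0.92)
board.submit("team_b", 0.90)
board.submit("team_a", 0.88)  # team_a improves and moves ahead of team_b
```

Entering the sponsor's own model as just another row, as in this sketch, is one simple way to show how competitors stack up against an internal benchmark.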
“We’re taking the [Netflix] concept and we’re popularizing it,” said Anthony Goldbloom, founder and CEO of Kaggle, which has more than 16,000 competitors signed up on the site. “Our tagline is that we make data science into a sport.”
According to Goldbloom, the site is garnering interest from government agencies, health care organizations and businesses alike. In fact, Kaggle is now hosting a contest called the “Heritage Health Prize Competition,” which provides participants with historical claims data to build an algorithm predicting hospital admissions by 2013. Although prize money varies from contest to contest based on how much the organization is willing to wager, the individual or team with the most accurate predictive model to hit or exceed this sponsor’s benchmark could walk away with $3 million in prize money.
Before betting it all, consider this
Businesses interested in sponsoring a data-mining competition on a site like Kaggle should first evaluate whether an analytical algorithm is needed for an aspect of the core business, and then whether they lack the skills in-house to build the predictive model, according to Sallam.
“If you determine this type of capability is critical enough to your competitive advantage … this seems like a relatively low-cost approach to leverage the best and brightest,” she said.
That kind of thinking went into Allstate Corp.’s “Claim Prediction Challenge,” which it launched on Kaggle in July for a $10,000 prize. The Fortune 100 company sought to refine its algorithm for predicting claims on bodily injury liability insurance, which provides medical and legal coverage when a driver causes an accident that injures another person.
“To maximize our benefit, we posed a question and submitted a model we were already interested in updating,” said Stephanie Sheppard, an Allstate spokeswoman.
For the competition data set, Allstate used data from 2005 to 2007, but stripped out identifying characteristics of policyholders and included only generalized characteristics of insured vehicles and bodily injury losses.
“No personal information was provided for the competition,” Sheppard said, “and we obscured our make and model information to keep the original data private.”
Sallam recommends businesses review a site’s intellectual property policy, specifically on ownership of the model.
“Once the winner is chosen, who owns the model?” she asked. “That could potentially be problematic for a business like Netflix, for example, where the algorithms become critical to their competitive advantage. You wouldn’t want someone owning it and then turning it around and selling it to a competitor.”
On Kaggle, the prize money essentially pays for ownership of the model. In Allstate’s case, the insurance company awarded a total of $10,000 to the top three performers, with the first-place winner -- Matthew Carle from Sydney, Australia -- producing a model “that was 340% more accurate than Allstate’s existing method for predicting claims based on vehicle characteristics,” according to Kaggle’s website. (Allstate’s official internal benchmark placed 53rd in the competition.)
Sheppard described the winning model as having “impressive accuracy” and said the company is still in the process of reviewing all three top-performing entries with plans to use insights gained to complement the company’s predictive modeling techniques.