We are proud to announce that our own Chief Scientist, Claudia Perlich, has once again taken home a coveted prize at this year’s ACM KDD conference. Her paper, “Leakage in Data Mining: Formulation, Detection and Avoidance,” co-authored with two exceptional statisticians at Tel Aviv University, won the best paper award at KDD 2011. The competition for this award was extraordinary: over 700 papers from many of the leading machine learning experts and data scientists worldwide.
Claudia is no stranger to winning contests at KDD, one of the world’s top data mining conferences, attended by both academia and top industry players (like Google, Yahoo, Microsoft, and now m6d!). She has won its annual data mining competition three times in the past, and now sits on the committee that administers the competition. Her award-winning paper draws on her experience winning these competitions. She and her colleagues offer a formal analysis of a common pitfall in data mining and statistical analysis called “leakage.” According to the paper, “leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.” In other words, information related to the outcome you are trying to predict has “leaked” into the data you are using to make the prediction. A simple example illustrates this:
I am tasked with predicting which prospects in a given pool will purchase a product online after being shown an ad for it. As the modeler, I pull all recent ad impressions from the data, and I use publisher, time of day, and last site visited as my predictors. I also pull in who has and hasn’t purchased, and who has and hasn’t visited the checkout page. Now, for those who purchased, the last site visited was the checkout page of the product being purchased. With that value in my set of predictors, it would get a very high weight in my model, yet in practice the model would be useless. It is not feasible to target people for being on the checkout page, because visiting the checkout page is the event that, by design, always precedes a purchase.
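The example can be made concrete with a small synthetic sketch. Everything in it is invented for illustration: the site names, the field names `site_before_ad` and `last_site_logged`, and the assumption that every 20th impression converts. It is not taken from the paper or from any m6d system. It compares a "checkout" rule applied to the leaky logged value with the same rule applied to what was actually known when the ad was served.

```python
import random

SITES = ["news", "sports", "weather"]  # hypothetical publisher sites


def make_impressions(n=1000, seed=0):
    """Build synthetic ad-impression rows; all values are invented."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        site_before_ad = rng.choice(SITES)  # known when the ad is served
        purchased = (i % 20 == 0)           # assumption: every 20th impression converts
        rows.append({
            "site_before_ad": site_before_ad,
            # Leak: for buyers, the logged value was overwritten by the
            # checkout page, which only exists because of the purchase.
            "last_site_logged": "checkout" if purchased else site_before_ad,
            "purchased": purchased,
        })
    return rows


def checkout_rule_recall(rows, feature):
    """Share of buyers flagged by the rule `row[feature] == "checkout"`."""
    buyers = [r for r in rows if r["purchased"]]
    flagged = [r for r in buyers if r[feature] == "checkout"]
    return len(flagged) / len(buyers)


rows = make_impressions()
leaky_recall = checkout_rule_recall(rows, "last_site_logged")
honest_recall = checkout_rule_recall(rows, "site_before_ad")
print(f"recall using last_site_logged: {leaky_recall:.2f}")
print(f"recall using site_before_ad:   {honest_recall:.2f}")
```

By construction, the rule finds every buyer when it reads the leaky field and none when it reads the value available at serving time, which is the gap between how the model looks on historical data and how it would perform in a live campaign.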
The above is a somewhat trivial example, but leakage is not a trivial problem. As the paper points out, this problem has occurred in many data mining competitions designed by highly qualified statisticians. Kudos to Claudia and team for discovering the issue in these competitions and calling attention to the persistence of the problem. It is refreshing to read a paper that offers practical guidance on such a subtle, but model-invalidating, misapplication of proper modeling methodology.
Reflecting on how this relates to our work at m6d, I am thrilled to have such creative and intuitive colleagues. We face new modeling challenges all the time, especially in such a vast and fast-moving ecosystem. Every algorithm starts with a team of people trying to solve a problem, and oftentimes these problems are so new that no textbook provides a how-to guide for designing an optimal solution. In these situations, the algorithms are only as good as the statistical craftsmanship of the people who designed and planned them. In our case, Claudia is one of the finest craftswomen in the field of data modeling and analysis, and we are very fortunate to have her on our team. And to our customers … you need never fear that leakage will corrupt your next campaign’s performance!