Claudia Perlich – m6d Chief Scientist – Wins Prestigious KDD Award

We are proud to announce that our own Chief Scientist, Claudia Perlich, has once again taken home a coveted prize at this year’s ACM KDD conference. Her paper, “Leakage in Data Mining: Formulation, Detection and Avoidance,” co-authored with two exceptional statisticians at Tel Aviv University, won the best paper award at the 2011 KDD conference. The competition for this award was extraordinary: over 700 papers from many of the leading machine learning experts and data scientists worldwide.

Claudia is no stranger to winning contests at KDD, which is one of the world’s top data mining conferences, attended by both academia and top industry players (like Google, Yahoo, Microsoft, and now m6d!). She has won its annual data mining competition three times in the past, and now sits on the committee that administers the competition. Her current “best paper” is related to her experience winning these competitions. She and her colleagues offer a formal analysis of a common pitfall in data mining and statistical analysis called “leakage.” According to the paper, “leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.” In other words, information related to the outcome you are trying to predict has “leaked” into the data you are using to make the prediction. A simple example illustrates the idea:

I am tasked with predicting which prospects in a given pool will purchase a product online after being shown an ad for the product. As the modeler, I pull all recent ad impressions from the data, and I use publisher, time of day and last site visited as my predictors. I also pull in who has and hasn’t purchased, and who has and hasn’t visited the checkout page. Now, for those who purchased, the last site visited was the checkout page of the product being purchased. If this were in my set of predictors, it would get a very high weight in my model, yet in practice the model would be useless. It is not feasible to target someone based on a checkout-page visit, because that visit is a step that, by design, always precedes a purchase, and it happens after the ad is shown, so it is not available at the moment I have to decide whom to target.
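To make the example concrete, here is a minimal Python sketch of it. Everything in it is an assumption made for illustration: the data is synthetic, the field names (publisher, hour, last_site, purchased) are hypothetical, and the check is a crude heuristic of my own, not the detection method from the paper. It simply asks whether any single value of a predictor isolates the buyers almost perfectly, which is a hint that the predictor was recorded after the outcome.

```python
import random


def make_impressions(n=1000, seed=0):
    """Build synthetic ad-impression records (all field names are hypothetical)."""
    rng = random.Random(seed)
    publishers = ["news", "sports", "weather"]
    sites = ["search", "social", "blog"]
    rows = []
    for _ in range(n):
        purchased = rng.random() < 0.05
        rows.append({
            "publisher": rng.choice(publishers),
            "hour": rng.randrange(24),
            # The leak: last_site is recorded after the outcome, so every
            # buyer shows the checkout page as the last site visited.
            "last_site": "checkout" if purchased else rng.choice(sites),
            "purchased": purchased,
        })
    return rows


def best_value_stats(rows, feature, target="purchased"):
    """Return (value, precision, recall) for the feature value that best isolates the target."""
    total_pos = sum(r[target] for r in rows)
    counts = {}  # feature value -> (rows seen, positives seen)
    for r in rows:
        seen, pos = counts.get(r[feature], (0, 0))
        counts[r[feature]] = (seen + 1, pos + r[target])
    # Pick the value with the highest purchase rate; break ties by positives covered.
    value, (seen, pos) = max(
        counts.items(), key=lambda kv: (kv[1][1] / kv[1][0], kv[1][1])
    )
    precision = pos / seen
    recall = (pos / total_pos) if total_pos else 0.0
    return value, precision, recall


def suspicious_features(rows, features, threshold=0.95):
    """Flag features where one value captures nearly all buyers and almost no one else."""
    flagged = []
    for feature in features:
        _, precision, recall = best_value_stats(rows, feature)
        if precision >= threshold and recall >= threshold:
            flagged.append(feature)
    return flagged


if __name__ == "__main__":
    rows = make_impressions()
    print(suspicious_features(rows, ["publisher", "hour", "last_site"]))
```

On this synthetic data the check flags only last_site. A flag like this is a prompt to ask when the field was recorded relative to the ad impression; it does not prove leakage on its own, and subtler leaks will not show up this way.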

The above is a somewhat trivial example, but leakage is not a trivial problem. As the paper points out, this problem has occurred in many data mining competitions designed by highly qualified statisticians. Kudos to Claudia and team for discovering the issue in these competitions and calling attention to the persistence of the problem. It is refreshing to read a paper that offers practical guidance on such a subtle but model-breaking misapplication of proper modeling methodology.

Reflecting on how this relates to our work at m6d, I am thrilled to have such creative and intuitive colleagues. We face new modeling challenges all the time, especially in such a fast-moving and vast ecosystem. Every algorithm starts with a team of people trying to solve a problem, and oftentimes these problems are so new that no textbook provides a how-to guide for designing an optimal solution. In these situations, the algorithms are only as good as the statistical craftsmanship of the people who designed and planned them. In our case, Claudia is one of the finest craftswomen in the field of data modeling and analysis, and so we are very fortunate to have her on our team. And to our customers … you need never fear that leakage will corrupt your next campaign’s performance!