Suppose we have a data set with millions of rows and thousands of columns, and the task is binary classification. When we run a logistic regression model, the performance is much better than expected, e.g., almost perfect classification.

We suspect there are some cheating variables in the data. How can we quickly detect them?

Here a cheating variable means a variable that is highly indicative of the response but that we should not use. For example, using whether a person made a customer service call to predict whether that person purchased a product.

Your idea of a "cheating variable" here seems similar to the "correlation = causation" fallacy (or perhaps "postdiction"). But your proposal seems more like "How do I determine whether one predictor dominates the outcome?". This could be useful to flag variables for further QA/QC, but it is not determinative. In your ending example, a timestamp on the data would be determinative (i.e. the call follows the purchase, so it is a "cheat" because the causal arrow points backwards). – GeoMatt22 7 hours ago
    
@GeoMatt22 thanks for your comment, I admit the definition is not clear. I am also wondering whether the definition should cover a linear combination of variables rather than a single variable, and how strong the association needs to be to count as "cheating". – hxd1011 7 hours ago
    
I think that "strong association" cannot be used to infer "causal vs. cheat" in a purely logical manner. However in a Bayesian sense, the "too good to be true" prior does not seem worthless. But I am not sure how to formalize this :) (In a particular domain, I guess you could accumulate a "prior causal $R^2$ PDF"?) – GeoMatt22 7 hours ago
    
process note: you may wish to delay before posting an answer, to encourage feedback. (Or post a note about your intent, as I did here.) – GeoMatt22 7 hours ago
    
This is an interesting question: in practice, it's actually an important one. For example, if you set up a Kaggle competition and include a "cheating variable", you may greatly underestimate the difficulty of predicting the outcome based on the competition results. In theory, it's not an important question: the real question is whether the variables will be available when it comes time for prediction. If they are available, use the cheating variable! But this is an interesting idea for double-checking datasets before putting them on Kaggle, for example. – Cliff AB 6 hours ago

This is sometimes referred to as "Data Leakage." There's a nice paper on this here:

Leakage in Data Mining: Formulation, Detection, and Avoidance

The above paper has plenty of amusing (and horrifying) examples of data leakage, for example a cancer prediction competition where it turned out that patient ID numbers were a near-perfect predictor of future cancer, unintentionally, because of how groups were formed throughout the study.
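To make that failure mode concrete, here is a toy simulation (my own sketch, not an example from the paper): if patient IDs were assigned batch by batch and the batches differ in outcome rate, a simple table of outcome rate by ID bin already exposes the leak.

set.seed(1)
patient_id <- 1:1000                                                 # IDs assigned sequentially, batch by batch
cancer     <- rbinom(1000, 1, ifelse(patient_id <= 500, 0.1, 0.6))   # second batch has a much higher outcome rate
tapply(cancer, cut(patient_id, 5), mean)                             # outcome rate per ID bin; a strong trend flags leakage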

I don't think there's a clear-cut way of identifying data leakage. The above paper has some suggestions, but in general it's very problem specific. As a first pass, you can certainly look at the correlations between your features and the target. However, sometimes you'll miss things. For example, imagine you're building a spam-bot detector for a website like Stack Exchange: in addition to collecting features like message length, content, etc., you could potentially collect information on whether a message was flagged by another user. Spam bots would naturally accumulate a ton of user-generated flags, so your classifier might come to rely on those flags rather than on the content of the messages. But if you want your detector to act as fast as possible, it cannot rely on user-generated flags, so you should consider removing flags as a feature; the goal is to tag bots faster than the crowd-sourced flagging effort, i.e. before a wide audience has been exposed to their messages.
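As a rough sketch of that correlation check (my addition, using assumed names: a data frame d of numeric features plus a 0/1 response column y), you can rank every feature by how well it alone separates the two classes, e.g. by its single-feature AUC, and eyeball the top of the list; anything suspiciously close to 1 deserves a closer look.

auc_one <- function(x, y) {                     # AUC of one numeric feature x against a 0/1 response y
  r  <- rank(x)                                 # Mann-Whitney / rank-sum formulation of the AUC
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
feature_aucs <- sapply(setdiff(names(d), "y"), function(v) auc_one(d[[v]], d$y))
head(sort(pmax(feature_aucs, 1 - feature_aucs), decreasing = TRUE))   # direction-agnostic; values near 1 are suspicious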

Other times, you'll have a very silly feature driving your detector. There's a nice anecdote here about how the Army tried to build a tank detector that had near-perfect accuracy but ended up detecting cloudy days instead, because all the training images with tanks were taken on a cloudy day and every training image without tanks was taken on a clear day. A very relevant paper on this is "Why Should I Trust You?": Explaining the Predictions of Any Classifier – Ribeiro et al.

    
+1 thanks for answering my not-so-well-defined question! Now I know the term and will read their formulation. – hxd1011 6 hours ago
    
On the last paragraph: the army apparently took that lesson to heart! – GeoMatt22 6 hours ago
+1 The most common leak I run into is a leak from the future. This is why I'm not too worried about the new spate of "Just run your data through our algorithm and it'll make a model just like a Data Scientist!" products. Machine learning follows the signal, even if the signal is a leak. – Wayne 3 hours ago

One way of detecting cheating variables is to build a tree model and look at the first few splits. Here is a simulated example.

set.seed(42)                                     # for reproducibility
cheating_variable <- runif(1e3)                  # leaky feature: it is exactly the response probability
x <- matrix(runif(1e5), nrow = 1e3)              # 100 pure-noise features
y <- rbinom(1e3, 1, cheating_variable)           # binary response driven by the cheating variable

d <- data.frame(x = cbind(x, cheating_variable), y = factor(y))

library(rpart)
library(partykit)
tree_fit <- rpart(y ~ ., data = d)               # classification tree, since y is a factor
plot(as.party(tree_fit))

(Plot of the fitted tree, which splits on cheating_variable first.)
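Besides eyeballing the plot, a quick numeric check on the same fit is rpart's variable importance; a single feature dwarfing all the others is the red flag:

head(sort(tree_fit$variable.importance, decreasing = TRUE))   # cheating_variable should dominate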
