I am an economics student with some experience with econometrics and R. Is there ever a situation where we should include a variable in a regression in spite of it not being statistically significant?
Yes! That a coefficient is statistically indistinguishable from zero does not imply that the coefficient actually is zero or that it is irrelevant. That an effect does not pass some arbitrary cutoff for statistical significance does not imply one should not attempt to control for it. Generally speaking, the problem at hand and your research design should guide what to include as regressors.

**Some quick examples** (and do not take this as an exhaustive list; it's not hard to come up with many more):

**1. Fixed effects**

A situation where this often occurs is a regression with fixed effects. Let's say you have panel data and want to estimate $b$ in the model:

$$ y_{it} = b x_{it} + u_i + \epsilon_{it} $$

Estimating this model with ordinary least squares where the $u_i$ are treated as fixed effects is equivalent to running ordinary least squares with an indicator variable for each individual $i$. The point is that the $u_i$ (i.e. the coefficients on the indicator variables) are often poorly estimated: any individual fixed effect $u_i$ is often statistically insignificant. But you still include all the indicator variables in the regression if you are taking account of fixed effects. (A short R sketch of this appears at the end of this answer.)

(Further note that most stats packages won't even give you standard errors for individual fixed effects when you use the built-in methods. You don't really care about the significance of individual fixed effects, but you probably do care about their collective significance.)

**2. Functions that go together**

**(a) Polynomial curve fitting** (hat tip @NickCox in the comments)

If you're fitting a $k$th-degree polynomial to some curve, you almost always include the lower-order polynomial terms. E.g. if you were fitting a 2nd-order polynomial you would run:

$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + \epsilon_i $$

Usually it would be quite bizarre to force $b_1 = 0$ and instead run

$$ y_i = b_0 + b_2 x_i^2 + \epsilon_i, $$

but students of Newtonian mechanics will be able to imagine exceptions.

**(b) AR(p) models**

Similarly, if you were estimating an AR(p) model, you would include the lower-order lags. For an AR(2) you would run:

$$ y_t = b_0 + b_1 y_{t-1} + b_2 y_{t-2} + \epsilon_t $$

and it would be bizarre to run:

$$ y_t = b_0 + b_2 y_{t-2} + \epsilon_t $$

**(c) Trigonometric functions**

As @NickCox mentions, $\cos$ and $\sin$ terms similarly tend to go together. For more on that, see e.g. this paper.

**More broadly...**

You want to include right-hand-side variables when there are good theoretical reasons to do so. And as other answers here and across StackExchange discuss, stepwise variable selection can create numerous statistical problems.

It's also important to distinguish between:

- a coefficient that is estimated precisely and happens to be near zero, and
- a coefficient that is statistically indistinguishable from zero because its standard error is large.

In the latter case, it's problematic to argue the coefficient doesn't matter; it may simply be poorly measured.
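To make the fixed-effects example above concrete, here is a minimal R sketch on simulated panel data (my own illustration; all names and parameter values are made up). The individual dummies tend to be individually insignificant yet jointly significant:

```r
## Minimal sketch of the fixed-effects point, on simulated panel data.
## (Hypothetical illustration; not from the original answer.)
set.seed(1)
n_id <- 50; n_t <- 4
id <- rep(1:n_id, each = n_t)
u  <- rnorm(n_id)[id]               # individual effects u_i
x  <- rnorm(n_id * n_t)
y  <- 2 * x + u + rnorm(n_id * n_t)

fe <- lm(y ~ x + factor(id))        # fixed effects as indicator variables

summary(fe)$coefficients["x", ]     # b is estimated well
## Most individual u_i estimates are insignificant, but the indicators
## are jointly significant, so they all stay in:
anova(lm(y ~ x), fe)
```

In practice you would likely use a dedicated fixed-effects routine (e.g. the `plm` package in R), which absorbs the dummies rather than estimating each one explicitly.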
Yes, there are. Any variable that correlates with your response variable in a meaningful way, even at a statistically insignificant level, can confound your regression if it is not included. This is known as underspecification, and it leads to biased (not merely less precise) parameter estimates. See https://onlinecourses.science.psu.edu/stat501/node/328; in short, an underspecified model omits important predictors, which biases both the remaining coefficient estimates and the model's predictions.
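As a quick illustration of that confounding mechanism (my own simulation, not taken from the linked page): dropping a covariate that is correlated with an included regressor biases the included regressor's coefficient.

```r
## Omitted-variable bias in miniature (hypothetical simulation).
set.seed(42)
n <- 1000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)            # x is correlated with z
y <- 1.0 * x + 0.3 * z + rnorm(n)  # z has a real, modest effect on y

coef(lm(y ~ x + z))["x"]  # close to the true value 1
coef(lm(y ~ x))["x"]      # biased: x picks up part of z's effect (theory: 1.12)
```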
Usually you do not include or exclude variables in a linear regression because of their significance. You include them because you assume that the selected variables are (good) predictors of the regression criterion. In other words, predictor selection is based on theory.

Statistical insignificance in linear regression can mean (at least) two things: the predictor may genuinely have little unique association with the criterion, or the data may simply be too noisy (small sample, collinear predictors) for its effect to be detected.

A valid reason to exclude insignificant predictors is that you are looking for the smallest subset of predictors that explains the criterion variance, or most of it. If you have found it, check it against your theory.
In econometrics this happens left and right. For instance, if you are using quarterly seasonality dummies Q2, Q3, and Q4, it often happens that as a group they're significant, but some of them are not significant individually. In this case you usually keep them all (a short R sketch of this case appears at the end of this answer).

Another typical case is interactions. Consider a model $y \sim x*z$, where the main effect $z$ is not significant but the interaction $x*z$ is. In this case it's customary to keep the main effect. There are many reasons why you should not drop it, and some of them have been discussed elsewhere on this site.

UPDATE: Another common example is forecasting. In economics departments, econometrics is usually taught from the inference perspective, where a lot of attention goes to p-values and significance because you're trying to understand what causes what. In forecasting there's not much emphasis on this, because all you care about is how well the model forecasts the variable of interest. This is similar to machine learning applications, by the way, which have recently been making their way into economics. You can have a model with all significant variables that doesn't forecast well; in ML this is often associated with so-called "overfitting". Such a model has very little use in forecasting, obviously.
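Here is a hedged R sketch of the seasonal-dummy case (simulated data and made-up coefficients, not from the original answer). The group test can say "keep all the dummies" even when individual ones look insignificant:

```r
## Quarterly dummies: individually weak, jointly significant (simulated).
set.seed(7)
n     <- 80
q     <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), length.out = n))
trend <- 1:n
y     <- 0.1 * trend + c(0, 0.4, 0.1, 0.6)[as.integer(q)] + rnorm(n)

full <- lm(y ~ trend + q)
summary(full)              # some quarter dummies may look insignificant...
anova(lm(y ~ trend), full) # ...but test Q2-Q4 as a group before dropping any
```

On the interaction point: R's formula `y ~ x*z` already expands to `x + z + x:z`, so the main effects are kept by default; you would have to write `y ~ x:z` explicitly to violate the convention.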
You are asking two different questions:

- Q1: Is there ever a situation where a variable should be included in a regression despite being statistically insignificant?
- Q2: Is statistical significance a sound criterion for model building?
Edit: this was true of the original post, but might no longer be true after the edits.

Regarding Q1, I think it is on the border of being too broad. There are many possible answers, some already provided. One more example is building models for forecasting (see the source cited below for an explanation).

Regarding Q2, statistical significance is not a sound criterion for model building. In his blog post "Statistical tests for variable selection", Rob J. Hyndman argues (roughly) that significance tests answer a different question than variable selection does, and that selecting variables by p-values tends to produce models that forecast poorly; criteria such as AIC or cross-validation are better suited to the task.
Also note that you can often find variables that are statistically significant purely by chance (with the chance controlled by your choice of significance level: at the 5% level, about 1 in 20 irrelevant variables will appear significant). The observation that a variable is statistically significant is therefore not enough to conclude that the variable belongs in the model.
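A small simulation of that last point (my own, purely illustrative): regress pure noise on twenty pure-noise predictors, and on average about one of them will clear the 5% bar.

```r
## "Significant" by chance alone: y is unrelated to every column of X.
set.seed(123)
n <- 200; k <- 20
X <- matrix(rnorm(n * k), n, k)
y <- rnorm(n)
pvals <- summary(lm(y ~ X))$coefficients[-1, 4]  # drop the intercept row
sum(pvals < 0.05)   # expected count is k * 0.05 = 1
```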
I'll add another "yes". I've always been taught, and I've tried to pass it along, that the primary consideration in covariate choice is domain knowledge, not statistics. In biostatistics, for instance, if I'm modelling some health outcome on individuals, then no matter what the regression says, you'll need some darn good arguments for me not to include age, race, and sex in the model.

It also depends on the purpose of your model. If the purpose is gaining a better understanding of which factors are most associated with your outcome, then building a parsimonious model has some virtues. If you care about prediction, and not so much about understanding, then eliminating covariates may be a smaller concern.

Finally, if you're planning to use statistics for variable selection, check out what Frank Harrell has to say on the subject (http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/, and his book Regression Modeling Strategies). Briefly, by the time you've used stepwise or similar statistically-based strategies for choosing the best predictors, any test of "are these good predictors?" is terribly biased: of course they're good predictors, you've chosen them on that basis, and so the p-values for those predictors are falsely low.
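That post-selection bias is easy to demonstrate; here is a sketch (my own simulation, not Harrell's): run backward stepwise selection on pure noise, and the surviving predictors' p-values look deceptively good, because they were selected for exactly that.

```r
## Stepwise selection on pure noise: survivors' p-values are biased low.
set.seed(99)
n <- 100; k <- 30
dat  <- data.frame(y = rnorm(n), matrix(rnorm(n * k), n, k))
full <- lm(y ~ ., data = dat)
best <- step(full, direction = "backward", trace = 0)
summary(best)   # some "predictors" of noise survive, with small p-values
```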
The only thing that a result of "statistical insignificance" truly says is that, at the selected level of Type I error, we cannot even tell whether the effect of the regressor on the dependent variable is positive or negative (see this post). So if we keep this regressor, any discussion of its own effect on the dependent variable has no statistical evidence to back it up. But this estimation failure does not say that the regressor does not belong in the structural relation; it only says that with the specific data set we were unable to determine the sign of its coefficient with any certainty. So in principle, if there are theoretical arguments that support its presence, the regressor should be kept. Other answers here provide specific models/situations in which such regressors are kept in the specification, for example the answer mentioning the fixed-effects panel data model.
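One way to see this in R (a made-up example): an insignificant coefficient is exactly one whose confidence interval straddles zero, so the data cannot even pin down its sign.

```r
## Insignificance = the confidence interval contains both signs.
set.seed(5)
x <- rnorm(30)
y <- 0.1 * x + rnorm(30)            # weak true effect, small sample
fit <- lm(y ~ x)
summary(fit)$coefficients["x", ]    # large p-value (typically, for this setup)
confint(fit)["x", ]                 # interval spans zero: sign undetermined
```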
You may include a variable of particular interest if it is the focus of the research, even if it is not statistically significant. Also, in biostatistics, clinical significance is often different from statistical significance.