jonathan_landy
>> Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients. But most of the time that I see the method used (including recent examples distributed by so-called experts as part of their online teaching), the end model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Further, even when the goal is only prediction, there are better methods, like the Lasso, for dealing with a high number of variables.

I use this method often for prediction applications. First, it’s a form of hyperparameter selection, so you should obviously use holdout and test sets to help you make a good choice.

Second, I often see the method dogmatically shut down like this in favor of the lasso. Yet every time I have compared the two, they give similar selections. So how can one be “evil” and the other so glorified? I prefer the stepwise method, though, because you can visualize the benefit of adding each additional feature. That can help guide further feature development, a point that I’ve seen significantly lift the bottom line of enterprise-scale companies.
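A minimal Python sketch of the workflow described above (my own illustration, not code from the article or the blog): forward stepwise selection scored on a holdout set, recording the gain from each added feature so it can be visualized, then compared against the Lasso's selection. The data, feature count, and stopping rule here are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not the blog's code): forward stepwise
# selection scored on a holdout set, compared with the Lasso's selection.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 500, 15
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # two true effects

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

selected, remaining = [], list(range(p))
history = []  # holdout R^2 after each added feature -- this is what you plot
while remaining:
    # Try each remaining candidate; keep the one with the best holdout R^2.
    scores = {
        j: LinearRegression()
        .fit(X_train[:, selected + [j]], y_train)
        .score(X_val[:, selected + [j]], y_val)
        for j in remaining
    }
    best = max(scores, key=scores.get)
    if history and scores[best] <= history[-1]:
        break  # no candidate improves the holdout fit; stop adding features
    selected.append(best)
    remaining.remove(best)
    history.append(scores[best])

# Lasso's selection on the same training data, for comparison.
lasso = LassoCV(cv=5).fit(X_train, y_train)
lasso_selected = set(np.flatnonzero(lasso.coef_))
print("stepwise:", sorted(selected), "lasso:", sorted(lasso_selected))
```

Plotting `history` against the number of features gives the "benefit of each additional feature" curve; on data like this, both methods tend to pick out the same strong predictors.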

ianbooker
There are many concepts called stepwise regression; it's so weird that statistics, as a field, is so bad at delineating its concepts.

I teach my students what you see in most social science papers, and in light of the article at hand, I would call it "stepwise presentation of multivariate regression".

When it comes to the task of explaining, I think presenting several models, with a discussion of which one you pick and why, provides good value to readers.

That said, I agree with the sentiment of the article but not the wording. "Blind" or manipulative stepwise deletion will decrease the falsifiability of your work. That should be more provocative to scientists than "evil".

specproc
I came to stats through the social sciences, and I get the feeling that's where the author's beef is coming from.

If you're doing, say, econometrics for publication and you don't have a solid theoretical basis for all your variables, you're just flapping about.

There are, however, plenty of other use cases where this sort of approach may be valid. The author mentions prediction, but regression summaries are really helpful tools in a variety of domains.

djoldman
Bold headline, and then the stepping back further down the article:

> Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients.

It depends strikes again.

> As my professor once told our class:

> "If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell."

Again, it depends. If one wants to be pedantic, then choosing to omit any variable in the universe in a non-random fashion is itself based on some data. Choosing variables via "subject matter expertise" is the same thing, just "smarter."

> To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 X correlated variables and 1 y variable.

> mod <- lm(y ~ ., data = d)

If one wants to be pedantic, then why are we using a model that assumes independent variables with correlated data?
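For concreteness, here is a Python sketch (my own, mirroring the quoted R setup rather than reproducing the blog's code) of the scenario in question: correlated predictors with only a few true effects, fit by ordinary least squares. The correlation level, sample size, and coefficients are assumptions for illustration. Note that OLS only breaks down under perfect collinearity; correlation among predictors inflates coefficient variances rather than biasing the estimates.

```python
# Hypothetical sketch mirroring the blog's R setup: simulate 15 correlated
# X variables and one y, then fit least squares (like lm(y ~ ., data = d)).
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 15
# Equicorrelated design: pairwise correlation 0.5 between all predictors.
cov = np.full((p, p), 0.5) + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]  # only three truly non-zero effects (assumed)
y = X @ beta + rng.normal(size=n)

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(coef[1:4], 2))  # estimates for the three true effects
```

Even with every pair of predictors correlated at 0.5, the least-squares estimates land close to the true coefficients here; what the correlation costs you is wider standard errors, which is exactly why data-driven selection on such fits is fragile.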