Feature selection

Sklearn F-test feature selection

Ref1

There won't be a difference if f_regression just computes the F statistic and picks the best features. There might be a difference in the ranking, assuming f_regression does the following:

  • Start with a constant model, M0

  • Try all models M1 consisting of just one feature and pick the best according to the F statistic

  • Try all models M2 consisting of M1 plus one other feature and pick the best ...

In that case the rankings can differ, as the correlation will not be the same at each iteration. But you can still get this ranking by just computing the correlation at each step, so why does f_regression take an additional step? It does two things (a usage sketch follows the list):

  • Feature selection: If you want to select the k best features in a machine learning pipeline, where you only care about accuracy and have measures to control under- or overfitting, you might only care about the ranking, and the additional computation is not useful.

  • Test for significance: If you are trying to understand the effect of some variables on an output in a study, you might want to build a linear model and include only the variables that significantly improve your model, with respect to some p-value. Here, f_regression comes in handy.
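
Both uses show up directly in scikit-learn's API: f_regression returns one F statistic and one p-value per feature, so you can either rank by the scores or threshold on the p-values. A minimal sketch (the toy data from make_regression is just for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data: 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Use case 1: keep the k best-ranked features, ignoring significance.
selector = SelectKBest(score_func=f_regression, k=3)
X_k = selector.fit_transform(X, y)
print("kept columns:", selector.get_support(indices=True))

# Use case 2: keep only features whose F-test p-value clears a threshold.
F, p = f_regression(X, y)
print("significant at p < 0.05:", np.where(p < 0.05)[0])
```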

What is an F-test

An F-test (Wikipedia) is a way of assessing whether adding new variables significantly improves a model. You can use it when you have a basic model M0 and a more complicated model M1, which contains all variables from M0 and some more. The F-test tells you whether M1 is significantly better than M0, with respect to a p-value.

To do so, it uses the residual sum of squares as an error measure and compares the reduction in error with the number of variables added and the number of observations (more details on Wikipedia). Adding variables, even completely random ones, is expected to lower the error, because each one adds another dimension to the model. The goal is to figure out whether the new features are really helpful, or whether they are random numbers that only help the model because they add a dimension.
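
As a concrete illustration, here is a minimal sketch of that nested-model F-test, using the standard formula F = ((RSS0 - RSS1)/(p1 - p0)) / (RSS1/(n - p1)); the RSS values in the example are made-up numbers:

```python
from scipy import stats

def f_test_nested(rss0, rss1, p0, p1, n):
    """F-test for nested models: M0 (p0 parameters, error rss0) vs
    M1 (p1 > p0 parameters, error rss1), fit on n observations."""
    F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))
    p_value = stats.f.sf(F, p1 - p0, n - p1)  # survival function: P(F' > F)
    return F, p_value

# Made-up example: adding 2 variables to an intercept-only model
# reduces the RSS from 120.0 to 80.0 with n = 50 observations.
F, p = f_test_nested(rss0=120.0, rss1=80.0, p0=1, p1=3, n=50)
print(F, p)
```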

What does f_regression do

Note that I am not familiar with the scikit-learn implementation, but let's try to figure out what f_regression is doing. The documentation states that the procedure is sequential. If the word sequential means the same as in other statistical packages, such as Matlab's Sequential Feature Selection, here is how I would expect it to proceed (a sketch of the procedure follows the list):

  • Start with a constant model, M0

  • Try all models M1 consisting of just one feature and pick the best according to the F statistic

  • Try all models M2 consisting of M1 plus one other feature and pick the best ...
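
Here is a rough sketch of that hypothesised sequential (forward) procedure; this is my reconstruction, not scikit-learn's actual code. At each step it adds the feature that most reduces the residual sum of squares of a least-squares fit, which is the same feature the F statistic would pick:

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection: grow the model one feature at a time,
    always adding the feature that most reduces the residual sum of
    squares (equivalently, the one with the largest F gain)."""
    n, d = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            # Least-squares fit with an intercept plus the candidate set.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```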

For now, I think it is a close enough approximation to answer your question: is there a difference between the ranking produced by f_regression and ranking by correlation?

If you were to start with the constant model M0 and try to find the best model with only one feature, M1, you would select the same feature whether you use f_regression or your correlation-based approach, as both are measures of linear dependency. But if you were to go from M0 to M1 and then to M2, there would be a difference in your scoring.
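
You can check the single-feature claim numerically: the F score that f_regression assigns to each feature is a monotone function of its squared correlation with y, so the two rankings coincide. A quick sketch:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

F, _ = f_regression(X, y)
# Squared Pearson correlation of each feature with y.
r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])

# F is monotone in r^2, so the two orderings are identical.
print(np.argsort(F))
print(np.argsort(r2))
```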

Assume you have three features, x1, x2, x3, where both x1 and x2 are highly correlated with the output y, but also highly correlated with each other, while x3 is only mildly correlated with y. Your method of scoring would assign its best scores to x1 and x2, but the sequential method might not. In the first round, it would pick the best feature, say x1, to create M1. Then, it would evaluate both x2 and x3 for M2. Since x2 is highly correlated with an already selected feature, most of the information it contains is already incorporated into the model, and the procedure might therefore select x3. While x3 is less correlated with y, it is more correlated with the residuals, the part that x1 does not already explain, than x2 is. This is how the two procedures you propose differ.
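
A small numerical sketch of this scenario (all coefficients are made up for illustration): x2 nearly duplicates x1, while x3 carries independent information about y. Marginally, x2 scores far above x3, but against the residuals of the one-feature model the ordering flips:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)            # x2 nearly duplicates x1
x3 = rng.normal(size=n)                       # independent of x1 and x2
y = x1 + 0.5 * x3 + 0.1 * rng.normal(size=n)  # x3 only mildly correlated with y

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("corr with y:         x1=%.2f x2=%.2f x3=%.2f"
      % (corr(x1, y), corr(x2, y), corr(x3, y)))

# After selecting x1 for M1, score the candidates against the residuals.
slope, intercept = np.polyfit(x1, y, 1)
resid = y - (slope * x1 + intercept)
print("corr with residuals: x2=%.2f x3=%.2f" % (corr(x2, resid), corr(x3, resid)))
```

With these numbers, x1 and x2 both correlate around 0.9 with y while x3 sits near 0.45; on the residuals, x2 drops to roughly zero while x3 rises close to 1, so the sequential procedure would pick x3 second.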

You can still emulate the same effect with your idea by building your model sequentially and measuring the gain of each additional feature, instead of comparing every feature to the constant model M0 as you are doing now. The result would not differ from the f_regression results. The reason this function exists is to provide this sequential feature selection, and it additionally converts the result to an F measure, which you can use to judge significance.

The goal of the F-test is to provide a significance level. If you want to make sure the features you are including are significant with respect to your p-value, you use an F-test. If you just want to include the k best features, you can use the correlation only.
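
In scikit-learn terms, the two goals map onto two selectors: SelectFpr keeps every feature whose F-test p-value is below alpha, while SelectKBest keeps the k best scores regardless of significance. A minimal sketch on toy data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFpr, SelectKBest, f_regression

X, y = make_regression(n_samples=150, n_features=8, n_informative=2,
                       noise=5.0, random_state=0)

# Significance-driven: keep features with an F-test p-value below alpha.
sig = SelectFpr(f_regression, alpha=0.05).fit(X, y)
# Rank-driven: keep the k best-scoring features, significant or not.
topk = SelectKBest(f_regression, k=2).fit(X, y)

print("p < 0.05:", sig.get_support(indices=True))
print("top k:   ", topk.get_support(indices=True))
```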

Additional material: here is an introduction to the F-test that you might find helpful.

Ref2

Comparison of F-test and mutual information

As the F-test captures only linear dependency, it rates x_1 as the most discriminative feature. Mutual information, on the other hand, can capture any kind of dependency between variables, and it rates x_2 as the most discriminative feature, which probably agrees better with our intuitive perception for this example. Both methods correctly mark x_3 as irrelevant.
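
A sketch reconstructing that comparison, following the setup of the scikit-learn example the quote comes from (the exact data-generating process here is an assumption):

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))
# y depends linearly on x_1, nonlinearly on x_2, and not at all on x_3.
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * rng.normal(size=1000)

F, _ = f_regression(X, y)
mi = mutual_info_regression(X, y, random_state=0)
print("F-test scores (normalised):", F / F.max())    # x_1 ranked highest
print("MI scores (normalised):    ", mi / mi.max())  # x_2 ranked highest
```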
