We have seen how to fit a linear regression model with multiple features, how to interpret the results and how accurately we can predict.
One open issue in linear regression with multiple features was which ones to use. Not every available variable should be included: besides the overfitting problem, each new variable requires more data to estimate its effect reliably. We will now see how to choose variables appropriately.
The following example is taken from the masterpiece book An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani. It is based on the Advertising dataset, available on the book's accompanying website or on Kaggle.
The dataset contains statistics about the sales of a product in 200 different markets, together with advertising budgets in each of these markets for different media channels: TV, radio and newspaper.
Imagine you are the marketing manager and need to prepare a new advertising plan for next year. You may be interested in answering questions such as:
Which media contribute to sales?
Do all three media—TV, radio, and newspaper—contribute to sales, or do just one or two of the media contribute?
Suppose that in your role you are asked to suggest, on the basis of these data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation?
As usual, this example is also available as a Jupyter notebook on GitHub.
Note: I will use the words feature, variable and predictor interchangeably to indicate the x_i inputs of the linear regression.
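As a rough first look at the question of which media contribute to sales, one option is to fit a separate one-feature regression of sales on each medium and compare the R^2 values. The sketch below is only an illustration, not the method used in the notebook: the helper names (load_columns, simple_r2, rank_features) are made up for this example, the column names "TV", "radio", "newspaper" and "sales" are assumed to match your copy of the CSV, and the small table at the bottom is invented data, not the real Advertising dataset.

```python
import csv


def load_columns(path, names):
    """Read the named numeric columns from a CSV file into lists of floats."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {name: [float(row[name]) for row in rows] for name in names}


def simple_r2(x, y):
    """R^2 of the one-feature least-squares fit y ~ b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # With a single predictor, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


def rank_features(columns, target):
    """Return (feature, R^2) pairs, strongest single-feature fit first."""
    scores = {
        name: simple_r2(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Tiny made-up sample, for illustration only (NOT the real Advertising data).
toy = {
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "radio": [5.0, 3.0, 8.0, 1.0, 6.0],
    "sales": [3.1, 4.9, 7.2, 8.8, 11.0],
}
ranking = rank_features(toy, "sales")
for name, r2 in ranking:
    print(f"{name}: R^2 = {r2:.3f}")
```

With the real file, something like rank_features(load_columns("Advertising.csv", ["TV", "radio", "newspaper", "sales"]), "sales") should work, where the file path and column names are assumptions to adjust. Keep in mind that a single-feature R^2 looks at each predictor in isolation, so it is only a starting point for choosing variables.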