We have seen a linear regression example as told in the book / movie Moneyball: how a baseball team was able to leverage the power of data analysis to compete with richer teams.

We used the Gradient Descent algorithm to see how the number of wins in a season linearly depends on the number of runs scored and allowed.

We can extend that example with the help of one of the most widely used Python libraries: scikit-learn (imported as sklearn), a dedicated machine learning toolkit designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

The LinearRegression class fits a linear model that minimises the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. This method is known as Ordinary Least Squares (OLS).
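To make the least-squares idea concrete, here is a minimal sketch in plain Python for the single-feature case. The function names and the toy data are made up for illustration (they are not taken from the moneyball dataset); scikit-learn solves the same kind of problem for any number of features.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the residual sum of squares
    for the one-feature model y ≈ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form OLS solution: slope = cov(x, y) / var(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (hypothetical): run differences and wins for five seasons
toy_rd = [-100, -50, 0, 50, 100]
toy_wins = [71, 74, 80, 86, 89]

intercept, slope = fit_simple_ols(toy_rd, toy_wins)
rss = residual_sum_of_squares(toy_rd, toy_wins, intercept, slope)
print(intercept, slope, rss)
```

Any other choice of intercept and slope gives a larger residual sum of squares on the same data, which is exactly what "least squares" means.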

moneyball is the dataset containing the necessary data; refer to the previous post for details.

Here is an extract of the Jupyter notebook, which is available on GitHub.

# Regression model to predict wins

The first step was to understand what the team needed to enter the playoffs. Judging from past seasons, it takes 95 wins to be reasonably sure of making the playoffs.

    from sklearn import linear_model

    WinsModel = linear_model.LinearRegression()

The input variable is the Runs Difference (RD = Runs Scored – Runs Allowed), and the output is the number of Wins (W):

    features = moneyball[['RD']].copy()  # input (copy, so the insert below does not touch moneyball)
    features.insert(0, 'intercept', 1)   # constant column of ones
    WinsModel.fit(features, moneyball.W)

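To see the fitted beta parameters we can examine the following attributes. Because LinearRegression fits its own intercept by default, the coefficient of the constant 'intercept' column comes out as 0: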

    In [1]: WinsModel.intercept_
    Out[1]: 80.881374722838132

    In [2]: WinsModel.coef_
    Out[2]: array([ 0. , 0.10576562])

Therefore the prediction formula for the number of wins is:

Wins = 80.8814 + 0.10577 * RD

To get the run difference needed for 95 wins, we solve this simple equation:

95 = 80.8814 + 0.10577 * RD

RD = (95 - 80.8814) / 0.10577 ≈ 133.5

Rounding up to a convenient target, a run difference of about 135 is enough for 95 wins.
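The same inversion can be sketched in plain Python. The two constants are copied from the `intercept_` and `coef_` output printed by the fitted model above; the helper names are hypothetical and only serve the illustration.

```python
# Fitted values copied from Out[1] and Out[2] above
INTERCEPT = 80.881374722838132
SLOPE = 0.10576562


def predict_wins(run_difference):
    """Wins predicted by the fitted line for a given run difference."""
    return INTERCEPT + SLOPE * run_difference


def run_difference_for(target_wins):
    """Invert Wins = INTERCEPT + SLOPE * RD to get the RD for a win target."""
    return (target_wins - INTERCEPT) / SLOPE


rd_needed = run_difference_for(95)
print(round(rd_needed, 1))          # break-even run difference for 95 wins
print(round(predict_wins(135), 2))  # wins predicted for a run difference of 135
```

With these coefficients the break-even point is a little under 135, so 135 is a safe round target.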

We can verify it by predicting the wins for a run difference of 135:

    In [3]: WinsModel.predict([[1, 135]])
    Out[3]: array([ 95.15973375])

## The goal: score 135+ runs more than opponent

How does a team increase the runs difference (RD)?

There are two ways: either scoring more runs (RS) or allowing fewer runs (RA).

The A’s started using a different method to select players, based on their statistics, not on their looks.

Most teams focused on Batting Average (BA): getting on base by hitting the ball.

The A’s discovered that BA was overvalued and two baseball statistics were significantly more important than anything else:

- On-Base Percentage (OBP): percentage of the time a player gets on base (including walks)
- Slugging Percentage (SLG): how far a player gets around the bases on his turn (measures power)
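For readers unfamiliar with these statistics, the sketch below computes BA, OBP and SLG from basic counting stats using the standard baseball definitions. The function names and the season line are hypothetical; the moneyball dataset already contains these values at team level, so this is only meant to show what they measure.

```python
def batting_average(hits, at_bats):
    """BA: hits per at-bat. Walks are not counted at all."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP: how often the batter reaches base, including walks."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / opportunities


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at-bat, so extra-base hits weigh more."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Hypothetical season line for one player
singles, doubles, triples, home_runs = 100, 30, 5, 15
hits = singles + doubles + triples + home_runs  # 150
at_bats, walks, hit_by_pitch, sacrifice_flies = 500, 60, 5, 5

ba = batting_average(hits, at_bats)
obp = on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies)
slg = slugging_percentage(singles, doubles, triples, home_runs, at_bats)
print(round(ba, 3), round(obp, 3), round(slg, 3))
```

Note that a player who walks often raises his OBP without changing his BA, which is one reason BA alone can undervalue him.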

We can use linear regression to verify which baseball player features are more important to predict runs.

# Regression model to predict runs scored

    RSmodel = linear_model.LinearRegression()
    features = moneyball[['OBP', 'SLG', 'BA']].copy()  # input
    features.insert(0, 'intercept', 1)                 # constant column of ones
    RSmodel.fit(features, moneyball.RS)

The score() method returns the R² metric (the coefficient of determination) for the model:

    In [4]: RSmodel.score(features, moneyball.RS)
    Out[4]: 0.93020162587862809

This is very high, close to the maximum of 1.
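As a reminder of what this metric measures, here is a minimal plain-Python sketch of R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the numbers are invented for illustration and are not taken from the moneyball data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical runs scored and the values some model predicted for them
observed_runs = [700, 750, 800, 850, 900]
predicted_runs = [710, 740, 805, 845, 900]

r2 = r_squared(observed_runs, predicted_runs)
print(r2)
```

A perfect model scores 1, while a model that always predicts the mean of the observed values scores 0.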

And if we remove the BA feature?

    RSmodel2 = linear_model.LinearRegression()
    features = moneyball[['OBP', 'SLG']].copy()  # input
    features.insert(0, 'intercept', 1)           # constant column of ones
    RSmodel2.fit(features, moneyball.RS)

    In [5]: RSmodel2.score(features, moneyball.RS)
    Out[5]: 0.92958106080965974

There is almost no difference in R² when the BA feature is removed: it only drops from 0.9302 to 0.9296.

This is a sign that BA adds almost nothing once OBP and SLG are in the model.

Batting Average (BA) is overvalued.

On Base Percentage (OBP) and Slugging Percentage (SLG) are enough.

# 2002 Prediction

We know that the A’s team OBP in 2002 is 0.339 and the team SLG is 0.430.

How many Runs and Wins can we expect?

We can simply pass these values to the predict() method:

    In [6]: RSmodel2.predict([[1, 0.339, 0.43]])
    Out[6]: array([ 804.98699884])

We predict 805 Runs scored in 2002.

In the same way we can predict that the Runs Allowed will be 622.

And finally the number of wins:

    In [7]: WinsModel.predict([[1, (805-622)]])
    Out[7]: array([ 100.23648363])

So our prediction for the A’s in 2002 is 100 wins in total, which would probably be enough to reach the playoffs.

# Final results

These are 2002 final results:

Actual runs scored were 800 instead of 805

Runs allowed were 653 instead of 622 (note the bigger difference)

And wins were 103 instead of 100

They made it to the playoffs
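To put these numbers side by side, the sketch below computes the prediction errors and, as an extra check, feeds the actual run difference into the wins formula. The coefficients are copied from the fitted wins model above; the dictionary layout and variable names are just one hypothetical way to organise the comparison.

```python
# Fitted wins-model values copied from Out[1] and Out[2] above
INTERCEPT = 80.881374722838132
SLOPE = 0.10576562

# Figures quoted in the text (RS = runs scored, RA = runs allowed, W = wins)
predicted = {"RS": 805, "RA": 622, "W": 100}
actual = {"RS": 800, "RA": 653, "W": 103}

# Error = actual - predicted, for each quantity
errors = {key: actual[key] - predicted[key] for key in predicted}
print(errors)

# Wins the model would have predicted from the actual run difference
actual_rd = actual["RS"] - actual["RA"]
wins_from_actual_rd = INTERCEPT + SLOPE * actual_rd
print(actual_rd, round(wins_from_actual_rd, 1))
```

With the actual run difference of 147 the formula gives about 96 wins, so the A’s won more games than their run difference alone would suggest.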

Models (even relatively simple models) allow managers to more accurately value players and minimise risk.

Every major league baseball team now has a statistics group.

Analytics are used in other sports, too.
