Moneyball: a simple Regression example

Moneyball book cover, from Wikimedia

The book (and later a movie) Moneyball by Michael Lewis tells the story of how, in 2002, the USA baseball team Oakland Athletics leveraged the power of data instead of relying on experts.
Better data and better analysis of that data led the team to find and exploit market inefficiencies.

The team was one of the poorest in a period when only rich teams could afford the all-star players (the imbalance in total salaries being something like 4 to 1).
New ownership in 1995 had been improving the team’s win record, but in 2001 the loss of three key players and budget cuts prompted a new idea: take a quantitative approach and find undervalued players.

The traditional way to select players was through scouting, but Oakland and its general manager Billy Beane (Brad Pitt in the movie…) selected players based on their statistics, without prejudice. Specifically, his assistant, the Harvard graduate Paul DePodesta, looked at the data to find which skills were undervalued.

A huge repository of USA baseball statistics is Lahman’s Baseball Database (the empirical analysis of such statistics is known as sabermetrics). This database contains complete batting and pitching statistics from 1871 to the present, plus fielding statistics, standings, team stats, managerial records, post-season data and more.

A subset of the data is stored in a CSV file that I put on my GitHub account; it can be read into a pandas data frame:

In [1]: import pandas as pd
   ...: baseballData = pd.read_csv("baseball.csv")
In [2]: baseballData.columns
 Index(['Team', 'League', 'Year', 'RS', 'RA', 'W', 'OBP', 'SLG', 'BA',
 'Playoffs', 'RankSeason', 'RankPlayoffs', 'G', 'OOBP', 'OSLG'],
 dtype='object')

The feature named ‘W’ is the number of wins in the season ‘Year’, and the feature ‘Playoffs’ is a boolean value (0 = didn’t make the playoffs; 1 = did).

In [3]: baseballData[['Team','Year','W','Playoffs']].head()
   Team  Year   W  Playoffs
 0  ARI  2012  81         0
 1  ATL  2012  94         1
 2  BAL  2012  93         1
 3  BOS  2012  69         0
 4  CHC  2012  61         0

The first step was to understand what a team needed to enter the playoffs: judging from past seasons, DePodesta estimated that it takes 95 wins to be sure of making the playoffs.
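One way to sanity-check a win threshold against the data is to look at how often teams at or above a given win total actually reached the playoffs. The sketch below assumes the ‘W’ and ‘Playoffs’ columns shown above; the helper name playoff_share is made up for illustration, and this is not necessarily how the 95-win figure was originally derived.

```python
import pandas as pd

def playoff_share(df: pd.DataFrame, min_wins: int) -> float:
    """Fraction of team-seasons with at least `min_wins` wins that made the playoffs."""
    qualified = df[df["W"] >= min_wins]
    if qualified.empty:
        # no team-season reached this win total
        return float("nan")
    # 'Playoffs' is 0/1, so its mean is the share of playoff teams
    return float(qualified["Playoffs"].mean())
```

Calling playoff_share(baseballData, 95) on the full data frame would give the share of 95-win seasons that ended in a playoff berth; trying a few thresholds shows where that share approaches 1.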


To win games a team needs to score more “runs” than its opponent, but how many more? DePodesta used linear regression to find out. We can see how.

The feature ‘RS’ is the number of runs scored and ‘RA’ is the number of runs allowed.
We can add a feature that summarises both of them by calculating the difference:

# add Run Difference ("RD") column
# (bracket assignment is needed: attribute assignment does not create a new column)
In [4]: baseballData['RD'] = baseballData.RS - baseballData.RA
In [5]: baseballData[['Team','RS','RA','RD','W']].head()
   Team   RS   RA    RD   W
 0  ARI  734  688    46  81
 1  ATL  700  600   100  94
 2  BAL  712  705     7  93
 3  BOS  734  806   -72  69
 4  CHC  613  759  -146  61

Our output is the number of wins W, and we want to see how it depends on the run difference RD:

W = beta0 + beta1 * RD

This is a linear regression and the beta parameters can be estimated using the Gradient Descent algorithm, as in this code snippet:

import numpy as np

# we use the Run Difference as feature for our regression
features = baseballData[['RD']].copy()
# Add a column of ones (y intercept)
m = len(features)  # number of observations
features.insert(0, 'intercept', np.ones(m))

# Values are the wins
values = baseballData.W
# Initialize beta with zeroes, can also be random values
beta = np.zeros(len(features.columns))

print("=== Linear Regression Moneyball example ===")
# Set appropriate values for alpha, epsilon, iterations.
# Please feel free to change these values.
# gradientDescent is defined elsewhere; passing values and beta here
# is an assumption about its signature.
beta, cost_history = gradientDescent(np.array(features),
                                     np.array(values), beta,
                                     max_iterations = 40000)

print("Beta parameters are: ", beta)

And this is the result:

=== Linear Regression Moneyball example ===
 *** converged after iterations: 32418
 Beta parameters are: [ 80.904221   0.10454822]
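The gradientDescent function itself is not shown in this post. As a rough idea of what such a routine does, here is a minimal sketch of batch gradient descent for least squares. The function name, the default values of alpha and epsilon, and the stopping rule are all assumptions of mine, and the input is only the five 2012 rows printed earlier rather than the full data set.

```python
import numpy as np

def gradient_descent_sketch(X, y, beta, alpha=2e-4, epsilon=1e-12,
                            max_iterations=200000):
    """Batch gradient descent for least squares; returns (beta, cost_history)."""
    m = len(y)
    beta = beta.astype(float).copy()
    cost_history = []
    for _ in range(max_iterations):
        error = X @ beta - y                  # residuals of the current fit
        cost = (error @ error) / (2 * m)      # half the mean squared error
        # stop once the cost has essentially stopped decreasing
        if cost_history and abs(cost_history[-1] - cost) < epsilon:
            break
        cost_history.append(cost)
        beta -= alpha * (X.T @ error) / m     # step against the gradient
    return beta, cost_history

# The five 2012 rows printed earlier (run difference and wins)
rd = np.array([46.0, 100.0, 7.0, -72.0, -146.0])
wins = np.array([81.0, 94.0, 93.0, 69.0, 61.0])
X = np.column_stack([np.ones_like(rd), rd])   # intercept column + RD
beta, cost_history = gradient_descent_sketch(X, wins, np.zeros(2))
```

Because RD is not rescaled, alpha has to stay small (a much larger step diverges on these values), which is also why tens of thousands of iterations are needed. With only five rows, the coefficients will differ from the full-data result printed above.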

The beta parameters estimated from the full data set give the formula: W = 80.904221 + 0.104548 * RD

To reach 95 wins, a team would therefore need RD = (95 - 80.904) / 0.1045 ≈ 135 runs.
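The inversion of the fitted line can be checked numerically. This small sketch plugs in the beta parameters printed by the gradient descent run; the constant and function names are mine.

```python
BETA0 = 80.904221    # intercept reported by the gradient descent run
BETA1 = 0.10454822   # wins gained per extra run of difference

def run_difference_needed(target_wins: float) -> float:
    """Invert W = BETA0 + BETA1 * RD to get the RD that yields target_wins."""
    return (target_wins - BETA0) / BETA1

rd_for_95 = run_difference_needed(95)   # a little under 135
```

Rounding up gives the 135-run target.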

The goal of a baseball team is then a chain: to make the playoffs, win at least 95 games; to win at least 95 games, score at least 135 more runs than the opponents.

Finally, they used multiple regression to check which baseball statistics significantly increased the runs scored and decreased the runs allowed; their conclusion was these two:

  • on-base percentage (percentage of time a player gets on base)
  • slugging percentage (how far a player gets around the bases on his turn).
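As an illustration of such a regression, the sketch below fits runs scored against the two statistics above using ordinary least squares. It assumes the ‘RS’, ‘OBP’ and ‘SLG’ columns of the data frame loaded earlier; the helper name fit_runs_scored is hypothetical. It only estimates coefficients and does not test their significance, which would need a statistics package.

```python
import numpy as np
import pandas as pd

def fit_runs_scored(df: pd.DataFrame) -> dict:
    """Least-squares fit of RS = b0 + b1 * OBP + b2 * SLG."""
    X = np.column_stack([
        np.ones(len(df)),                  # intercept column
        df["OBP"].to_numpy(dtype=float),   # on-base percentage
        df["SLG"].to_numpy(dtype=float),   # slugging percentage
    ])
    y = df["RS"].to_numpy(dtype=float)     # runs scored
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept", "OBP", "SLG"], coef))
```

The same pattern with ‘RA’ as the output and the opponents’ columns ‘OOBP’ and ‘OSLG’ as inputs would presumably cover the runs-allowed side, after dropping any rows where those columns are missing.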

They looked for players whose statistics would improve the team’s averages, and in 2002 Oakland made the playoffs, winning 103 games, scoring 800 runs and allowing 653 (a difference of 147).

