The book (and later a movie) Moneyball by Michael Lewis tells the story of how the US baseball team Oakland Athletics leveraged the power of data in 2002 instead of relying on experts.
Better data and better analysis of that data led the team to find and exploit market inefficiencies.
The team was one of the poorest in a period when only rich teams could afford the all-star players (the imbalance in total salaries being something like 4 to 1).
New ownership in 1995 had been improving the team’s win totals, but in 2001 the loss of three key players and budget cuts prompted a new idea: take a quantitative approach and find undervalued players.
The traditional way to select players was through scouting, but Oakland and its general manager Billy Beane (Brad Pitt in the movie) selected players based on their statistics, without prejudice. Specifically, his assistant, the Harvard graduate Paul DePodesta, looked at the data to find which skills were undervalued.
A huge repository of US baseball statistics is Lahman’s Baseball Database (the empirical analysis of such statistics is known as sabermetrics). This database contains complete batting and pitching statistics from 1871 to the present, plus fielding statistics, standings, team stats, managerial records, post-season data and more.
A subset of the data is stored in a CSV file that I put on my GitHub account; it can be read into a pandas data frame:
    In : import pandas as pd
    In : baseballData = pd.read_csv("baseball.csv")
    In : baseballData.columns
    Out: Index(['Team', 'League', 'Year', 'RS', 'RA', 'W', 'OBP', 'SLG', 'BA',
                'Playoffs', 'RankSeason', 'RankPlayoffs', 'G', 'OOBP', 'OSLG'],
               dtype='object')
The feature named ‘W’ is the number of wins in the season ‘Year’, and the feature ‘Playoffs’ is a binary flag (0 = didn’t make the playoffs; 1 = did).
    In : baseballData[['Team','Year','W','Playoffs']].head()
    Out:
      Team  Year   W  Playoffs
    0  ARI  2012  81         0
    1  ATL  2012  94         1
    2  BAL  2012  93         1
    3  BOS  2012  69         0
    4  CHC  2012  61         0
The first step was to understand what the team needed to reach the playoffs: judging from past seasons, DePodesta concluded that it takes 95 wins to be sure of making the playoffs.
To win games a team needs to score more “runs” than its opponents, but how many more? DePodesta used linear regression to find out. We can see how.
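One way to sanity-check a win threshold like this is to look at how often teams at or above a given number of wins reached the playoffs. The sketch below is only an illustration: the helper name `playoff_rate` is hypothetical, and the tiny frame holds just the five 2012 rows printed above, which is far too little data to reproduce the 95-win figure.

```python
import math

import pandas as pd

# The five rows printed earlier in the post (2012 season only).
sample = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL", "BOS", "CHC"],
    "W": [81, 94, 93, 69, 61],
    "Playoffs": [0, 1, 1, 0, 0],
})


def playoff_rate(df, min_wins):
    """Share of teams with at least `min_wins` wins that made the playoffs."""
    qualified = df[df["W"] >= min_wins]
    if qualified.empty:
        return math.nan  # no team reached this many wins
    return qualified["Playoffs"].mean()


for threshold in (80, 90):
    print(threshold, playoff_rate(sample, threshold))
```

Run over many thresholds on the full `baseballData` frame, the win count at which this rate approaches 1 would be the kind of evidence that could support a figure like 95.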
The feature ‘RS’ is the number of runs scored and ‘RA’ is the number of runs allowed.
We can add a feature that summarises both by taking their difference:
    # add Run Difference ("RD") column
    # (bracket assignment is needed: baseballData.RD = ... would not create a column)
    In : baseballData['RD'] = baseballData.RS - baseballData.RA
    In : baseballData[['Team','RS','RA','RD','W']].head()
    Out:
      Team   RS   RA    RD   W
    0  ARI  734  688    46  81
    1  ATL  700  600   100  94
    2  BAL  712  705     7  93
    3  BOS  734  806   -72  69
    4  CHC  613  759  -146  61
Our output is the number of wins W, and we want to see how it depends on the run difference RD:
W = beta0 + beta1 * RD
This is a linear regression and the beta parameters can be estimated using the Gradient Descent algorithm, as in this code snippet:
    import numpy as np

    # We use the Run Difference as the single feature for our regression.
    # .copy() avoids inserting into a view of baseballData.
    features = baseballData[['RD']].copy()
    m = len(features)  # number of examples
    # Add a column of ones (y intercept)
    features.insert(0, 'intercept', np.ones(m))
    # Values are the wins
    values = baseballData.W
    # Initialize beta with zeroes, can also be random values
    beta = np.zeros(len(features.columns))
    print("=== Linear Regression Moneyball example ===")
    # Set appropriate values for alpha, epsilon, iterations.
    # Please feel free to change these values.
    # gradientDescent() must already be defined; it is not shown in this snippet.
    beta, cost_history = gradientDescent(np.array(features), np.array(values), beta,
                                         max_iterations=40000)
    print("Beta parameters are: ", beta)
And this is the result:
    === Linear Regression Moneyball example ===
    *** converged after iterations: 32418
    Beta parameters are: [ 80.904221 0.10454822]
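The gradientDescent function called in the snippet is not listed in this post. As a rough idea of what such a helper might look like, here is a minimal sketch of batch gradient descent for least-squares regression. The name `gradient_descent`, the learning rate `alpha`, the tolerance `tol` and the stopping rule are all assumptions of this sketch, not the settings behind the output above. It runs on the five rows shown earlier, with RD divided by 100 so that a large learning rate stays stable, and cross-checks the fit against NumPy's closed-form `np.polyfit`.

```python
import numpy as np


def gradient_descent(X, y, beta, alpha=0.5, tol=1e-10, max_iterations=40000):
    """Batch gradient descent for least-squares linear regression.

    Hypothetical stand-in for the post's gradientDescent(); alpha and tol
    are assumed values.
    """
    m = len(y)
    cost_history = []
    for _ in range(max_iterations):
        residuals = X @ beta - y
        cost_history.append(residuals @ residuals / (2 * m))
        gradient = X.T @ residuals / m
        if np.max(np.abs(gradient)) < tol:  # gradient is ~0: converged
            break
        beta = beta - alpha * gradient
    return beta, cost_history


# The five rows shown earlier in the post.
rd = np.array([46.0, 100.0, 7.0, -72.0, -146.0])
wins = np.array([81.0, 94.0, 93.0, 69.0, 61.0])

# Intercept column plus the scaled feature (scaling is a choice of this sketch).
X = np.column_stack([np.ones(len(rd)), rd / 100.0])

beta_scaled, cost_history = gradient_descent(X, wins, np.zeros(2))
beta = np.array([beta_scaled[0], beta_scaled[1] / 100.0])  # back to wins per run

# Cross-check against the closed-form least-squares fit.
slope_ref, intercept_ref = np.polyfit(rd, wins, 1)
print("gradient descent:", beta)
print("closed form:     ", [intercept_ref, slope_ref])
```

Five rows will not reproduce the beta parameters reported above, which come from the whole CSV file; the point is only that the iterative estimate and the closed-form one agree.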
With the beta parameters reported above, the formula is: W = 80.904221 + 0.104548 * RD
and to reach 95 wins the team would need RD = (95 - 80.904221) / 0.104548 ≈ 135 runs.
The goal of a baseball team then becomes a chain: to make the playoffs, win at least 95 games; to win at least 95 games, score at least 135 more runs than the opponents.
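The run-difference target can be checked with a few lines of arithmetic. This small sketch simply solves the regression formula for RD, using the coefficients from the gradient descent output; the variable names are mine.

```python
import math

# Coefficients reported in the post's gradient descent output.
beta0 = 80.904221   # intercept: expected wins at zero run difference
beta1 = 0.10454822  # additional wins per run of run difference

target_wins = 95

# Solve target_wins = beta0 + beta1 * RD for RD.
required_rd = (target_wins - beta0) / beta1

print(round(required_rd, 1))   # 134.8
print(math.ceil(required_rd))  # 135
```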
Finally, they used multivariate regression to check which baseball statistics significantly increased the runs scored and decreased the runs allowed. Their conclusion was these two:
- on-base percentage (percentage of time a player gets on base)
- slugging percentage (how far a player gets around the bases on his turn).
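A multivariate regression of this kind could be sketched as follows. This is only an assumed setup, not the analysis Oakland actually ran: it fits runs scored as a linear function of the ‘OBP’ and ‘SLG’ columns with `np.linalg.lstsq`, the helper name `fit_runs_scored` is hypothetical, and the data is synthetic and noise-free, generated from arbitrary coefficients purely so the example runs on its own.

```python
import numpy as np
import pandas as pd


def fit_runs_scored(df):
    """Least-squares fit of RS = b0 + b1 * OBP + b2 * SLG; returns [b0, b1, b2]."""
    X = np.column_stack([np.ones(len(df)), df["OBP"], df["SLG"]])
    coefficients, *_ = np.linalg.lstsq(X, df["RS"].to_numpy(), rcond=None)
    return coefficients


# Synthetic stand-in data: the coefficients below are arbitrary and are
# NOT estimates from real baseball statistics.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "OBP": rng.uniform(0.30, 0.36, size=30),
    "SLG": rng.uniform(0.35, 0.45, size=30),
})
toy["RS"] = -500.0 + 2000.0 * toy["OBP"] + 1000.0 * toy["SLG"]

print(fit_runs_scored(toy))
```

On the real data one would pass `baseballData` instead of `toy`; a similar fit of ‘RA’ on the opponents' columns ‘OOBP’ and ‘OSLG’ would presumably cover the runs-allowed side.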
They looked for players whose statistics would improve the team’s averages, and in 2002 Oakland made the playoffs, winning 103 games, scoring 800 runs and allowing 653 (a difference of 147).
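As a quick check, the 2002 totals can be plugged back into the regression formula. The sketch below uses only numbers quoted in this post.

```python
# Parameters reported earlier in the post's gradient descent output.
beta0, beta1 = 80.904221, 0.10454822

# Oakland's 2002 totals quoted above.
runs_scored, runs_allowed = 800, 653

run_difference = runs_scored - runs_allowed
predicted_wins = beta0 + beta1 * run_difference

print(run_difference)            # 147
print(round(predicted_wins, 1))  # 96.3
```

The model predicts about 96 wins for that run difference, just above the 95-win target; the team's actual 103 wins went well beyond the prediction.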