Keywords
MLB, Baseball, COVID-19, Neural Network, Logit
This article is included in the Artificial Intelligence and Machine Learning gateway.
This article is included in the Coronavirus collection.
The 2019–2020 pandemic from the novel coronavirus (COVID-19) has brought unprecedented countermeasures to every sector of the economy, including individuals, groups, institutions, and industries. The sports industry took one of the biggest hits, with all major leagues in the U.S. cancelling or halting their events. While these actions were necessary to address the public health concern, each segment is now floating various proposals to resume operations and give some relief to the significant portion of the economy that the sports industry comprises.
Major League Baseball (MLB) is likely to be the first American professional sports league to resume, probably in May or June (Passan, 2020). The players are willing (although not all of them agree on the method), the League is willing, the Arizona government is willing, and health professionals have approved a plan to move forward, known as the Arizona Plan. This plan calls for players, coaches, and staff to be quarantined in hotels around the Phoenix area and to play in empty ballparks, including the ten Cactus League Spring Training parks, Chase Field, and other Phoenix ballparks. One interesting aspect of this arrangement is that stadium size will not matter because there will be no fans. This will be an opportunity for MLB to get back into the spotlight and attract television viewership that it has not seen in decades. The experience will be optimized entirely for TV viewing, so the league will finally be able to experiment with proposed rule changes, including removing mound visits to speed up the game; adding a Robo Umpire, which was successfully tested last season through a partnership with the independent Atlantic League (Bogage, 2019); and expanding rosters to give players more rest in the extremely hot temperatures of Phoenix. While all of this will alter predictions about who is going to the playoffs, probably the biggest impact of the plan on the games is the loss of the home field advantage (HFA): the advantage that the home team has over the visiting team because the home team has its fans present, is familiar with its own ballpark, and does not have to travel.
Baseball has been shown in previous studies to be less susceptible to the HFA effect than other professional sports (Edwards & Archambault, 1979; Gómez et al., 2011; Pollard et al., 2017). Despite this, there is a measurable home field advantage in baseball, as shown by Jones (2015, 2018). Building on this, we extend the analysis for MLB given the uncertainty about which scenario the League will follow for the 2020 season. In particular, we simulate the win-loss probabilities for three different scenarios and estimate the home advantage for each team using the past three seasons’ data. This estimation helps us understand how individual team performance will be affected as teams prepare for the 2020 season under the new circumstances.
We use the MLB 2017–2019 season data for the 30 teams represented in the league. The data were obtained from MLB Advanced Media’s Baseball Savant website using the Python package PyBaseball 1.0.4 (LeDoux, 2017/2020). The data show that, of the 7,290 home games played during the 2017–2019 seasons, 3,881 (53.237%) resulted in wins and the remaining 3,409 (46.763%) resulted in losses. Next, we seek to quantify the HFA’s role in this difference.
There are various techniques to calculate the home advantage depending on the sport, gender, league, and the nature of scoring (Jones, 2015). Pollard et al. (2017) use a general linear model to fit the home advantage. However, because our outcome is a binary win-or-lose variable, we need to follow a non-linear approach. To test the hypothesis that teams have a home field advantage, we apply a logit regression model to predict the probability of winning as a function of a home-game dummy, team fixed effects, opponent fixed effects, and the win-loss records. We estimate the following regression equation:
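One form consistent with the variable definitions that follow is sketched here, assuming the standard latent-variable logit setup. The coefficient symbols \beta, \gamma, \delta, and \theta are assumed notation, and the team-by-home and opponent-by-home interaction terms described below are left out for brevity:

```latex
\begin{aligned}
Win_i^{*} &= \beta_0 + \beta_1\,Home_i + \gamma^{\top} Team_i + \delta^{\top} Opp_i + \theta^{\top} Z_i + \varepsilon_i, \\
Win_i &= \mathbf{1}\!\left[\,Win_i^{*} > 0\,\right], \\
\Pr(Win_i = 1) &= \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \beta_1\,Home_i + \gamma^{\top} Team_i + \delta^{\top} Opp_i + \theta^{\top} Z_i\right)\right]}
\end{aligned}
```

The third line follows from the first two when \varepsilon_i is assumed to follow a standard logistic distribution.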
where Win_i is a dummy variable that takes the value 1 if the recorded game resulted in a win for the team-opponent pair and zero otherwise; Home_i indicates a home game; Team_i and Opp_i control for the individual team and opponent fixed effects, respectively; Z_i represents the win-loss percentages of the team and of the opponent; and ε_i is the error term. The HFA is calculated accounting for both the team fixed effects and the opponent fixed effects by interacting Home with team and with opponent separately. We run the logit model on all the data, with Home = 1 for the home team and Home = 0 for the travelling team. Doing so separates the team fixed effects from the home field advantage. The model in Equation 1 is used to estimate both the win probability and the HFA per team. The HFA is obtained by calculating the marginal effect (ME) of Home on the win probability for each team separately.
A neural network model was also created to act as a robustness check for the logit win prediction model. The software to train the model is hosted on GitHub (Ehrlich, 2020a). We used the R package nnet 7.3–14 (Ripley & Venables, 2020) for the neural network platform, and trained and tuned the model with the R package caret 6.0–86 (Kuhn et al., 2020).
We developed a simulator to estimate what might happen if: 1) the full 2020 season is played as scheduled, as if in a parallel universe devoid of COVID-19; 2) MLB waits and is able to return around the All Star Break and play half a season in packed stadiums, which assumes an extremely optimistic timeline for a return to normal life; 3) a full season is played without fans, which is likely the only way MLB will be able to play this season (i.e., the Arizona Plan).
The simulation was executed 100 times, with the logit win prediction model as the basis for predicting each win. For each game, a random number between 0 and 1 was generated and checked against the win probability provided by the model. If the random number was below the probability, the team won; otherwise, the team lost.
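The draw-and-compare step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' R code: the schedule format, the `win_prob` callback, and the single draw per game are assumptions made for the example.

```python
import random


def simulate_season(games, win_prob, n_sims=100, seed=0):
    """Simulate a schedule n_sims times and return total wins per team.

    games    : list of (home_team, away_team) pairs (hypothetical format)
    win_prob : function (home_team, away_team) -> probability that the
               home team wins (stands in for the fitted logit model)
    """
    rng = random.Random(seed)
    # Start every team at zero so teams without a win still appear.
    wins = {team: 0 for game in games for team in game}
    for _ in range(n_sims):
        for home, away in games:
            p_home = win_prob(home, away)
            # Draw u in [0, 1); the home team wins when u is below p_home.
            u = rng.random()
            winner = home if u < p_home else away
            wins[winner] += 1
    return wins


if __name__ == "__main__":
    # Two-game toy schedule with a flat, made-up 0.53 home win probability.
    schedule = [("PHI", "HOU"), ("HOU", "PHI")]
    print(simulate_season(schedule, lambda home, away: 0.53))
```

Summing wins over all runs mirrors the way the results of the 100 simulations are added together; dividing by `n_sims` would give an average season instead.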
The summary statistics of the training data are contained in Table 1. The logit results from Equation 1, without the fixed effects, are reported in Table 2. Both the log odds ratios and the MEs are reported in this table. The results show that the individual regressors included in the model have plausible impacts. Looking at the log odds ratios, playing at home and a higher previous win-loss percentage (WL%) for a team both make a win more likely, whereas a higher WL% for the opponent makes a win less likely. These results support the presence of the HFA. The right half of the table shows the MEs for each variable. We are mainly interested in the ME for the Home variable, which is 0.064. This means that the probability of winning a game at home rather than away goes up by 6.4 percentage points. This is the average HFA across all of the teams. The HFA for each team is presented in Figure 1. In our sample, PHI seems to have the highest home advantage and HOU seems to have the lowest (negative, in fact) home advantage.
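To make the marginal effect of a dummy variable concrete, the sketch below computes the average change in predicted win probability when Home switches from 0 to 1 in a logit model. The coefficient values in the usage example are made up for illustration and are not the estimates in Table 2. The discrete-change definition is an assumption as well; the source does not say how the ME was computed.

```python
import math


def logistic(x):
    """Standard logistic function mapping log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def home_marginal_effect(intercept, b_home, b_wl, b_wl_opp, rows):
    """Average discrete change in P(win) when Home goes from 0 to 1.

    rows : list of (prev_wl, prev_wl_opp) pairs, one per observation.
    All coefficients are hypothetical inputs, not fitted values.
    """
    diffs = []
    for wl, wl_opp in rows:
        base = intercept + b_wl * wl + b_wl_opp * wl_opp
        diffs.append(logistic(base + b_home) - logistic(base))
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Made-up coefficients and two made-up observations.
    sample = [(0.55, 0.45), (0.48, 0.52)]
    print(home_marginal_effect(-0.1, 0.25, 2.0, -2.0, sample))
```

Because the logistic curve is steepest at a probability of 0.5, the same Home coefficient yields a smaller ME for games that are already lopsided.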
Variable | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|
Home | 0.500 | 0.500 | 0.000 | 0.500 | 1.000 |
Prev WL % | 0.500 | 0.114 | 0.000 | 0.500 | 1.000 |
PrevWL% Opp | 0.500 | 0.114 | 0.000 | 0.500 | 1.000 |
Season | 2018.000 | 0.816 | 2017.000 | 2018.000 | 2019.000 |
The model was trained using the 2017–2019 MLB regular season games. The schedule for the 2020 season was estimated using the schedule from the 2019 season. While the dates will be off slightly, the team pairings will be nearly the same. The wins and losses of the 100 simulations were added to form the result of the 2020 season. The overall results are visualized in Figure 2, while the divisional results are shown in Table 3. Table 4 provides statistics calculated for each simulated season and then averaged. These include the correlation between the full season and the half and no-fan seasons, using both the overall rankings and the win-loss percentage (WL%). The full seasons’ rank correlations are higher with the no-fans seasons (0.825) than with the half seasons (0.735). The correlations using WL% are similar. The standard deviation of the predicted win probabilities is lower for the no-fans seasons (0.073) than for the full (0.085) and half seasons (0.085). The correlation between the home effect and the standard deviation of the win probabilities is negative for the no-fans seasons (-0.221). In other words, the higher the home effect, the lower the variance.
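A rank correlation between two scenarios can be computed as in the following sketch, which assumes Spearman's definition (the Pearson correlation of the ranks). The source does not state which rank correlation was used, and the WL% values in the usage example are invented.

```python
def ranks(values):
    """Rank values from 1 (largest) downward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Invented WL% values for four teams under two scenarios.
    full_season = [0.60, 0.55, 0.48, 0.40]
    no_fans = [0.58, 0.56, 0.45, 0.47]
    print(rank_correlation(full_season, no_fans))
```

Passing the WL% values straight to `pearson` gives the second kind of correlation mentioned above, which uses win-loss percentages instead of rankings.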
The neural network was also used as the win predictor in 100 simulations, and the results are very similar to those of the logit win prediction model, which shows that the simulation results are robust. Table 4 shows the statistical results of both models. The correlation and standard deviation differences are approximately the same between the two models.
The results of the simulations are available as Extended data (Ehrlich, 2020b), and the code necessary for replicating the results, including training the models, is hosted on GitHub (Ehrlich, 2020a).
Based on the above results, since the team-home effect is symmetric between home and away, teams will not necessarily win or lose any additional games in neutral stadiums: teams with a high home field effect will lose more neutral games that would have been at home but will win more neutral games that would have been away. The greater the home-team ME, the less variance there will be in the predicted win probabilities. To verify this assumption, we calculated the correlation between the home effects and the standard deviation (SD) of win probabilities for a full season (0.361) and a no-fan season (-0.221). Since the home effect is symmetric for each team (the away field disadvantage equals the negative of the home field advantage), decreasing the variance does not affect the overall expected WL% for each team. However, the results of individual games will be different, since the home effect is asymmetric between teams. For example, if the Cubs (highest home effect in the NL Central) play the Cardinals (lowest home effect in the NL Central), the Cubs will have a larger advantage playing at home than the Cardinals will have playing at home (besides team fixed effects). These differences are removed in the No-Fan scenario, and the outcome will be based solely upon the talent of the teams. However, on average there is only a slight change in overall WL% (or playoff berths); what changes is the SD of the results (see Table 3). Without fans, any advantage (or disadvantage) from playing at home, which causes higher levels of variance, is removed. This stabilizes the outcome based upon true team talent. As fewer games are played, the half season will have more upsets, but its SD is close to that of the full season.
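The symmetry argument can be illustrated with a stylized calculation. It assumes, purely for illustration, that a team has a neutral-site win probability p, that the home effect h adds to or subtracts from p directly (the logit model is not additive in probabilities), and that the team plays n/2 games at home and n/2 away:

```latex
\begin{aligned}
\mathbb{E}[W_{\text{full}}] &= \tfrac{n}{2}\,(p + h) + \tfrac{n}{2}\,(p - h) = n\,p = \mathbb{E}[W_{\text{neutral}}], \\
\operatorname{Var}_{\text{games}}\!\left(p_{\text{full}}\right) &= \tfrac{1}{2}\,(p + h - p)^2 + \tfrac{1}{2}\,(p - h - p)^2 = h^2, \\
\operatorname{Var}_{\text{games}}\!\left(p_{\text{neutral}}\right) &= 0 .
\end{aligned}
```

Under these assumptions the expected win total is unchanged on neutral fields, while the spread of the team's per-game win probabilities shrinks, in line with the lower SD reported for the no-fans seasons.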
This paper analyzes MLB data from the 2017–2019 seasons to estimate the win-loss probabilities for the 2020 season for each of the 30 teams in the League using logit regressions and a neural network. The Arizona Plan’s neutralization of the HFA would not significantly affect the overall outcome of the season. In fact, our model predicts that an Arizona Plan season will produce results that are based more on the true talent of the teams. Further, our simulation shows that there will be less variance in the win probability between any two teams, which we estimate will cause a larger divide between the best and worst teams. In conclusion, we believe that the results of the Arizona Plan will be similar to those of a regular season with fans, and that the teams’ standings at the end of the regular season will be more predictable than in a normal season.
Zenodo: Syracuse-University-Sport-Analytics/MLBCovid19. https://doi.org/10.5281/zenodo.3775959 (Ehrlich, 2020a).
This project contains the following source data files:
data/2008_2019Games.csv. (Input data scraped using the scrapingMLB.ipynb.)
data/divisions.csv. (Input team division data for grouping by division.)
data/mlbTeamColors.csv. (Input team colors for the visualizations.)
Source data are also available on GitHub: https://github.com/Syracuse-University-Sport-Analytics/MLBCovid19.
Harvard Dataverse: Replication Data for: COVID-19 Countermeasures, Major League Baseball, and the Home Field Advantage. https://doi.org/10.7910/DVN/OOMWSD (Ehrlich, 2020b).
This project contains the following extended data files:
divisionRankings. (Results of simulation using the logit model.)
divisionRankingsNN. (Results of simulation using the neural network model.)
homeEffectLogit. (Team home effects using the logit model.)
homeEffectNN. (Team home effects using the neural network model.)
modelCorrelationsSummaryWithNNResults. (Simulation statistics from both the logit and neural network models.)
Zenodo: Syracuse-University-Sport-Analytics/MLBCovid19. https://doi.org/10.5281/zenodo.3775959 (Ehrlich, 2020a).
This project contains the following source files:
pythonStatcastScraper/scrapingMLB.ipynb. (Python Jupyter Notebook code for scraping Statcast.)
halfSeasonPrediction.Rmd. (R Markdown Notebook code for developing the logit and neural network models. Also contains the code for running the simulations.)
All the other data is intermediate output from the simulations. The important output files are located in the above Harvard Dataverse repository.
Source code is also available on GitHub: https://github.com/Syracuse-University-Sport-Analytics/MLBCovid19.
Mixed data and code hosted on GitHub and Zenodo are available under the terms of the GNU General Public License v3.0.
Data hosted on Harvard Dataverse are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Mathematics, Sports, Social Choice
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Sports medicine and exercise science
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Exercise and sports psychology.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 1 20 May 20 |
read | read | read |