Research Article

COVID-19 countermeasures, Major League Baseball, and the home field advantage: Simulating the 2020 season using logit regression and a neural network

[version 1; peer review: 3 approved with reservations]
PUBLISHED 20 May 2020

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Coronavirus collection.

Abstract

Background: In the wake of COVID-19, almost all major league sports have been either cancelled or postponed. The sports industry suffered a major blow with the uncertainty of sporting events being held in the near future. Various scenarios of how and when sports might recommence have been discussed. This paper examines various scenarios of how Major League Baseball team performance is going to be impacted by the presence of fans, or the lack thereof, in the context of physical distancing and other COVID-19 countermeasures.
Methods: The paper simulates, using a neural network and a logit regression model, the win-loss probabilities for various scenarios under consideration and also estimates the home effect for each team using data for the 2017-2019 seasons.
Results: The model demonstrates that each team's home effect is symmetric between home and away, so teams will not necessarily win or lose any additional games in neutral stadiums: teams with a high home field effect will lose more neutral games that would have been at home but will win more neutral games that would have been away. However, the result of individual games will be different since the home effect is asymmetric between teams. Our simulation demonstrates that these individual game differences may lead to a slight difference in Play-Off Berths between a full season, a half season, and a full season without fans.
Conclusions: Without fans, any advantage (or disadvantage) from home field advantage is removed. Our models and simulation demonstrate that this will reduce the variance. This stabilizes the outcome based upon true team talent, which we estimate will cause a larger divide between the best and worst teams. This estimation helps decision makers understand how individual team performance will be impacted as they prepare for the 2020 season under the new circumstances.

Keywords

MLB, Baseball, COVID-19, Neural Network, Logit

Introduction

The 2019–2020 pandemic from the novel coronavirus (COVID-19) has brought unprecedented countermeasures to every sector of the economy, including individuals, groups, institutions, and industries. The sports industry took one of the biggest hits, with all major leagues in the U.S. cancelling or halting their events. While these actions were necessary to address the public health concern, each segment is now floating various proposals to resume operations and give some relief to the significant portion of the economy that the sports industry comprises.

Major League Baseball (MLB) is likely to be the first American professional sporting league to resume, probably in May or June (Passan, 2020). Players are willing (although not all players agree on the method), the League is willing, the Arizona government is willing, and health professionals have approved a plan to move forward, known as the Arizona Plan. This plan calls for players, coaches, and staff to be quarantined in hotels around the Phoenix area, and to play in empty ballparks that include the ten Cactus League Spring Training parks, Chase Field, and other Phoenix ballparks. One interesting aspect of this arrangement is that the stadium size will not matter because there will not be any fans. This will be an opportunity for MLB to get back into the spotlight and accumulate massive television viewership that MLB has not seen in decades. The experience will be completely optimized for TV viewing, so the league will finally be able to experiment with proposed rule changes, including removing mound visits to make the game go faster; adding a Robo Umpire, which was successfully tested last season via a partnership with the independent Atlantic League (Bogage, 2019); and expanding rosters to give players more rest in the extremely hot temperatures of Phoenix. While all of this will alter predictions of who is going to the playoffs, probably the biggest impact that this plan will have on the games is the lack of the home field advantage (HFA): the advantage that the home team has over the visiting team due to the home team having fans, the home team's familiarity with its own ballpark, and the away team having to travel.

Baseball has been shown in previous studies to be less susceptible to the HFA effect than other professional sports (Edwards & Archambault, 1979; Gómez et al., 2011; Pollard et al., 2017). Despite this, there is a measurable home field advantage in baseball, as shown by Jones (2015, 2018). Building on this, we extend the analysis for MLB given the uncertainty about which scenario the League will follow for the 2020 season. In particular, we simulate the win-loss probabilities for three different scenarios and estimate the home advantage for each team using the past three seasons’ data. This estimation helps us understand how individual team performance is going to be impacted as teams prepare for the 2020 season in the new circumstances.

Methods

Data sources

We use the MLB 2017–2019 season data for the 30 teams represented in the league. The data were obtained from the MLB Advanced Media’s Baseball Savant Website using the Python package PyBaseball 1.0.4 (LeDoux, 2017/2020). The data show that out of the 7,290 home games played during the 2017–2019 seasons, 3,881 (53.237%) resulted in a win for the home team and the remaining 3,409 (46.763%) resulted in a loss. Next, we seek to quantify the HFA’s role in this difference.

Calculating home advantage

There are various techniques to calculate the home advantage depending on the sport, gender, league, and the nature of scoring (Jones, 2015). Pollard et al. (2017) use a general linear model to fit the home advantage. However, because we have a categorical variable of win or lose, we need to follow a non-linear approach. To test the hypothesis that teams have home-field advantage, we apply a logit regression model to predict the probability of winning as a function of home game dummy, team fixed effects, opponent fixed effects, and the win-loss records. We estimate the following regression equation:

$$\mathrm{Win}_i = \alpha_0 + \alpha_1 \mathrm{Home}_i + \alpha_2 \mathrm{Team}_i + \alpha_3 (\mathrm{Home} \times \mathrm{Team})_i + \alpha_4 (\mathrm{Home} \times \mathrm{Opp})_i + \eta Z_i + \varepsilon_i \qquad (1)$$

where $\mathrm{Win}_i$ is a dummy variable that takes the value 1 if the recorded game resulted in a win for the team in the team-opponent pair, and zero otherwise; $\mathrm{Home}_i$ is a dummy variable for a home game; $\mathrm{Team}_i$ controls for the individual team fixed effects; $\mathrm{Opp}_i$ controls for the opponent fixed effects; $Z_i$ represents the win-loss percentage for the team as well as the opponent; and $\varepsilon_i$ stands for the error term. The HFA is calculated accounting for the team fixed effects as well as the opponent fixed effects by interacting Home with Team and Opp separately. We run the logit model on all the data, with Home = 1 for the home team and Home = 0 for the travelling team. Doing so separates the team fixed effects from the home field advantage. The model in Equation 1 is used to estimate both the win probability and the HFA per team. The HFA is obtained by calculating the marginal effect (ME) of Home on the win probability for each team separately.
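
As an illustration of how the data could be laid out for Equation 1, the following Python sketch turns one game into two team-game observations (one with Home = 1, one with Home = 0) and expands the Home × Team and Home × Opp interactions into 0/1 columns. This is only a sketch under stated assumptions, not the authors' R code: the function names `game_rows` and `interaction_dummies`, the three-team subset, and the choice of the first team as the omitted reference category are all hypothetical. Two observations per game would be consistent with the 14,580 observations reported in Table 2 for 7,290 games.

```python
def game_rows(home_team, away_team, home_won):
    """Return the two team-game observations produced by a single game."""
    return [
        {"Win": int(home_won), "Home": 1, "Team": home_team, "Opp": away_team},
        {"Win": int(not home_won), "Home": 0, "Team": away_team, "Opp": home_team},
    ]


def interaction_dummies(row, teams):
    """Expand Home x Team and Home x Opp into 0/1 columns.

    Assumption: the first entry of `teams` is the omitted reference category.
    """
    cols = {}
    for t in teams[1:]:
        cols[f"Home:Team_{t}"] = row["Home"] * int(row["Team"] == t)
        cols[f"Home:Opp_{t}"] = row["Home"] * int(row["Opp"] == t)
    return cols


teams = ["BOS", "NYY", "TBR"]  # illustrative subset of the 30 clubs
home_row, away_row = game_rows("NYY", "BOS", home_won=True)

print(home_row, interaction_dummies(home_row, teams))
print(away_row, interaction_dummies(away_row, teams))
print(2 * 7290)  # observations implied by two rows per game
```

For the travelling team every interaction column is zero, so its row identifies only the team and opponent effects, while the home row additionally identifies the team-specific home effect.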

Development of the neural network model

A neural network model was also created to act as a robustness check for the logit win prediction model. The software to train the model is hosted on GitHub (Ehrlich, 2020a). We used the R package nnet 7.3–14 (Ripley & Venables, 2020) for the neural network platform, and trained and tuned the model with the R package caret 6.0–86 (Kuhn et al., 2020). We developed a simulator to estimate what might happen if: 1) The full 2020 season continued on in a parallel universe devoid of COVID-19; 2) MLB waits and is able to return and play half a season to packed stadiums around the All Star Break, which is assuming an extremely optimistic timeline of a return to normal life; 3) a full season is played without fans, which is likely the only way they will be able to play this season (i.e., the Arizona Plan). The simulation was executed 100 times, and the logit win prediction model was used as the basis for predicting each win. A random number between 0 and 1 was generated and checked against the win probability provided by the model. If the random number was below the probability, then the team won, otherwise the team lost.
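
The draw-and-compare rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator (which is written in R and hosted on GitHub): the function names, the seed, and the 162-game schedule with win probabilities of 0.532 at home and 0.468 away are hypothetical values chosen for the example.

```python
import random


def simulate_game(win_prob, rng):
    """The team wins when the uniform draw falls below its win probability."""
    return rng.random() < win_prob


def simulate_seasons(schedule_probs, n_seasons=100, seed=0):
    """Return the team's simulated win-loss percentage for each season."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_seasons):
        wins = sum(simulate_game(p, rng) for p in schedule_probs)
        results.append(wins / len(schedule_probs))
    return results


# Hypothetical schedule: 81 home games at 0.532 and 81 away games at 0.468.
schedule = [0.532] * 81 + [0.468] * 81
wl = simulate_seasons(schedule)
print(sum(wl) / len(wl))  # average WL% over the 100 simulated seasons
```

Fixing the seed makes a run repeatable; averaging over the simulated seasons smooths out the luck in any single season.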

Results

The summary statistics of the training data are contained in Table 1. The logit results from Equation 1, without the fixed effects, are reported in Table 2. Both odds ratios and the MEs are reported in this table. The individual regressors included in the model show plausible impacts. Looking at the odds ratios, playing at home and a higher previous win-loss percentage (WL%) for the team make a win more likely, whereas a higher WL% for the opponent makes a win less likely. These results support the presence of the HFA. The right half of the table shows the MEs for each variable. We are mainly interested in the ME for the Home variable, which is 0.064. This means that the probability of winning a game at home, versus away, goes up by 6.4 percentage points. This is the average HFA across all teams. The HFA for each team is presented in Figure 1. In our sample, PHI seems to have the highest home advantage and HOU seems to have the lowest (negative, in fact) home advantage.
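
To make the link between the odds ratios and the marginal effect concrete, the sketch below converts the Table 2 odds ratios into logit coefficients (the natural log of each odds ratio) and evaluates the win probability for a hypothetical matchup in which every WL% variable equals 0.5. The function name and the all-0.5 matchup are assumptions made for illustration, and the home-minus-away difference computed here is evaluated at a single point, so it is not the averaged marginal effect the authors report.

```python
import math

# Odds ratios as reported in Table 2 (model without fixed effects).
ODDS_RATIOS = {
    "Intercept": 0.876,
    "Home": 1.304,
    "PrevWLPerc": 2.307,
    "PrevWLPercOpp": 0.433,
    "WLPercSeasonPrev": 8.251,
    "WLPercSeasonPrevOpp": 0.121,
}


def win_probability(home, wl=0.5, wl_opp=0.5, season_wl=0.5, season_wl_opp=0.5):
    """Logistic win probability; each coefficient is the log of its odds ratio."""
    b = {name: math.log(ratio) for name, ratio in ODDS_RATIOS.items()}
    z = (
        b["Intercept"]
        + b["Home"] * home
        + b["PrevWLPerc"] * wl
        + b["PrevWLPercOpp"] * wl_opp
        + b["WLPercSeasonPrev"] * season_wl
        + b["WLPercSeasonPrevOpp"] * season_wl_opp
    )
    return 1.0 / (1.0 + math.exp(-z))


p_home = win_probability(home=1)
p_away = win_probability(home=0)
print(round(p_home, 3), round(p_away, 3), round(p_home - p_away, 3))
```

For two evenly matched teams this gives a home win probability of about 0.53 and an away win probability of about 0.47, a gap of roughly 0.066, which is close to the reported Home ME of 0.064.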

Table 1. Summary statistics used for training the model.

Variable       Mean      SD     Min       Median    Max
Home           0.500     0.500  0.000     0.500     1.000
Prev WL%       0.500     0.114  0.000     0.500     1.000
Prev WL% Opp   0.500     0.114  0.000     0.500     1.000
Season         2018.000  0.816  2017.000  2018.000  2019.000

Table 2. Results of regression analysis.

                      Logit win prediction model         Logit win prediction marginal effects
Predictors            Odds ratios  Std. error  p         AME     Std. error  p
(Intercept)           0.876        0.174       0.447
Home                  1.304        0.034       <0.001    0.064   0.008       <0.001
PrevWLPerc            2.307        0.167       <0.001    0.203   0.040       <0.001
PrevWLPercOpp         0.433        0.167       <0.001    -0.203  0.040       <0.001
WLPercSeasonPrev      8.251        0.245       <0.001    0.513   0.060       <0.001
WLPercSeasonPrevOpp   0.121        0.245       <0.001    -0.512  0.060       <0.001
Observations          14580
R2 Tjur               0.029

Figure 1. MLB home field advantage effect of individual teams.

The model was trained using the 2017–2019 MLB regular season games. The schedule for the 2020 season was estimated using the schedule from the 2019 season. While the dates will be off slightly, the team pairings will be nearly the same. The wins and losses of the 100 simulations were added to form the result of the 2020 season. The overall results are visualized in Figure 2, while the divisional results are shown in Table 3. Table 4 provides statistics calculated for each simulated season and then averaged. These include the correlation between the full season and the half and no-fan seasons using both the overall rankings and the win-loss percentage (WL%). The full-season rank correlation is higher with the no-fans seasons (0.825) than with the half seasons (0.735). The correlations using WL% are similar. The standard deviation of the predicted win probabilities is lower for the no-fans seasons (0.073) than for the full (0.085) and half seasons (0.085). The home effect was also correlated with the standard deviation of the win probabilities; this correlation is negative for the no-fans seasons (-0.221). In other words, in the no-fans seasons, the higher the home effect, the lower the variance.
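
A rank correlation between two scenarios can be computed as in the following sketch. The text does not state which rank correlation was used, so Spearman's coefficient is an assumption here, as are the helper names. The example uses only the five AL East WL% values from Table 3, whereas the figures in Table 4 are computed per simulated season and then averaged, so this does not reproduce them.

```python
def ranks(values):
    """Rank from 1 (highest value) downward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rank correlation for untied data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# AL East WL% from Table 3, in the order NYY, BOS, TBR, TOR, BAL.
full_season = [0.614, 0.573, 0.564, 0.435, 0.349]
half_season = [0.607, 0.568, 0.582, 0.408, 0.357]
print(spearman(full_season, half_season))
```

In this small example only BOS and TBR swap places between the two scenarios, so the coefficient stays high at 0.9.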

Figure 2. MLB season 2020 change in simulated rank after 100 simulations.

Table 3. Results of the simulation using the logit win prediction model.

                 No-fan season       Full season         Half season
Division    Tm   Rank  Berth  WL%    Rank  Berth  WL%    Rank  Berth  WL%    Home effect
AL Central  CLE  1     y      0.593  1     y      0.587  1     y      0.584  0.034
AL Central  MIN  2     w      0.573  2     -      0.565  2     w      0.572  0.033
AL Central  CHW  3     -      0.419  3     -      0.414  3     -      0.425  0.065
AL Central  KCR  4     -      0.390  4     -      0.397  4     -      0.402  0.062
AL Central  DET  5     -      0.337  5     -      0.340  5     -      0.341  0.050
AL East     NYY  1     y      0.623  1     y      0.614  1     y      0.607  0.115
AL East     BOS  2     -      0.569  2     w      0.573  3     -      0.568  0.003
AL East     TBR  3     -      0.567  3     -      0.564  2     w      0.582  0.069
AL East     TOR  4     -      0.438  4     -      0.435  4     -      0.408  0.072
AL East     BAL  5     -      0.347  5     -      0.349  5     -      0.357  0.092
AL West     HOU  1     y      0.653  1     y      0.650  1     y      0.657  -0.014
AL West     OAK  2     w      0.580  2     w      0.576  2     -      0.566  0.115
AL West     SEA  3     -      0.473  3     -      0.468  3     -      0.491  0.016
AL West     LAA  4     -      0.469  4     -      0.467  4     -      0.462  0.056
AL West     TEX  5     -      0.463  5     -      0.454  5     -      0.436  0.070
NL Central  MIL  1     y      0.565  1     y      0.555  1     y      0.553  0.071
NL Central  CHC  2     w      0.544  2     w      0.545  2     -      0.550  0.118
NL Central  STL  3     -      0.534  3     -      0.542  3     -      0.544  0.056
NL Central  PIT  4     -      0.450  4     -      0.451  4     -      0.464  0.084
NL Central  CIN  5     -      0.437  5     -      0.445  5     -      0.440  0.099
NL East     WSN  1     y      0.569  1     y      0.565  2     w      0.556  0.016
NL East     ATL  2     w      0.559  2     w      0.559  1     y      0.558  -0.005
NL East     NYM  3     -      0.494  3     -      0.485  3     -      0.505  0.045
NL East     PHI  4     -      0.475  4     -      0.477  4     -      0.479  0.163
NL East     MIA  5     -      0.394  5     -      0.393  5     -      0.392  0.107
NL West     LAD  1     y      0.632  1     y      0.632  1     y      0.634  0.081
NL West     ARI  2     -      0.540  2     -      0.539  2     w      0.553  0.049
NL West     COL  3     -      0.500  3     -      0.499  3     -      0.488  0.092
NL West     SFG  4     -      0.437  4     -      0.441  4     -      0.437  0.071
NL West     SDP  5     -      0.432  5     -      0.423  5     -      0.402  0.050

Table 4. Key summary statistics of the simulations using the logit and neural network (NN) win prediction models.

Model simulated                             Logit    NN
Simulated Seasons                           100      100
Full-NoFans Rank Correlation                0.825    0.814
Full-Half Rank Correlation                  0.735    0.719
Full-NoFans WL% Correlation                 0.823    0.817
Full-Half WL% Correlation                   0.734    0.718
NoFans WL% SD                               0.073    0.073
Full WL% SD                                 0.085    0.086
Half WL% SD                                 0.085    0.086
Full WinProb SD-HomeEffect Correlation      0.361    0.251
NoFans WinProb SD-HomeEffect Correlation    -0.221   -0.268

Note: These statistics are calculated for each season and averaged. WL% is predicted based upon the Win Prediction.

This neural network was also used as the win predictor in 100 simulations and the results are very similar to the logit win prediction model, which shows robustness in the simulation results. Table 4 shows the statistical results of both models. The correlation and standard deviation differences are approximately the same between the two models.

The results of the simulations are available as Extended data (Ehrlich, 2020b), and the code necessary for replicating the results, including training the models, is hosted on GitHub (Ehrlich, 2020a).

Discussion

Based on the above results, since the team home effect is symmetric between home and away, teams will not necessarily win or lose any additional games in neutral stadiums: teams with a high home field effect will lose more neutral games that would have been at home but will win more neutral games that would have been away. The greater the home-team ME, the less variance there will be in the predicted win probabilities. To verify this assumption, we calculated the correlation between the HomeEffects and the standard deviation (SD) of win probabilities for a full season (0.361) and a no-fan season (-0.221). Since the home effect is symmetric for each team (the away field disadvantage = -the home field advantage), decreasing the variance does not affect the overall expected WL% for each team. However, the result of individual games will be different since the home effect is asymmetric between teams. For example, if the Cubs (highest home effect in the NL Central) play the Cardinals (lowest home effect in the NL Central), the Cubs will have a larger advantage playing at home than the Cardinals will have playing at home (apart from team fixed effects). These differences are removed in the No-Fan scenario, and the outcome will be based solely upon the talent of the teams. However, on average there is only a slight change in overall WL% (or playoff berth); what changes is the SD of the results (see Table 3). Without fans, any advantage (or disadvantage) from home field advantage, which causes higher levels of variance, is removed. This stabilizes the outcome based upon true team talent. As fewer games are played, the half season will have more upsets, but its SD is close to that of the full season.
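
The symmetry argument can be checked with a small numerical sketch. It assumes, purely for illustration, a team whose neutral-site win probability is 0.5 and whose home-away gap of 0.064 (the average Home ME in Table 2) is split evenly and additively on the probability scale; the function name and the 81/81 home-away split are likewise hypothetical.

```python
from statistics import mean, pstdev


def game_probabilities(neutral_prob, home_gap, n_home=81, n_away=81):
    """Per-game win probabilities when the home-away gap is split symmetrically."""
    half = home_gap / 2
    return [neutral_prob + half] * n_home + [neutral_prob - half] * n_away


with_fans = game_probabilities(0.5, 0.064)  # home games at 0.532, away games at 0.468
no_fans = game_probabilities(0.5, 0.0)      # neutral parks: every game at 0.5

print(mean(with_fans), pstdev(with_fans))
print(mean(no_fans), pstdev(no_fans))
```

Under these assumptions the expected WL% is 0.5 in both cases, while the spread of the per-game win probabilities drops from 0.032 to zero once the home effect is removed.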

Conclusion

This paper analyzes the previous season MLB data to estimate the win-loss probabilities for the 2020 season for each of the 30 teams in the League using logit regressions and a neural network. The Arizona Plan’s neutralization of HFA would not significantly affect the overall outcome of the season. In fact, our model predicts that the Arizona Plan season will produce season results that are based more on the true talent of the teams. Further, our simulation demonstrates that there will be less variance in the win probability between any two teams, which we estimate will cause a larger divide between the best and worst teams. In conclusion, we believe that the results of the Arizona Plan will be similar to a regular season with fans, and that the teams’ standings at the end of the regular season will be more predictable than a normal season.

Data availability

Source data

Zenodo: Syracuse-University-Sport-Analytics/MLBCovid19. https://doi.org/10.5281/zenodo.3775959 (Ehrlich, 2020a).

This project contains the following source data files:

  • data/2008_2019Games.csv. (Input data scraped using the scrapingMLB.ipynb.)

  • data/divisions.csv. (Input team division data for grouping by division.)

  • data/mlbTeamColors.csv. (Input team colors for the visualizations.)

Source data are also available on GitHub: https://github.com/Syracuse-University-Sport-Analytics/MLBCovid19.

Extended data

Harvard Dataverse: Replication Data for: COVID-19 Countermeasures, Major League Baseball, and the Home Field Advantage. https://doi.org/10.7910/DVN/OOMWSD (Ehrlich, 2020b).

This project contains the following extended data files:

  • divisionRankings. (Results of simulation using the logit model.)

  • divisionRankingsNN. (Results of simulation using the neural network model.)

  • homeEffectLogit. (Team home effects using the logit model.)

  • homeEffectNN. (Team home effects using the neural network model.)

  • modelCorrelationsSummaryWithNNResults. (Simulation statistics from both the logit and neural network models.)

Zenodo: Syracuse-University-Sport-Analytics/MLBCovid19. https://doi.org/10.5281/zenodo.3775959 (Ehrlich, 2020a).

This project contains the following source files:

  • pythonStatcastScraper/scrapingMLB.ipynb. (Python Jupyter Notebook code for scraping Statcast.)

  • halfSeasonPrediction.Rmd. (R Markdown Notebook code for developing the logit and neural network models. Also contains the code for running the simulations.)

  • All the other data is intermediate output from the simulations. The important output files are located in the above Harvard Dataverse repository.

Source code is also available on GitHub: https://github.com/Syracuse-University-Sport-Analytics/MLBCovid19.

Mixed data and code hosted on GitHub and Zenodo are available under the terms of the GNU General Public License v3.0.

Data hosted on Harvard Dataverse are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

How to cite this article

Ehrlich J and Ghimire S. COVID-19 countermeasures, Major League Baseball, and the home field advantage: Simulating the 2020 season using logit regression and a neural network [version 1; peer review: 3 approved with reservations] F1000Research 2020, 9:414 (https://doi.org/10.12688/f1000research.23694.1)
Open Peer Review

  • Federico Fioravanti, Institute for Logic, Language and Computation, Universiteit van Amsterdam, Amsterdam, North Holland, The Netherlands. Reviewer Report 28 Mar 2023: Approved with Reservations. https://doi.org/10.5256/f1000research.26143.r164101

  • Ismail Dergaa, Primary Health Care Corporation (PHCC), Qatar. Reviewer Report 28 Mar 2023: Approved with Reservations. https://doi.org/10.5256/f1000research.26143.r164098

  • Garry Kuan, Brunel University, Uxbridge, UK. Reviewer Report 10 Jun 2022: Approved with Reservations. https://doi.org/10.5256/f1000research.26143.r140121