Baseball enthusiasts may be the nerdiest of sports fans. Ever since the development of the box score in 1858, the game has inspired an obsession with statistics. Those statistics have grown increasingly complex with the advent of sabermetrics, aka Moneyball. OPS, WHIP, and FIP have replaced old-school stats like batting average, RBIs, and won-lost records.
One of the best-known ways to compare players across teams and across eras is through Wins Above Replacement or WAR. The idea is to consider a wide range of hitting, pitching, and fielding statistics to evaluate how much an individual player contributes to his team’s success. That player’s performance is compared to that of a “replacement player,” essentially a borderline major leaguer. An all-star is worth about 5 extra wins a year. An MVP is worth 8 or more wins.
WAR has its share of critics. It seems to underrate positions such as relief pitcher and catcher. Mariano Rivera has a career WAR of 56.3, which places him 79th among pitchers, just behind Jerry Koosman, Tim Hudson, and Dave Stieb. Johnny Bench ranks first among catchers with a WAR of 75.1, but that would only place him seventh among second basemen, tied with Lou Whitaker. Disentangling individual contributions in a team sport is hard to do with any degree of certainty. But how closely does Wins Above Replacement project …wins?
WAR as a Model
I spent my career as a bank examiner, and we were sometimes called upon to review a bank’s models. Models have been described as “simplified representations of real-world relationships among observed characteristics, values, and events.” A big bank might have thousands of models that it uses for credit decisions, stress testing, and risk management. Examiners look for certain things when assessing a model. WAR is a sort of model, and we can evaluate it using a similar approach.
The most important consideration is whether a model measures what it purports to measure. In this case, how well do wins above replacement, at an aggregate team level, predict wins? I looked at aggregate WAR and actual wins for the top four teams in each league for 1969, 1979, 1989, 1999, 2009, and 2019. I tried to keep the sample size to a manageable level to allow for analysis of individual teams, especially those that are outliers. For purposes of this analysis, I used WAR as defined by baseball-reference.com.
Results
A correlation coefficient (r) measures the strength of the linear relationship between two variables and ranges between -100% and +100%. An r of zero indicates no relationship. A team’s WAR and win total had a correlation of 83.5%. More on how good that is a little later. You can also use regression analysis to project wins based on the relationship between WAR and Wins, then compare predicted wins with actual wins for individual teams. The graph below shows this relationship.
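As a minimal sketch of that arithmetic, here is how the correlation and the regression line could be computed in Python. The team totals below are hypothetical stand-ins, not the actual 48-team sample.

```python
import numpy as np

# Hypothetical (team WAR, actual wins) pairs, not the real sample
war  = np.array([45.2, 38.7, 52.1, 41.0, 35.5, 48.3])
wins = np.array([97, 88, 103, 90, 84, 95])

# Pearson correlation coefficient (the r reported as a percentage in the text)
r = np.corrcoef(war, wins)[0, 1]
print(f"correlation: {r:.1%}")

# Ordinary least squares fit: predicted wins = slope * WAR + intercept
slope, intercept = np.polyfit(war, wins, 1)
predicted = slope * war + intercept

# Residuals (actual minus predicted) flag over- and underperformers
residuals = wins - predicted
print("biggest overperformance:", residuals.max().round(1))
print("biggest underperformance:", residuals.min().round(1))
```

Sorting teams by those residuals is what surfaces the over- and underperformers discussed below.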
The solid upward sloping line shows the linear relationship between WAR and Wins. A flat horizontal line would indicate no correlation. The dots show results for individual teams. While the dots are generally clustered around the solid line, it can be useful to also look at outliers – those with the biggest difference between predicted Wins and actual Wins. The top overperformers and underperformers relative to their WARs are shown below:
There’s not an obvious pattern with these teams. The Mets, Dodgers, and 1999 Braves had great pitching, but the 2019 Braves were just okay in that respect. All four of the overachievers were led by Manager of the Year winners, but only Gil Hodges of the 1969 Mets won it for that year. Plus, leaders of three of the four underachievers had also won a Manager of the Year Award, and Jimy Williams won it for the “underperforming” Red Sox in 1999.
You can also look at the ability of the model to rank order. In other words, how closely did a team’s rank in terms of WAR correspond to its rank in terms of wins? The two ranks didn’t match exactly, but they were rarely far off. Of the 48 teams reviewed, the average difference in the rankings was 6.5, and in only two cases did the ranks differ by more than 20. Since we only looked at the top four teams in each league, the number of victories is closely bunched together. It’s easier to distinguish between good teams and bad teams than between good teams and very good ones. The 6.5 difference in ranking translates to an average difference of less than 3 wins.
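One rough way to run that rank-order check, again on hypothetical numbers:

```python
import numpy as np

war  = np.array([45.2, 38.7, 52.1, 41.0, 35.5, 48.3])   # hypothetical team WAR
wins = np.array([97, 88, 103, 90, 84, 95])               # hypothetical wins

# argsort of argsort turns raw values into ranks (0 = lowest)
war_rank  = war.argsort().argsort()
wins_rank = wins.argsort().argsort()

rank_gap = np.abs(war_rank - wins_rank)
print("average rank difference:", rank_gap.mean())
print("largest mismatch:", rank_gap.max())
```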
While WAR technically projects the number of wins, it’s more a reflection of run differential. WAR estimates how hitting produces runs and how pitching and fielding prevent them. Actual wins may reflect either luck or clutch play, depending on your perspective. Pythagorean W-L projects win totals from runs scored and runs allowed and doesn’t take winning or losing close games into account. If we compare WAR to Pythagorean Wins, the overall result isn’t much different, with a correlation of 84.7%.
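For reference, the classic Bill James version of the Pythagorean expectation uses an exponent of 2 (later refinements use roughly 1.83). A quick sketch, with made-up run totals:

```python
def pythagorean_wins(runs_scored: float, runs_allowed: float, games: int = 162) -> float:
    """Expected wins from run differential alone (exponent-2 version)."""
    win_pct = runs_scored**2 / (runs_scored**2 + runs_allowed**2)
    return games * win_pct

# A team that scores 820 runs and allows 700 projects to about 94 wins
print(round(pythagorean_wins(820, 700), 1))
```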
A Challenger Model
A correlation of 83.5% seems good. WAR obviously predicts Wins better than a simple roll of the dice. But a better question is, compared to what? This is called “benchmarking,” and it usually starts with a challenger model. How well does WAR perform compared to some other, plausible approach to predicting a team’s wins?
That means coming up with a challenger model. There are rival approaches similar to WAR, such as Win Shares, developed by Bill James. But much of the debate around WAR revolves around comparing it to earlier approaches to measuring baseball excellence. I’m using as a challenger model what I’ll call OSS (“old school stat”). OSS has two components: batting average and ERA. Specifically:
OSS = (Team BA / League BA) + (League ERA / Team ERA)
You could get pretty much all the information you needed to calculate OSS from a Sunday newspaper – in 1975. OSS weights pitching and hitting equally; the ERA ratio is inverted because a lower ERA is better, so a league-average team scores exactly 2.0. It doesn’t account for walks, baserunning, extra base hits, or fielding. Even with these limitations, you’d expect team batting average and ERA to correlate with wins.
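In code, the definition above amounts to the following (the team and league figures here are invented for illustration):

```python
def oss(team_ba: float, league_ba: float, team_era: float, league_era: float) -> float:
    # Hitting above the league pushes the first term over 1.0; the ERA
    # ratio is flipped so better-than-average pitching does the same.
    return (team_ba / league_ba) + (league_era / team_era)

# Hypothetical team: hits .271 in a .258 league, 3.52 ERA in a 4.05 league
print(round(oss(0.271, 0.258, 3.52, 4.05), 3))  # about 2.201, above average both ways
```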
OSS turns out to be a pretty good predictor of wins, with a correlation of 74.8%. It also does okay in terms of rank ordering. OSS predicted the three winningest teams exactly. The average difference in the rankings of actual wins and OSS was 8.2, which translates to a little fewer than 4 wins.
OSS equally weights relative batting average and relative ERA. We can also come up with an optimized OSS by using multiple regression. That will weight relative BA and relative ERA in a way that best fits the data. In that case, the correlation rises to 81.05%, nearly as high as for WAR.
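A sketch of that fitting step, assuming hypothetical relative-BA and relative-ERA values: instead of adding the two ratios with equal weight, let least squares choose the weights that best predict wins.

```python
import numpy as np

rel_ba  = np.array([1.05, 0.98, 1.08, 1.01, 0.96, 1.03])  # Team BA / League BA
rel_era = np.array([1.15, 0.99, 1.10, 1.04, 0.95, 1.07])  # League ERA / Team ERA
wins    = np.array([97, 88, 103, 90, 84, 95])

# Design matrix with an intercept column; lstsq finds the best-fit weights
X = np.column_stack([np.ones_like(rel_ba), rel_ba, rel_era])
coef, *_ = np.linalg.lstsq(X, wins, rcond=None)
intercept, w_ba, w_era = coef
print(f"wins ~ {intercept:.1f} + {w_ba:.1f}*relBA + {w_era:.1f}*relERA")
```

On the real 48-team sample, that refit is what lifts the correlation from 74.8% to 81.05%.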
While OSS relies on traditional stats, it also incorporates some more modern elements. It looks at batting average and ERA not in isolation but relative to league averages at the time. Lefty O’Doul had a lifetime batting average 64 points higher than Carl Yastrzemski’s. But Lefty played in the 1920s and 1930s. In his best year, 1929, the league batting average was .294. Yaz’s prime years were in the 1960s and early 1970s, when he could lead the league in batting three times without ever hitting above .330. If we just look at nominal batting average and nominal ERA, the correlation with wins is only 43%. The table below compares the four methods.
Concluding Thoughts
A team’s aggregate WAR is a reasonably accurate predictor of a team’s wins, but not that much better than less sophisticated, more traditional baseball statistics. A bigger difference emerges when we adjust for league norms, especially when we compare teams (and presumably individual players) across eras. That’s also a relatively modern innovation. Total Baseball, first published in 1989, included adjustments for parks and league norms. My 1993 edition of The Baseball Encyclopedia did not.
After adjusting for when and where a team played, WAR barely outperforms more traditional measures, at least for the sample I used. Using all 2,600+ major league teams from 1901 on might yield a different answer. None of this is to say that fielding, baserunning, or hitting for power doesn’t matter. But including those additional, often imprecise measurements may create nearly as much noise as signal. WAR is a useful measure, but it’s not the be-all and end-all of baseball performance.
If you enjoyed this article, feel free to pass it on to your friends and colleagues.