Wednesday, June 25, 2014

Fixing Run Differential, Part 2

Updating my previous post, I have tweaked my formula to estimate wins from run differential on a game-by-game basis.  The table below shows the expected win totals for each team so far this season based on runs scored (wRS), runs allowed (wRA), and run differential (wRD).  The expected win percentage (w%) and record are based on the run differential method.  In most cases, the run differential win total is in between the offensive estimate (wRS) and the run prevention (wRA) estimate.  Originally, I was just going to use an average of these two, but I think the method I used with run differential more accurately assesses the value of runs in certain situations.

Rank Team Division wRS wRA wRD w% Record
1 OAK AL West 43.3 45.0 46.2 0.600 46 - 31
2 WSN NL East 36.6 44.2 41.5 0.546 42 - 34
3 MIL NL Central 41.9 40.9 43.1 0.546 43 - 36
4 SEA AL West 38.1 44.6 42.6 0.546 43 - 35
5 SFG NL West 39.1 42.5 41.9 0.544 42 - 35
6 LAD NL West 40.6 43.7 42.9 0.543 43 - 36
7 LAA AL West 41.7 37.9 40.4 0.539 40 - 35
8 TOR AL East 43.6 39.0 42.4 0.537 42 - 37
9 STL NL Central 36.8 46.5 41.8 0.536 42 - 36
10 DET AL Central 38.9 35.5 38.5 0.527 38 - 35
11 KCR AL Central 37.0 41.0 40.1 0.521 40 - 37
12 BAL AL East 38.1 38.8 39.0 0.513 39 - 37
13 CIN NL Central 35.9 41.2 38.4 0.505 38 - 38
14 NYM NL East 36.8 41.4 38.8 0.504 39 - 38
15 MIA NL East 38.8 37.0 37.7 0.490 38 - 39
16 ATL NL East 32.8 42.0 37.2 0.489 37 - 39
17 PIT NL Central 37.2 37.5 37.4 0.486 37 - 40
18 CHC NL Central 34.0 38.2 36.2 0.483 36 - 39
19 COL NL West 42.5 32.5 37.0 0.481 37 - 40
20 NYY AL East 37.1 37.4 36.5 0.480 36 - 40
21 MIN AL Central 39.2 35.4 36.0 0.480 36 - 39
22 PHI NL East 35.5 38.1 36.2 0.476 36 - 40
23 BOS AL East 36.8 39.3 36.8 0.472 37 - 41
24 CLE AL Central 39.3 33.2 36.1 0.469 36 - 41
25 CHW AL Central 40.0 34.9 36.6 0.469 37 - 41
26 TEX AL West 36.2 34.9 34.4 0.453 34 - 42
27 HOU AL West 35.0 36.7 35.3 0.453 35 - 43
28 SDP NL West 29.3 42.6 34.5 0.442 34 - 44
29 TBR AL East 34.4 39.9 34.7 0.439 35 - 44
30 ARI NL West 38.4 33.0 34.7 0.434 35 - 45


Wednesday, June 18, 2014

Fixing Run Differential, Part 1

Run differential (total runs scored minus runs allowed) is often cited as a better indicator of overall team performance than actual winning percentage.  In fact, there is a statistical basis for this statement, and the resulting equation that approximates winning percentage is the Pythagorean Win-Loss formula (where RS is runs scored and RA is runs allowed):
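
Written out, the formula has the general form W% = RS^x / (RS^x + RA^x).  Here is a minimal Python sketch of it.  The function name and the `exponent` parameter are illustrative choices; the classic version uses an exponent of 2, while Baseball-Reference uses 1.83, so treat the default below as an assumption.

```python
def pythagorean_wins(rs, ra, games, exponent=2.0):
    """Expected wins from total runs scored (rs) and runs allowed (ra)."""
    win_pct = rs**exponent / (rs**exponent + ra**exponent)
    return win_pct * games


# 2012 Orioles: 712 runs scored, 705 runs allowed, 162 games
orioles = pythagorean_wins(712, 705, 162)
print(round(orioles))  # 82
```

With an exponent of 1.83, the same call gives about 81.7 wins, so either choice rounds to 82 wins for the Orioles.
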
In many cases, this formula does an excellent job of assessing the most probable record for a team, given a number of runs scored and runs allowed, especially if the run differential is small.  The reason for this is that close games (extra-inning games, 1-run games) are pretty close to a 50/50 coin flip, and the Pythagorean winning percentage for RS close to RA is approximately 50%.  For example, the 2012 Baltimore Orioles had only a +7 run differential (712 RS and 705 RA), but managed to finish with a 93-69 record.  Their overall record was most certainly helped by their 29-9 mark in 1-run games, the highest single-season winning percentage in 1-run games (0.763) in the Expansion Era (1961-present) by any team.  [Interestingly, the Orioles franchise has the three highest single-season winning percentages in 1-run games in this era: 1970 (0.727), 1981 (0.750), and 2012 (0.763).]  Based on their run differential, however, their projected record using the Pythagorean Win-Loss formula was 82-80.  If we just replace the 29-9 record in 1-run games with 19-19 (a more probable outcome statistically), the Orioles' record would have been 83-79, very close to the expected value.

The flaw in run differential as a metric, in my opinion, is that it often overestimates how good a team with a large run differential really is.  This is because it only takes into account overall run scoring, not the game-to-game distribution.  Let's take an extreme example in which a team scores 14 total runs over the course of 2 games.  If the team scores 0 in the first game and 14 in the second game, the expected number of wins is approximately 1, because the 0-run game is certainly a loss and the 14-run game is almost certainly a win.  However, if the team scores 7 runs in each game, the expected number of wins is much higher.  From 2010-2014, teams that scored exactly 7 runs won approximately 83% of the time.  Therefore, the expected number of wins over this two-game stretch is 1.66.

So if run distribution matters, what can we do about it?  It occurred to me that instead of using overall run scoring, I could develop a metric that uses game-by-game run scoring (and run prevention) to estimate wins.  First, we need to develop a table that gives winning percentage for each value of runs scored (or allowed) in a single game.  To do this, I used the Baseball-Reference.com Play Index "Situational Records" tool for all games since the beginning of the 2010 season.  Here is the breakdown for each case:

RS/game W-L%
0 0.000
1 0.100
2 0.248
3 0.392
4 0.559
5 0.652
6 0.755
7 0.829
8 0.873
9 0.907
10 0.937
11 0.970
12 0.981
13+ 0.993

In order to get the same table for runs allowed, we can simply take 1 minus the W-L% column above.  (So scoring 3 runs in a game gives you a 39.2% chance to win, whereas allowing exactly 3 runs in a game gives you a 60.8% chance to win.)  From this table, it is evident that run differential, at least on a per-game basis, has diminishing returns.  After about 6 runs scored, each additional run contributes less and less to the overall chance of winning.
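
Both points can be checked directly from the table.  A small Python sketch (the dictionary names are illustrative, and the key 13 stands in for the "13+" row):

```python
# Win percentage by runs scored in a single game, copied from the table above.
WIN_PCT_BY_RS = {
    0: 0.000, 1: 0.100, 2: 0.248, 3: 0.392, 4: 0.559, 5: 0.652, 6: 0.755,
    7: 0.829, 8: 0.873, 9: 0.907, 10: 0.937, 11: 0.970, 12: 0.981, 13: 0.993,
}

# Allowing n runs means the opponent scored n, so take 1 minus the W-L%.
WIN_PCT_BY_RA = {runs: 1 - pct for runs, pct in WIN_PCT_BY_RS.items()}

# Marginal value of the n-th run: how much it adds to the chance of winning.
marginal = {n: WIN_PCT_BY_RS[n] - WIN_PCT_BY_RS[n - 1] for n in range(1, 14)}

for n in range(1, 14):
    print(n, f"{WIN_PCT_BY_RA[n]:.3f}", f"{marginal[n]:+.3f}")
```

The biggest single jump is from 3 runs to 4 (+0.167); from the 7th run on, each run adds less than 0.08.
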

Using the table above, we can compute an expected win total based on run distribution.  Let's say that a team plays 6 games and scores 2, 5, 3, 8, 1, and 4 runs in those games.  Their expected win total for those games would be (0.248 + 0.652 + 0.392 + 0.873 + 0.100 + 0.559) = 2.824 wins.  In a similar way, we could compute the expected number of wins based on the number of runs allowed in those 6 games.  To get an overall expected number of wins (based on run scoring and prevention), we could take the average.
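
The same calculation as a minimal Python sketch.  The function names are illustrative, the key 13 stands in for the "13+" row, and the runs-allowed list is made up purely to show the averaging step:

```python
# Win percentage by runs scored in a single game, copied from the table above.
WIN_PCT_BY_RS = {
    0: 0.000, 1: 0.100, 2: 0.248, 3: 0.392, 4: 0.559, 5: 0.652, 6: 0.755,
    7: 0.829, 8: 0.873, 9: 0.907, 10: 0.937, 11: 0.970, 12: 0.981, 13: 0.993,
}


def wins_from_scoring(runs_scored):
    """Sum the single-game win percentages; 13 or more runs uses the 13+ row."""
    return sum(WIN_PCT_BY_RS[min(r, 13)] for r in runs_scored)


def wins_from_prevention(runs_allowed):
    """Same idea for runs allowed, using 1 minus the W-L% for each game."""
    return sum(1 - WIN_PCT_BY_RS[min(r, 13)] for r in runs_allowed)


def overall_expected_wins(runs_scored, runs_allowed):
    """Average of the scoring-based and prevention-based estimates."""
    return (wins_from_scoring(runs_scored) + wins_from_prevention(runs_allowed)) / 2


scored = [2, 5, 3, 8, 1, 4]
allowed = [3, 4, 1, 6, 2, 5]  # hypothetical values, not from any real team

print(round(wins_from_scoring(scored), 3))   # 2.824
print(round(wins_from_scoring([0, 14]), 3))  # 0.993
print(round(wins_from_scoring([7, 7]), 3))   # 1.658
print(round(overall_expected_wins(scored, allowed), 3))
```
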

Based on the methodology above, we can compute an expected record for the 2012 Orioles.  The number of wins based on run scoring is 83.9 and the number of wins based on run prevention is 81.5, for an overall average of 82.7 wins, which is very close to the Pythagorean win total (82).  This makes sense given the reasoning above regarding Pythagorean Win-Loss for small run differentials.

Now let's consider the 2014 Oakland Athletics, who have a 42-28 record and a +126 run differential (359 RS and 233 RA).  The Pythagorean Win-Loss formula estimates their record to be 48-22, or 6 games better than their actual record.  However, it is my conjecture that this estimate inflates the win total due to a large number of blowout wins (18 wins by at least 5 runs).  Based on my formula, the 2014 A's expected number of wins is 40.5.

In my next post, I will give expected records for all 30 teams based on this formula.  If you have any suggestions or comments about this post, feel free to leave them below.

Thursday, June 12, 2014

Alphabet Triplets

Most of you may know me as the mysterious "intern" of the now-defunct Baseball Today podcast.  I would often try to answer the "Ridiculous Question of the Day" by doing some research and writing small programs to sift through tons of play-by-play data.  One of the common questions I get is:  "What is the most difficult question that you answered?"  Well, it turns out that it was a question that never made it on to the podcast.  Frankly, it may have even been a little too ridiculous.

It all started with a question that was answered on the podcast.  A listener wanted to know: "When was the last time a starting lineup featured every letter of the alphabet at least once?"  On the podcast, Mark Simon called on me and talented wordsmith/blogger Diane Firstman of the VORG to answer the question.  We both provided answers that were discussed on the show, but it was a follow-up exchange on Twitter with a fan of the podcast that really intrigued me.  Since it happened (relatively) often that a 9-man lineup featured every letter of the alphabet, we came up with a more ridiculous version:

What is the minimum number of players (using any names in baseball history, not just lineups) needed to cover the entire alphabet?

Obviously, since a single team's lineup contained every letter, the answer is 9 or fewer.  I also realized that the answer was more than 1, since no player's name contains every letter of the alphabet.  In fact, the greatest number of unique letters in a single player name is 15, and this rare distinction belongs to only one player in the RetroSheet database.  That player is Washington Fulmer, who only appeared in a single game for the 1875 Brooklyn Atlantics.  It also seemed unlikely that it would only take two player names to get the entire alphabet.  For example, a search using the 11 letters that do not appear in Mr. Fulmer's name (b, c, d, j, k, p, q, v, x, y, z) yields zero results.  I quickly wrote a program to count the number of unique letters in each name in the database and I got results ranging from 3 letters (Al Hall and C.C. Lee) to 15 letters (Washington Fulmer), with the average being 8.38 unique letters.  Based on this, I figured the answer would be 3 or 4; it was just a matter of finding the right players.
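
Counting unique letters takes only a few lines.  This is a minimal sketch rather than the original program; the function name is illustrative, and only the three names mentioned above are used:

```python
from string import ascii_lowercase


def unique_letters(name):
    """Distinct letters in a name, ignoring case, spaces, and punctuation."""
    return {ch for ch in name.lower() if ch in ascii_lowercase}


for name in ["Washington Fulmer", "Al Hall", "C.C. Lee"]:
    print(name, len(unique_letters(name)))

# Letters that never appear in Washington Fulmer's name
missing = sorted(set(ascii_lowercase) - unique_letters("Washington Fulmer"))
print("".join(missing))  # bcdjkpqvxyz
```
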

After a decent amount of searching through names, I found that using 4 names was definitely possible.  I stumbled upon a trio of names - Felix Mackiewicz, Joseph Quinn, and Hy Vandenberg - that was only missing the letter "t."  Therefore, any player with the letter "t" in his name (a list that would include 6,711 names) would complete the alphabet.  Armed with this knowledge, I believed that I could find 3 players to cover the entire alphabet (an "Alphabet Triplet"), but it would take a lot of searching.  Once I found one, I thought, I still would not be satisfied.  I wanted to find them all.  How many of these Alphabet Triplets exist?
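
A minimal sketch of this kind of coverage check, assuming each name is packed into a 26-bit mask (an illustrative choice, not necessarily how the original program worked).  "Babe Ruth" is added only as an example of a name containing "t":

```python
def letter_mask(name):
    """Pack the letters of a name into a 26-bit integer (bit 0 = a, bit 25 = z)."""
    mask = 0
    for ch in name.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def missing_letters(names):
    """Letters of the alphabet that appear in none of the given names."""
    covered = 0
    for name in names:
        covered |= letter_mask(name)
    return [chr(ord("a") + bit) for bit in range(26) if not (covered >> bit) & 1]


trio = ["Felix Mackiewicz", "Joseph Quinn", "Hy Vandenberg"]
print(missing_letters(trio))                  # ['t']
print(missing_letters(trio + ["Babe Ruth"]))  # []
```

Combining names is a single bitwise OR per name, which keeps each check cheap when millions of combinations have to be tested.
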

If I was going to search for the elusive Alphabet Triplets, I certainly didn't want to use a brute-force approach and I certainly would never find them all searching manually, one-by-one.  In fact, given the number of player names in the directory (18,174), the number of possible combinations of any 3 players was over 1 trillion.  ONE TRILLION.  I didn't have that kind of time.  I wrote a small piece of code to select random three-player combinations and check the number of unique letters.  I then executed the code for a short period of time and I was able to get almost 44,000 combinations per second, which was good, but not good enough.  In fact, if I had to check every combination, it would take the program a full 263 days to complete!
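
The arithmetic behind those numbers, as a quick Python check (the constant names are illustrative; the inputs are the figures quoted above):

```python
import math

NUM_PLAYERS = 18_174
CHECKS_PER_SECOND = 44_000
SECONDS_PER_DAY = 86_400

# Number of ways to choose 3 players when order does not matter
triplets = math.comb(NUM_PLAYERS, 3)
days = triplets / CHECKS_PER_SECOND / SECONDS_PER_DAY

print(f"{triplets:,} triplets")  # 1,000,296,220,924 triplets
print(f"{days:.0f} days")        # 263 days
```
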

In an effort to reduce the overall number of combinations I needed to check, I developed a strategy of targeting the least common letters first.  Of the 18,174 names in the database, here are the 8 least common letters (by number of names with at least one instance of the letter):

Q:  147 players (0.8% of all players)
X:  323 players (1.8%)
Z:  1194 players (6.6%)
V:  1854 players (10.2%)
F:  2073 players (11.4%)
P:  2407 players (13.2%)
W:  2684 players (14.8%)
J:  3459 players (19.0%)

I decided to focus on the 4 rarest letters (Q, X, Z, and V) and to employ some mathematical tricks.  Given a group of 3 players and 4 letters, there are a limited number of ways to distribute the letters.  One possibility is that one player has all 4 letters, in which case the other 2 players could be any players.  However, a search for a player with Q, X, Z, and V returns no results, so we can eliminate this case.  The second possibility is that one player has 3 of the 4 letters, one player has the 4th letter, and the third player is any player.  Breaking it down for the individual letters gives the following triplet possibilities, with the number of players in each list in parentheses:

{QXZ player (0), V player (1854), any player (18174)}
{QXV player (0), Z player (1194), any player (18174)}
{QZV player (13), X player (323), any player (18174)}
{XZV player (1), Q player (147), any player (18174)}

The first two possibilities can be eliminated since there are no QXZ players and no QXV players.  There are, however, 13 QZV players (including one of my all-time favorites, Omar Vizquel) and 1 XZV player (Xavier Hernandez).

In a similar way, one player can have 2 of the 4 letters, another player can have the other 2, and the third can be any player:

{QX player (2), ZV player (148), any player (18174)}
{QZ player (32), XV player (29), any player (18174)}
{QV player (17), XZ player (24), any player (18174)}

Finally, the last possibility is that one player has 2 of the 4 letters and the other 2 players each have one of the remaining two letters, giving these possible triplets:

{QX player (2), Z player (1194), V player (1854)}
{QZ player (32), X player (323), V player (1854)}
{QV player (17), X player (323), Z player (1194)}
{XZ player (24), Q player (147), V player (1854)}
{XV player (29), Q player (147), Z player (1194)}
{ZV player (148), Q player (147), X player (323)}
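
Multiplying out the list sizes above gives the size of the reduced search space.  A quick sketch (the variable names are illustrative, and 44,000 checks per second is the rate measured earlier):

```python
ANY, Q, X, Z, V = 18_174, 147, 323, 1_194, 1_854

candidate_lists = [
    # One player has 3 of the 4 letters, a second has the 4th, the third is anyone
    (0, V, ANY),    # QXZ player, V player
    (0, Z, ANY),    # QXV player, Z player
    (13, X, ANY),   # QZV player, X player
    (1, Q, ANY),    # XZV player, Q player
    # Two players have 2 letters each, the third is anyone
    (2, 148, ANY),  # QX player, ZV player
    (32, 29, ANY),  # QZ player, XV player
    (17, 24, ANY),  # QV player, XZ player
    # One player has 2 letters, the other two have 1 each
    (2, Z, V),      # QX player
    (32, X, V),     # QZ player
    (17, X, Z),     # QV player
    (24, Q, V),     # XZ player
    (29, Q, Z),     # XV player
    (148, Q, X),    # ZV player
]

total = sum(a * b * c for a, b, c in candidate_lists)
hours = total / 44_000 / 3_600

print(f"{total:,} combinations, about {hours:.1f} hours")
```

Because the lists overlap, some triplets are counted more than once, so this total is an upper bound on the work rather than an exact count of distinct triplets.
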

Combining all of these lists gives approximately 157 million combinations.  Given that the program I wrote can check about 44,000 every second, this should take only 1 hour!  That's much better than having to wait 263 days.  I let the program run, and when I came back, I had my complete list of 20 Alphabet Triplets:

Anthony Vasquez, Paxton Crawford, Jack Billingham
Esmerling Vasquez, Paxton Crawford, Johnny Blatnik
Esmerling Vasquez, Paxton Crawford, Jerry Buchek
Esmerling Vasquez, Paxton Crawford, John Buckley
Esmerling Vasquez, Paxton Crawford, Johnny Grabowski
Esmerling Vasquez, Paxton Crawford, Herby Jackson
Esmerling Vasquez, Paxton Crawford, John Kirby
Esmerling Vasquez, Paxton Crawford, Johnny Kucab
Jorge Vasquez, Paxton Crawford, Harry Kimberlin
Jorge Vasquez, Felix Doubront, Matthew Cepicky
Guillermo Velasquez, Paxton Crawford, Johnny Blatnik
Guillermo Velasquez, Paxton Crawford, Jerry Buchek
Guillermo Velasquez, Paxton Crawford, John Buckley
Guillermo Velasquez, Paxton Crawford, Johnny Grabowski
Guillermo Velasquez, Paxton Crawford, Herby Jackson
Guillermo Velasquez, Paxton Crawford, John Kirby
Guillermo Velasquez, Paxton Crawford, Johnny Kucab
Omar Vizquel, Paxton Crawford, Johnny Grabowski
Mox McQuery, Steve Filipowicz, John Brackenridge
Jeffrey Marquez, Alex Garbowski, Don Pavletich

(Note:  When I first ran the program, I got 23 results, but that list included managers and umpires in the RetroSheet database who never actually played.)
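
For anyone curious how a targeted search like this might look in code, here is a minimal sketch.  It uses a rarest-missing-letter-first recursion, which is in the same spirit as the strategy above but is not the case-by-case enumeration described there, and it runs on a toy list of eight names taken from this post rather than the full RetroSheet database.  The function names are illustrative.

```python
FULL = (1 << 26) - 1  # all 26 letter bits set


def letter_mask(name):
    """Pack the letters of a name into a 26-bit integer (bit 0 = a)."""
    mask = 0
    for ch in name.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def find_covers(names, size=3):
    """Find groups of at most `size` names that together use every letter."""
    masks = [letter_mask(n) for n in names]
    # How many names contain each letter; used to pick the rarest one first.
    freq = [sum((m >> bit) & 1 for m in masks) for bit in range(26)]
    found = set()

    def extend(chosen, covered):
        if covered == FULL:
            found.add(tuple(sorted(names[i] for i in chosen)))
            return
        if len(chosen) == size:
            return
        missing = [bit for bit in range(26) if not (covered >> bit) & 1]
        rarest = min(missing, key=lambda bit: freq[bit])
        # Only names containing the rarest missing letter can help here.
        for i, m in enumerate(masks):
            if (m >> rarest) & 1:
                extend(chosen + [i], covered | m)

    extend([], 0)
    return sorted(found)


toy_names = [
    "Anthony Vasquez", "Paxton Crawford", "Jack Billingham", "Omar Vizquel",
    "Johnny Grabowski", "Washington Fulmer", "Al Hall", "Joseph Quinn",
]
for triplet in find_covers(toy_names):
    print(", ".join(triplet))
```

On this toy list the search returns two triplets, both of which appear in the full list above (with the names inside each triplet sorted alphabetically).
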

One final note:
Paxton Crawford is definitely the MVP of the Alphabet Triplets, as he appears in 17 of the 20 listed above.  Interestingly, he is the only player in history with a name that includes X, F, W and P (4 of the 7 rarest letters).

Now that's ridiculous!

Thursday, June 5, 2014

Appetite for Distraction

As often happens, I was using the Baseball Reference Play Index for one particular thing and I ended up somewhere else.  Based on the early season success of the Marlins hitters with runners in scoring position, especially at home, some people (perhaps those wearing tinfoil hats) accused them of stealing signs.  In order to check the validity of this statement, I checked each team's batting splits with a runner on 2nd base, since this is the most obvious situation that would allow the runner to steal signs.  When I saw the results, I thought I was on to something.  The Marlins have a team batting average of .301 with a runner on second (and no other runners), compared to a .260 overall average.  This difference is the largest for any team in the majors in 2014.  When I did a search for greatest difference with runners on 2nd and 3rd, I was expecting similar results.  However, I noticed that the Marlins are only hitting .192 in this situation, which is 68 points lower than their overall average (6th worst difference in MLB).  My thesis was shattered, but at least I didn't prove the conspiracy theorists correct.  It seems as if taking any of these splits seriously over such a small sample size can lead to poor conclusions...

After this search, I wanted to investigate larger sample sizes to see if I could find anything interesting.  I no longer wanted to look at particular teams, but rather at MLB as a whole.  Do batters hit better in particular situations?  I used the Play Index to check league-wide batting average since 1960 (more than 50 years of data, a huge sample size) for each different combination of bases occupied.

Description             State  BA
No runners on           000    0.256
Runner on 1st           100    0.276
Runner on 2nd           020    0.247
Runner on 3rd           003    0.277
Runners on 1st and 2nd  120    0.254
Runners on 1st and 3rd  103    0.294
Runners on 2nd and 3rd  023    0.270
Bases Loaded            123    0.279
All states              ---    0.261

Intuitively, having more runners on base should lead to a higher batting average.  Why?  Well, mostly due to selection bias.  Simply put, selecting states in which a pitcher may be struggling (more runners on base) should result in better performance by the batter, even if the batter isn't actually any better.  In addition, based on traditional lineup construction, good hitters should get more opportunities with runners on base than average or poor hitters.  For the most part, the table above supports this argument.  With no runners on base, the batting average is .256, and if we combine the other 7 states, the batting average with at least one runner on base is .268, a 12 point increase.  However, breaking down the individual states paints a different picture.  If more runners correlate with a higher batting average, then why is the batting average with runners on 1st and 3rd (.294) significantly better than the batting average with the bases loaded (.279)?

As it turns out, adding a runner to second base always decreases batting average.  To see this effect, we can arrange the 8 baserunner combinations into pairs to isolate the effect.  Each pair includes an initial state without a runner on 2nd base and the same state with a runner on second base.  In all four cases, the batting average is worse with the extra runner on 2nd.

Initial State               BA     Added runner on 2nd         BA     Difference
No runners on 000           0.256  Runner on 2nd 020           0.247  -0.009
Runner on 1st 100           0.276  Runners on 1st and 2nd 120  0.254  -0.022
Runner on 3rd 003           0.277  Runners on 2nd and 3rd 023  0.270  -0.007
Runners on 1st and 3rd 103  0.294  Bases Loaded 123            0.279  -0.015

Conversely, adding a runner to either 1st base or 3rd base always increases batting average.  Similar to the table above, we can isolate the effect in each case.  Adding a runner to first base increases batting average by 7 to 20 points:

Initial State               BA     Added runner on 1st         BA     Difference
No runners on 000           0.256  Runner on 1st 100           0.276  0.020
Runner on 2nd 020           0.247  Runners on 1st and 2nd 120  0.254  0.007
Runner on 3rd 003           0.277  Runners on 1st and 3rd 103  0.294  0.017
Runners on 2nd and 3rd 023  0.270  Bases Loaded 123            0.279  0.009

And adding a runner to 3rd base increases batting average by 18 to 25 points:

Initial State               BA     Added runner on 3rd         BA     Difference
No runners on 000           0.256  Runner on 3rd 003           0.277  0.021
Runner on 1st 100           0.276  Runners on 1st and 3rd 103  0.294  0.018
Runner on 2nd 020           0.247  Runners on 2nd and 3rd 023  0.270  0.023
Runners on 1st and 2nd 120  0.254  Bases Loaded 123            0.279  0.025
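
All three difference tables can be reproduced from the eight league-wide averages.  A minimal Python sketch, assuming each base state is encoded as a tuple of 0/1 flags for 1st, 2nd, and 3rd (an illustrative encoding, slightly different from the 000/120/103 codes used in the tables):

```python
# League-wide batting average by base state: (runner on 1st, on 2nd, on 3rd)
BA = {
    (0, 0, 0): 0.256, (1, 0, 0): 0.276, (0, 1, 0): 0.247, (0, 0, 1): 0.277,
    (1, 1, 0): 0.254, (1, 0, 1): 0.294, (0, 1, 1): 0.270, (1, 1, 1): 0.279,
}


def effect_of_adding(base):
    """BA change in points from adding a runner on `base` (0=1st, 1=2nd, 2=3rd)."""
    changes = {}
    for state, avg in BA.items():
        if state[base] == 0:
            with_runner = state[:base] + (1,) + state[base + 1:]
            changes[state] = round((BA[with_runner] - avg) * 1000)
    return changes


for base, label in enumerate(["1st", "2nd", "3rd"]):
    points = sorted(effect_of_adding(base).values())
    print(f"Adding a runner on {label}: {points[0]:+d} to {points[-1]:+d} points")
```
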

The point with these last two tables is not to prove, somehow, that a batter suddenly becomes a better hitter with runners on base.  As discussed earlier, a large part of this increase is probably due to selection bias.  But, it does seem logical that adding a runner to second base should also exhibit this effect, and the magnitude of the increase should be somewhere in between the effects shown for adding a runner to 1st and adding a runner to 3rd (maybe about 10 to 15 points in batting average).  In reality, the effect is quite the opposite (7 to 22 point decrease).

Is there any logical explanation for the data?  Conservatively, hitters are about 20 to 25 points worse in terms of batting average with a runner on 2nd than what we might expect.  I believe that this effect is almost entirely due to the distraction of having a runner directly in the batter's line of sight.  While the batter is trying to concentrate on the pitcher's delivery, release, and the trajectory and spin of the ball, he has to cope with a teammate dancing around next to second base behind the pitcher.  Even if the batter tries to "block out" the runner, it almost certainly has an effect that cannot be muted.  The effect seems to be worse if a runner is on 2nd base and 3rd base is open, as the two worst batting averages are with a runner on 2nd (.247) and with runners on 1st and 2nd (.254).  In these cases, the baserunner at 2nd probably moves around more since he is the lead runner and can possibly steal 3rd or take off early to score on a single.  Is there anything that can be done to mitigate this "distraction" effect?  Maybe if you're David Ortiz and you just hit a leadoff double, you should think about taking a lead and then standing still until the ball is hit.