Monte Carlo: How Much Does Luck Move a Season's Record?

FoundationsIntermediatePython~5 min read

What you'll build

Thousands of simulated 162-game seasons for two real teams, showing how far luck alone swings a record.

Thousands of simulated 162-game seasons for two real teams, showing how far luck alone swings a record.
Data: Bundled sample (2023 MLB standings) + simulation, retrieved June 2026

Here's a fact that should bother you a little: two teams with the exact same true talent can finish a season ten games apart in the standings, and neither did anything differently. That gap is luck. The way you measure it is Monte Carlo simulation — run the same random process thousands of times and look at the range of what comes out. It's the engine inside every playoff-odds model, and once you've built one, regression to the mean stops being a slogan and becomes arithmetic.

This builds on Summary Statistics and Distributions. The data is the bundled sample_standings.csv (real 2023 MLB standings), so it runs offline.

  1. One season is a string of coin flips

    Treat a team's winning percentage as its "true talent" — the probability it wins any given game. A 162-game season is then 162 weighted coin flips, and the win total is a draw from a binomial distribution. A fixed seed makes the whole thing reproducible.

    python
    import numpy as np
    import pandas as pd
    
    df = pd.read_csv("sample_standings.csv")
    rng = np.random.default_rng(2026)      # fixed seed -> same result every run
    
    best = df.loc[df["WinPct"].idxmax()]                       # the best team
    avg  = df.loc[(df["WinPct"] - 0.5).abs().idxmin()]         # a ~.500 team
    
    def simulate(p, games=162, trials=20000):
        return rng.binomial(games, p, size=trials)             # one number = one season's wins

    That's the whole model: rng.binomial(162, p, 20000) plays twenty thousand seasons for a team of talent p and hands back twenty thousand final win totals. No loop needed — numpy vectorizes it.

  2. Read the spread, not the average

    The mean of the simulations just recovers the talent you put in (162 × p). The interesting part is the spread — the 5th-to-95th-percentile range tells you how far a record can drift on luck alone.

    python
    for team, row in ((best["Team"], best), (avg["Team"], avg)):
        wins = simulate(row["WinPct"])
        lo, hi = np.percentile(wins, [5, 95])
        print(team, round(wins.mean(),1), "W,  90% range", int(lo), "-", int(hi))
    True talent in, a range of records out
    Braves     true talent 0.642  ->  mean 104.0 W, std  6.1, 90% range 94-114 (20 wins wide)
    Padres     true talent 0.506  ->  mean  82.0 W, std  6.4, 90% range 72-92 (20 wins wide)
    
    Same true talent, ~20-win spread: that gap is pure luck.

    Look at the width: even with talent held perfectly fixed, the 90% range is about 20 wins wide. A team can be a true 90-win club and finish anywhere from the low 80s to 100 without the underlying quality changing at all. That is why a hot or cold record over a partial season tells you much less than it feels like it should.

  3. Picture the luck

    Histogram the simulated win totals for both teams on the same axis and the overlap makes the point no table can: the tails of a good team and an average team reach into each other's territory.

    python
    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots(figsize=(9, 5.5))
    for team, row, color in ((best["Team"], best, "#B23A3A"), (avg["Team"], avg, "#2C5E8A")):
        wins = simulate(row["WinPct"])
        ax.hist(wins, bins=range(60, 120), alpha=0.6, color=color, label=team)
        ax.axvline(wins.mean(), color=color, ls="--")
    ax.set_xlabel("simulated wins"); ax.legend()
    fig.savefig("season_spread.png", dpi=144, bbox_inches="tight")
    Two overlapping histograms of simulated 162-game win totals: a strong team centered around 104 wins and a .500 team around 81, each roughly 20 wins wide at the 90% range, with dashed mean lines
    Data: Bundled sample (2023 MLB standings) + simulation, retrieved June 2026

    This same loop scales to anything you can assign a probability: simulate a full league of teams and count how often each makes the playoffs, and you've built a playoff-odds model. The honest version always reports the distribution, never a single confident number — because, as the chart shows, the single number was never the whole story.

Troubleshooting

My numbers change every run

You didn't seed the generator. Create it once with rng = np.random.default_rng(SEED) and reuse that rng; a fixed seed makes a simulation reproducible, which you want for a published chart.

Isn't winning percentage already polluted by luck?

Yes — using a team's actual record as its "true talent" overstates the talent of lucky teams. It's fine for showing the size of season-level variance, but a real projection would start from a regressed talent estimate (or a Pythagorean expectation), not the raw record.

How many trials do I need?

Enough that the answer stops moving. A few thousand is plenty for a mean; for tail probabilities (e.g. "odds of 100+ wins") push to tens of thousands. Re-run with different seeds and watch how much the estimate wobbles — that wobble is your Monte Carlo error.

Challenge yourself

Turn this into a one-number "luck" report: for every team in the file, simulate the season and compute the probability it wins at least as many games as its real record implies. Then build the playoff-odds version — assign each team a talent, simulate the whole league many times, and report how often each finishes in the top tier. Compare a seed-2026 run against seed-1 to see your Monte Carlo error.

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

Download the finished script (62_monte_carlo_season_simulation.py)

This script imports a small shared helper (and reads any bundled sample data) that live next to it in /downloads/ — grab these into the same folder so it runs as-is: sdt_common.py.

More Foundations tutorials