Build an xG League Table from Understat Data

Soccer · Intermediate · Python · ~8 min read

What you'll build

A league table re-ranked by expected goals instead of actual results.

Data: Understat, retrieved June 2026

Goals lie. A team can score from a wild deflection or get robbed by a great save, and the scoreline won't tell you which team actually played better. Expected goals - xG - is the stat that cuts through the noise. In this tutorial you'll pull a full Premier League season from Understat, roll it up into a league table ranked by xG, and then set each team's xG against the goals it really scored. That gap exposes who finished clinically and who left goals on the pitch.

Before that, a quick definition. xG measures the quality of the chances a team created. Every shot is assigned a probability of becoming a goal based on where it was taken, how, and in what situation - a tap-in might be 0.8 xG, a speculative 30-yard effort 0.03. Add up all of a team's shots and you get the number of goals an average team "should" have scored from those chances. Compare that to reality and you've measured finishing.
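
To make the arithmetic concrete, here is a minimal sketch that adds per-shot probabilities into a team total and compares it with goals scored. The shot values and the goal count are made up for illustration (only the 0.8 tap-in and the 0.03 long-range effort echo the examples above), and finishing_diff is a hypothetical helper name, not part of any library.

```python
# Hypothetical per-shot xG values for one team in one match (illustrative only).
shots_xg = [0.8, 0.03, 0.12, 0.05]
goals_scored = 2


def finishing_diff(goals, shot_probabilities):
    """Return actual goals minus the sum of shot probabilities (the team's xG)."""
    return goals - sum(shot_probabilities)


team_xg = sum(shots_xg)
diff = finishing_diff(goals_scored, shots_xg)
print(f"xG: {team_xg:.2f}, goals: {goals_scored}, diff: {diff:+.2f}")
```

A positive diff means the team scored more than its chances were worth; a negative one means it scored fewer.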

This builds on your first StatsBomb pull - same idea of loading real event-derived data, different source. Let's get the data the right way, because most guides online get this part wrong.

  1. Get the data the modern way (a POST request)

    Here's the thing almost every older tutorial gets wrong. For years the trick to scraping Understat was to download the page HTML and yank a JSON variable out of a <script> tag with a regular expression. That no longer works - Understat changed how its pages load. If you've been following a guide that does re.search on the page source, that's why it's failing.

    The site now serves this data from a dedicated POST endpoint, the very same one the page itself calls in the background. We just ask it directly, exactly the way the browser does: a POST to getPlayersStats with the league and season as form data, plus the X-Requested-With header that marks it as a background request.

    python
    import json
    import pandas as pd
    import requests
    
    LEAGUE, SEASON = "EPL", "2023"   # Understat's 2023 = the 2023/24 season
    
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://understat.com/",
        "X-Requested-With": "XMLHttpRequest",
    })
    resp = session.post("https://understat.com/main/getPlayersStats/",
                        data={"league": LEAGUE, "season": SEASON}, timeout=30)
    resp.raise_for_status()
    players = resp.json()["players"]

    One subtlety worth knowing: Understat labels seasons by their starting year, so "2023" means the 2023/24 campaign. The response is clean JSON, and ["players"] pulls out the list of player-season records. No HTML parsing, no fragile regex - just an API call.
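
Because the start-year convention is easy to misread in chart titles and file names, a small formatter can help. This is only a sketch: season_label is a hypothetical helper introduced here, and it assumes the season string is a four-digit start year like the "2023" used above.

```python
def season_label(start_year: str) -> str:
    """Turn a start-year string such as "2023" into a label such as "2023/24"."""
    year = int(start_year)
    # Two-digit end year, wrapping correctly across a century (1999 -> "1999/00").
    return f"{year}/{(year + 1) % 100:02d}"


print(season_label("2023"))
```

You could then build the chart title from season_label(SEASON) instead of hard-coding the season.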

  2. Load it and fix the data types

    Drop the records into a DataFrame and you'll hit the classic web-data gotcha: the numbers arrive as strings. JSON from a web endpoint loves to send "27" instead of 27, and if you try to sum strings you'll get nonsense (or an error). So we cast the columns we care about - goals and the two xG flavors - to real numbers with pd.to_numeric.

    python
    df = pd.DataFrame(players)
    for col in ["goals", "xG", "npxG"]:
        df[col] = pd.to_numeric(df[col])
    
    print(f"{len(df)} players pulled. Top xG individuals:")
    top = df.sort_values("xG", ascending=False).head(5)
    print(top[["player_name", "team_title", "goals", "xG"]])
    
    Output (the top xG players in the league):
    570 players pulled. Top xG individuals:
            player_name        team_title  goals         xG
    0    Erling Haaland   Manchester City     27  31.653997
    2    Alexander Isak  Newcastle United     21  22.074266
    6     Mohamed Salah         Liverpool     18  21.941274
    3   Dominic Solanke       Bournemouth     19  21.406831
    12  Nicolas Jackson           Chelsea     14  19.860594

    That's a great sanity check. Erling Haaland tops the chart with 31.7 xG, and he scored 27 - the league's most dangerous chance-getter, as you'd expect. Notice Mohamed Salah at 21.9 xG but only 18 goals: a first hint that Liverpool's finishing didn't keep pace with their chances. Hold that thought.
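
The same goals-minus-xG subtraction that drives the team table later can already be tried on these five players. The sketch below retypes the rows printed above, with xG rounded to one decimal place; the diff column name is an illustrative choice.

```python
import pandas as pd

# The five rows printed above, with xG rounded to one decimal place.
top = pd.DataFrame({
    "player_name": ["Erling Haaland", "Alexander Isak", "Mohamed Salah",
                    "Dominic Solanke", "Nicolas Jackson"],
    "goals": [27, 21, 18, 19, 14],
    "xG": [31.7, 22.1, 21.9, 21.4, 19.9],
})

# Negative diff = fewer goals than the chances were worth.
top["diff"] = (top["goals"] - top["xG"]).round(1)
print(top.sort_values("diff").to_string(index=False))
```

All five come out negative, so even the league's busiest shooters scored slightly fewer goals than their chances suggested.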

  3. Keep only single-club players

    There's a data-quality wrinkle to handle before we total up teams. A handful of players transferred mid-season, and Understat stores their season under a combined label like "ClubA,ClubB" - their goals and xG lumped across both clubs with no clean way to split them apart. If we left those rows in, we'd get phantom 21st and 22nd "teams" in our table.

    Because it's only a few players, the cleanest fix is to keep players who stayed at one club all season. We spot the combined labels by the comma and filter them out.

    python
    single_club = df[~df["team_title"].str.contains(",")].copy()
    single_club = single_club.rename(columns={"team_title": "team"})

    The ~ means "not," so this keeps every row whose team name does not contain a comma. Yes, we lose a sliver of league goals and xG this way - everything those transferred players contributed - but in exchange we get an honest, clean 20-team table. That's a trade worth making, and the kind of judgment call real data work is full of. The caveat belongs on the chart itself, as a footnote or caption.
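
Before accepting that trade, it is worth measuring how much the filter actually drops. The sketch below does so on toy rows; the player names, clubs and numbers are made up, and only the "ClubA,ClubB" label format follows the description above.

```python
import pandas as pd

# Toy rows with made-up players and numbers; "ClubA,ClubB" mimics a mid-season transfer.
df = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "team_title": ["ClubA", "ClubA,ClubB", "ClubB"],
    "xG": [10.0, 2.5, 7.5],
})

# regex=False treats the comma as a literal character rather than a pattern.
is_multi_club = df["team_title"].str.contains(",", regex=False)
single_club = df[~is_multi_club]

dropped_share = df.loc[is_multi_club, "xG"].sum() / df["xG"].sum()
print(f"Dropped {is_multi_club.sum()} player(s), {dropped_share:.1%} of total xG")
```

Run against the real DataFrame, the same two lines tell you exactly how large the "sliver" is for your season.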

  4. Roll players up into teams

    Now the satisfying part. We groupby("team") to gather every player onto their club, then sum their goals and xG into team totals. The extra Diff column - actual goals minus expected - is the whole story in one number: positive means a team scored more than their chances deserved (clinical finishing), negative means they wasted chances.

    python
    table = (single_club.groupby("team")
             .agg(Goals=("goals", "sum"), xG=("xG", "sum"))
             .round(1))
    table["Diff"] = (table["Goals"] - table["xG"]).round(1)   # + = clinical finishing
    table = table.sort_values("xG", ascending=False)
    
    out = table.reset_index()
    out.index = range(1, len(out) + 1)
    print(out.to_string())
    
    Output (the xG league table, 2023/24):
                           team  Goals    xG  Diff
    1                 Liverpool     80  97.1 -17.1
    2           Manchester City     94  89.4   4.6
    3          Newcastle United     83  86.6  -3.6
    4                   Arsenal     85  85.4  -0.4
    5               Aston Villa     72  69.5   2.5
    6               Bournemouth     52  64.7 -12.7
    7                  Brighton     50  63.0 -13.0
    8                 Tottenham     64  61.9   2.1
    9                   Chelsea     53  61.5  -8.5
    10                  Everton     40  61.3 -21.3
    11        Manchester United     57  60.6  -3.6
    12                Brentford     48  58.7 -10.7
    13                 West Ham     58  55.3   2.7
    14           Crystal Palace     56  54.3   1.7
    15        Nottingham Forest     49  52.2  -3.2
    16  Wolverhampton Wanderers     47  51.1  -4.1
    17                    Luton     49  50.9  -1.9
    18                   Fulham     49  48.3   0.7
    19                  Burnley     40  44.0  -4.0
    20         Sheffield United     24  33.6  -9.6

    Read down the Diff column and the season comes alive. Liverpool created the most chances in the league - 97.1 xG, more than anyone - but scored only 80, a brutal −17.1. They badly underperformed what they generated. Meanwhile Manchester City created less (89.4 xG) yet scored 94, a +4.6 - finishing their chances with ruthless efficiency. That gap is the difference between dominating the ball and dominating the scoreboard.
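
Rather than scanning the Diff column by eye, you can ask pandas for the extremes. The sketch below uses four rows copied from the table above; on the full table the same two calls pick out the same two clubs.

```python
import pandas as pd

# Four rows copied from the table above.
table = pd.DataFrame(
    {"Goals": [80, 94, 40, 58], "xG": [97.1, 89.4, 61.3, 55.3]},
    index=["Liverpool", "Manchester City", "Everton", "West Ham"],
)
table["Diff"] = (table["Goals"] - table["xG"]).round(1)

# idxmin/idxmax return the index label (the team) of the smallest/largest Diff.
print("Biggest underperformer:", table["Diff"].idxmin())
print("Biggest overperformer:", table["Diff"].idxmax())
```

Liverpool's −17.1 is striking because of how much they created, but Everton's −21.3 is the largest shortfall in the table.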

  5. Chart expected against actual

    A table makes you do the subtraction in your head. A chart makes the gap jump off the page. We'll draw each team's xG as a horizontal bar and overlay a dot for the goals they actually scored. When the dot sits to the right of the bar's end, the team overperformed; to the left, they underperformed. We sort ascending so the biggest xG sits at the top.

    python
    import matplotlib.pyplot as plt
    
    plot_df = table.sort_values("xG")
    fig, ax = plt.subplots(figsize=(8.4, 7.4))
    ax.barh(plot_df.index, plot_df["xG"], color="#3A7D44", alpha=0.85,
            label="xG (chances created)")
    ax.scatter(plot_df["Goals"], plot_df.index, color="#20242B", zorder=5, s=42,
               label="goals actually scored")
    ax.set_title("EPL 2023/24: expected vs actual goals")
    ax.set_xlabel("goals")
    ax.legend(loc="lower right", fontsize=9, frameon=False)
    fig.savefig("xg_table.png", dpi=144, bbox_inches="tight")
    
    [Figure: horizontal bar chart of Premier League teams' xG with dots for actual goals, 2023/24 season. Data: Understat, retrieved June 2026]

    Find Liverpool at the top: the longest bar in the league, but its goals-dot sits well to the left - all that chance creation, not enough end product. Then find Manchester City just below, dot sitting to the right of its bar. The picture explains a whole season in one glance, and it's why xG has become the analyst's favorite lens on football.
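
Since the table leaves out players who changed clubs mid-season, one way to put that caveat on the chart is a small footnote. This is a sketch under assumptions rather than the tutorial's own chart code: it redraws just two teams from the table, the footnote wording is illustrative, and the Agg backend is selected only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

teams = ["Manchester City", "Liverpool"]  # the last item is drawn at the top
fig, ax = plt.subplots(figsize=(8.4, 2.5))
ax.barh(teams, [89.4, 97.1], color="#3A7D44", alpha=0.85)
ax.scatter([94, 80], teams, color="#20242B", zorder=5, s=42)

# fig.text uses figure coordinates: (0, 0) is the bottom-left corner of the figure.
note = fig.text(0.01, -0.02,
                "Excludes players who changed clubs mid-season. Data: Understat.",
                fontsize=8, ha="left", va="top")
```

The note sits just below the figure edge, so save with bbox_inches="tight", as the savefig call in this step already does, to keep it inside the exported image.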

Troubleshooting

My old code that scrapes a variable from the HTML returns nothing

That approach is dead - Understat no longer embeds the data as an inline JSON variable, so the regex finds nothing. Use the POST endpoint shown in step 1 instead. If a tutorial tells you to re.search the page source for playersData, it's out of date.

A 403 Forbidden or empty response from the POST

The server expects the request to look like it came from the site. Make sure you send both the Referer: https://understat.com/ header and X-Requested-With: XMLHttpRequest, plus a normal User-Agent. Missing any of these is the usual cause.
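
A quick way to rule this out is to check the headers before sending. The sketch below is a hypothetical helper (missing_headers and REQUIRED_HEADERS are names introduced here); the list simply mirrors the three headers named above and does not verify which of them the server strictly requires.

```python
# The three headers named above; an assumption about what the server checks.
REQUIRED_HEADERS = ("User-Agent", "Referer", "X-Requested-With")


def missing_headers(headers):
    """Return the required header names absent from `headers` (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in REQUIRED_HEADERS if name.lower() not in present]


headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://understat.com/"}
print(missing_headers(headers))
```

Passing session.headers from step 1 should return an empty list; anything else names the header to add.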

TypeError when summing, or wildly wrong totals

You forgot to cast the columns. Straight from JSON, goals and xG are strings, so summing them concatenates text instead of adding numbers. Run pd.to_numeric on every numeric column before you group, exactly as in step 2.
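
To see the failure mode in isolation, here is a minimal sketch with two made-up rows whose goals are stored as strings. The column is given dtype=object explicitly so that it holds plain Python strings regardless of pandas version.

```python
import pandas as pd

# Two made-up rows with goals stored as strings instead of numbers.
goals_raw = pd.Series(["27", "4"], dtype=object)

as_text = goals_raw.sum()                    # Python string addition: concatenates
as_number = pd.to_numeric(goals_raw).sum()   # numeric addition after casting
print(repr(as_text), as_number)
```

The first total is the text '274', the second is the number 31, which is why the cast has to happen before any grouping.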

Challenge yourself

Swap xG for npxG - non-penalty xG - and rebuild the table. Penalties are high-probability chances, converted roughly three times in four, so a team that wins a lot of them gets an inflated expected total; stripping them out gives a cleaner read on open-play chance creation. Does anyone's ranking move? Then go bolder: change LEAGUE to "La_liga" or "Bundesliga" and rerun the whole pipeline - the endpoint speaks every league Understat covers, so your table travels for free.
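
One way to make the swap painless is to pass the metric as a parameter. The sketch below is a hypothetical refactor of step 4 (build_table and the Expected column name are introduced here), run on made-up rows in which one club's xG is padded by penalties.

```python
import pandas as pd


def build_table(players: pd.DataFrame, metric: str = "xG") -> pd.DataFrame:
    """Sum goals and the chosen expected-goals column per team, ranked by that column."""
    table = (players.groupby("team")
             .agg(Goals=("goals", "sum"), Expected=(metric, "sum"))
             .round(1))
    table["Diff"] = (table["Goals"] - table["Expected"]).round(1)
    return table.sort_values("Expected", ascending=False)


# Made-up rows: ClubA's xG includes penalties, so its npxG is noticeably lower.
players = pd.DataFrame({
    "team": ["ClubA", "ClubA", "ClubB"],
    "goals": [10, 5, 12],
    "xG": [9.0, 6.0, 13.0],
    "npxG": [6.0, 5.0, 12.5],
})
print(build_table(players, "xG"))
print(build_table(players, "npxG"))
```

In this toy data the two clubs swap places once penalties are removed. Note that Goals here still counts penalty goals, so with npxG the Diff column mixes two definitions; for a like-for-like comparison you would want a non-penalty goals column as well.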

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

Download the finished script (13_build_an_xg_league_table_from_understat.py)

This script imports a small shared helper, sdt_common.py, that lives next to it in /downloads/ (along with any bundled sample data it reads). Download these into the same folder as the script so it runs as-is.
