Cleaning Messy Sports Data: Real-World Fixes for Real-World Files

FoundationsIntermediatePython~8 min read

What you'll build

A tidy, correctly-typed DataFrame rescued from a deliberately messy CSV.

A tidy, correctly-typed DataFrame rescued from a deliberately messy CSV.
Data: Bundled sample (deliberately messy team stats), retrieved June 2026

Nobody puts this on a course syllabus, and it's where you'll spend half your time: cleaning data. The polished tutorials hand you a perfect file and skip straight to the fun part. Real files don't cooperate. A scraped table, a copied spreadsheet, an export from some ancient system - they show up with stray spaces, percent signs glued to numbers, thousands separators, blank cells, and a duplicated row or two hiding in the middle. Learn to fix that calmly and you can work with almost any data. Skip it and your beautiful analysis quietly runs on garbage.

So we're going to take a deliberately messy file - messy_team_stats.csv, real 2023 MLB numbers from the MLB Stats API that I roughed up on purpose to look like the exports you'll actually meet - and walk it from unusable text to a tidy, correctly-typed table. One defect at a time, and I'll prove each fix rather than trusting it. The discipline I want you to steal from this is the order: diagnose first, fix second, verify third. Guessing at fixes blindly is exactly how clean data turns into wrong data without anyone noticing. It builds on the pandas basics - selecting, the .dtypes check, a little string work.

  1. Diagnose the damage

    Before changing anything, we look. The fastest tell is the combination of column names and dtypes. We load the file and print both.

    python
    import pandas as pd
    
    raw = pd.read_csv("messy_team_stats.csv")
    print("Raw columns:", list(raw.columns))
    print()
    print(raw.dtypes)
    The columns are a mess
    Raw columns: ['team ', ' league', ' win_pct ', ' runs_scored ', ' runs_allowed ', ' games']
    
    team                  str
     league               str
     win_pct              str
     runs_scored        int64
     runs_allowed     float64
     games              int64
    dtype: object

    Two problems jump out. First, the column names carry leading and trailing spaces — 'team ', ' league', ' win_pct ' — which means typing raw["team"] would raise a KeyError because the real name has a trailing space. Second, win_pct came in as str (text), not a number, so you can't average or plot it yet. Now let's look at the actual values.

    python
    print(raw.head(5).to_string())
    The values are dirty too
           team   league  win_pct    runs_scored    runs_allowed    games
    0    BRAVES       NL     64.2%            947           716.0     162
    1    Orioles      AL     62.3%            807           678.0     162
    2    Dodgers      NL     61.7%            906           699.0     162
    3       Rays      AL     61.1%            860             NaN     162
    4    Brewers      NL     56.8%            728           647.0     162

    The rot is everywhere: team names like BRAVES in all caps and others (ASTROS) padded with spaces, win percentages written as 64.2%, and a NaN sitting where the Rays' runs-allowed value should be. There's also a duplicate Braves row hiding at the bottom of the file, which we'll catch shortly. None of this is fatal — it's just a checklist.

  2. Fix the column names

    We tackle the headers first, because every later fix references a column by name. Stripping whitespace from all of them at once is a one-liner: .str.strip() applied to raw.columns trims each name. Then we make a working copy so our edits clearly belong to a clean table.

    python
    raw.columns = raw.columns.str.strip()
    df = raw.copy()

    From here on, df["team"], df["win_pct"], and the rest all resolve cleanly — no invisible spaces to trip over.

  3. Normalize the text and parse the percent

    Next the team names. We chain two string operations: .str.strip() removes the padding spaces, and .str.title() fixes the capitalization so BRAVES and ASTROS become Braves and Astros — consistent casing matters the moment you try to group or join on team name.

    python
    df["team"] = df["team"].str.strip().str.title()

    Now the percent column. A value like "64.2%" is text because of that trailing %. We strip the sign, convert the remaining text to a real number with .astype(float), then divide by 100 so it becomes a proper proportion like 0.642.

    python
    df["win_pct"] = df["win_pct"].str.replace("%", "", regex=False).astype(float) / 100

    The order is deliberate: you must remove every non-numeric character before astype(float), or the conversion throws a ValueError on the stray symbol.

  4. Handle thousands separators and the missing value

    If a runs column had arrived as text like "1,247", the comma would block conversion, so the robust habit is to strip commas before casting to a number. We do that defensively here even though this sample's runs are small.

    python
    df["runs_scored"] = df["runs_scored"].astype(str).str.replace(",", "", regex=False).astype(int)

    The blank cell is more interesting, because you have to decide what to do with missing data rather than just reformat it. First we make the column reliably numeric with pd.to_numeric(..., errors="coerce"), which turns any unparseable cell into NaN instead of crashing. Then we count the gaps and fill them — here with the column's median, a safe, outlier-resistant stand-in.

    python
    df["runs_allowed"] = pd.to_numeric(df["runs_allowed"], errors="coerce")
    missing = int(df["runs_allowed"].isna().sum())
    df["runs_allowed"] = df["runs_allowed"].fillna(df["runs_allowed"].median())

    Filling is a judgment call, not a default. The median is reasonable here, but in other situations you might drop the row, carry the previous value forward, or leave the gap explicit — what matters is choosing deliberately and being able to say how many values you touched.

  5. Drop the duplicate row

    That repeated Braves line would double-count a team in any total or average. drop_duplicates removes exact repeats; we count how many vanished and reset the index so the rows renumber cleanly from zero.

    python
    before = len(df)
    df = df.drop_duplicates().reset_index(drop=True)
    removed = before - len(df)

    Measuring the row count before and after is the cheap, honest way to confirm the operation did what you think — never assume; count.

  6. Prove the fix

    A cleaning job isn't done until you can demonstrate it worked. We report what we changed and re-check the dtypes — the same diagnostic we ran at the start, now expecting clean results.

    python
    print(f"Filled {missing} missing runs_allowed value(s) with the median.")
    print(f"Dropped {removed} duplicate row(s).")
    print()
    print(df.dtypes)
    What changed, and the new types
    Filled 1 missing runs_allowed value(s) with the median.
    Dropped 1 duplicate row(s).
    
    team                str
    league              str
    win_pct         float64
    runs_scored       int64
    runs_allowed    float64
    games             int64
    dtype: object

    One missing value filled, one duplicate removed, and every column now carries a sensible type: team and league are str, win_pct and runs_allowed are float64, and runs_scored is int64. Crucially, win_pct is no longer text. Here's the cleaned table itself.

    python
    print(df.head(5).to_string())
    The cleaned table
          team league  win_pct  runs_scored  runs_allowed  games
    0   Braves     NL    0.642          947         716.0    162
    1  Orioles     AL    0.623          807         678.0    162
    2  Dodgers     NL    0.617          906         699.0    162
    3     Rays     AL    0.611          860         721.0    162
    4  Brewers     NL    0.568          728         647.0    162

    Compare this to the raw printout from step one: Braves is title-cased, 0.642 is a real number, and the Rays' once-blank runs-allowed now reads 721.0 — the median fill we computed. The data went from unusable to analysis-ready.

  7. Confirm it with a chart

    Here's the real test of whether win_pct is genuinely numeric: try to plot it. A histogram simply cannot be drawn from text, so if this renders, the column is truly fixed. We bin the win percentages into ten buckets to see how the league's records were distributed.

    python
    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots(figsize=(7.6, 4.4))
    ax.hist(df["win_pct"], bins=10, edgecolor="#FBF7EE")
    ax.set_title("Win percentage is numeric now - so we can plot it")
    ax.set_xlabel("win percentage")
    ax.set_ylabel("number of teams")
    fig.savefig("cleaned_distribution.png", dpi=144, bbox_inches="tight")
    Histogram of 2023 MLB team win percentages in ten bins, clustered in the middle around .500 with thinner tails at the extremes
    Data: Bundled sample (deliberately messy team stats), retrieved June 2026

    The bars cluster in the middle near .500 and thin out toward the extremes — the familiar shape of a league where most teams finish somewhere around average. That the chart drew at all is the final proof: a column that was useless text a few steps ago now behaves like the number it always should have been.

Troubleshooting

KeyError: 'team' right after loading the file

The real column name has hidden whitespace — it's 'team ', not 'team'. Print list(df.columns) to see the exact names, then run df.columns = df.columns.str.strip() before referencing any column. This is always the first fix on a messy file.

ValueError: could not convert string to float: '64.2%'

astype(float) hit a character it can't parse. Remove every non-numeric symbol first — the % here, commas elsewhere — with .str.replace(...), and only then convert. The cleaning order matters: strip, then cast.

A whole numeric column is still str (object) after cleaning

One bad cell is forcing the entire column to text. Use pd.to_numeric(df["col"], errors="coerce"), which converts what it can and turns the rest into NaN instead of failing. Then count the NaNs with .isna().sum() and decide how to fill or drop them.

drop_duplicates removed nothing (or too much)

By default it only drops rows that match on every column, so a near-duplicate with one differing cell survives. If you mean "same team," pass a subset: df.drop_duplicates(subset=["team"]). Conversely, if it removed legitimate rows, your duplicate definition is too loose — narrow the subset.

Challenge yourself

Wrap the whole pipeline into a single function, clean_team_stats(path), that reads any similarly messy file and returns a tidy DataFrame. Then make it resilient: instead of assuming exactly one blank cell, have it report how many it filled per column, and add a check that warns if any column is still non-numeric after cleaning. A reusable, self-reporting cleaner is the difference between fixing one file and fixing every file like it.

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

Download the finished script (04_cleaning_messy_sports_data.py)

This script imports a small shared helper (and reads any bundled sample data) that live next to it in /downloads/ — grab these into the same folder so it runs as-is: sdt_common.py, messy_team_stats.csv.

More Foundations tutorials