Merging and Joining Two Datasets with pandas

FoundationsBeginnerPython~5 min read

What you'll build

Two separate tables joined on a shared key into one, then charted.

Two separate tables joined on a shared key into one, then charted.
Data: Bundled sample (2023 MLB standings), retrieved June 2026

A team's win-loss record comes from a standings feed. Its runs scored and allowed might come from a different endpoint. Salaries, attendance, and advanced stats each live somewhere else again. Real sports data almost never arrives in one tidy table, and the skill that stitches the pieces together — the join, which pandas spells merge — is the most useful move I know after loading a CSV. So I'll take two separate tables, join them on a shared key, and then build a chart from a column that existed in neither one alone.

This builds on Pandas for Sports Data. To keep it offline and concrete, we'll split the bundled sample_standings.csv (real 2023 MLB standings) into two tables and join them back — exactly the shape of the real problem, where wins and runs come from different places.

  1. Two tables, one shared key

    Imagine these came from two sources: a record feed with wins and losses, and a scoring feed with runs. Both identify teams by name — that shared Team column is the key we'll join on.

    python
    import pandas as pd
    
    full = pd.read_csv("sample_standings.csv")
    record = full[["Team", "Abbr", "W", "L"]]      # pretend: from a standings feed
    scoring = full[["Team", "RS", "RA"]]           # pretend: from a scoring feed

    A join only works when both tables share a column whose values line up. Here it's the team name; in the wild it might be a team ID, a player ID, or a date.

  2. Merge on the key

    merge matches rows whose key agrees and glues their columns together. The how argument decides what to do with rows that don't match: "inner" (the default) keeps only teams in both tables; "left" keeps every row of the left table; "outer" keeps everything.

    python
    merged = record.merge(scoring, on="Team", how="inner")
    merged["RunDiff"] = merged["RS"] - merged["RA"]
    print(merged.sort_values("RunDiff", ascending=False).head().to_string())
    Two source tables, then the joined result
    record table (one source):
          Team Abbr    W   L
    0   Braves  ATL  104  58
    1  Orioles  BAL  101  61
    2  Dodgers  LAD  100  62
    
    scoring table (another source):
          Team   RS   RA
    0   Braves  947  716
    1  Orioles  807  678
    2  Dodgers  906  699
    
    merged on 'Team', with a computed RunDiff:
          Team    W   RS   RA  RunDiff
    0   Braves  104  947  716      231
    2  Dodgers  100  906  699      207
    3     Rays   99  860  665      195
    7  Rangers   90  881  716      165
    5   Astros   90  827  698      129

    That last column is the whole point: RunDiff needed RS and RA from the scoring table sitting next to a row we could read — something neither table could give us alone. The 2023 Braves top the joined table at +231, the Dodgers next at +207.

  3. Chart the joined data

    With both feeds in one frame, a grouped bar chart puts runs scored and allowed side by side for the best run-differential teams.

    python
    import numpy as np
    import matplotlib.pyplot as plt
    
    top = merged.sort_values("RunDiff", ascending=False).head(8)
    x = np.arange(len(top)); w = 0.4
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.bar(x - w/2, top["RS"], w, label="runs scored")
    ax.bar(x + w/2, top["RA"], w, label="runs allowed")
    ax.set_xticks(x); ax.set_xticklabels(top["Abbr"]); ax.legend()
    fig.savefig("merged_bars.png", dpi=144, bbox_inches="tight")
    Grouped bar chart of runs scored versus runs allowed for the eight MLB teams with the best run differential in 2023, scored bars clearly taller than allowed bars
    Data: Bundled sample (2023 MLB standings), retrieved June 2026

    Every one of these teams shows a taller "scored" bar than "allowed" bar — which is exactly what a positive run differential means, now visible at a glance because the join let us draw both series together.

Troubleshooting

The merged table is missing rows I expected

An inner join drops any key that isn't in both tables. If one feed spells a team differently ("St. Louis" vs "St Louis") the keys don't match and those rows vanish. Inspect the keys on both sides, standardize spelling first, or use how="left" and look for the resulting NaNs to see exactly what failed to match.

I get columns named RS_x and RS_y

Both tables had a column with the same name (other than the key), so pandas disambiguated them with suffixes. Either drop the duplicate before merging, or pass suffixes=("_record", "_scoring") to label them meaningfully.

Rows multiplied after the merge

The key wasn't unique on one side, so every match was paired with every duplicate (a many-to-many join). Check with df["Team"].duplicated().any() and de-duplicate, or pass validate="one_to_one" to make pandas raise instead of silently exploding the row count.

Challenge yourself

Make a third tiny table by hand — say a DataFrame of just a few teams' division titles — and left-merge it onto merged. Watch how teams without a row in your small table get NaN, and how an inner join would have dropped them instead. Then try how="outer" and count the rows to prove to yourself which join keeps the most.

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

Download the finished script (44_merge_and_join_datasets_with_pandas.py)

This script imports a small shared helper (and reads any bundled sample data) that live next to it in /downloads/ — grab these into the same folder so it runs as-is: sdt_common.py.

More Foundations tutorials