K-Fold Cross-Validation: A Score You Can Trust

Basketball | Advanced | Python | ~5 min read

What you'll build

Replace tutorial 72's single train/test split with 5-fold cross-validation - train five times, test on every game exactly once - to get a stable accuracy estimate (67.8%) and, just as important, its spread across folds.

Data: Bundled Basketball-Reference net ratings + 1,231 game results, 5-fold cross-validation, retrieved June 2026

In the train/test tutorial we held out 25% of the games, scored the home-win classifier on them, and got 67.9%. Honest — but which 25%? Shuffle differently and that number jumps around, because a single test set is itself a small, lucky (or unlucky) sample. K-fold cross-validation fixes that: split the games into k folds, train k times — each time holding out a different fold to test on — and average. Every game gets tested exactly once, so you get a more stable score and a measure of how much it wobbles.

We reuse the tutorial-71 logistic regression and the bundled nba_ratings.csv + nba_home_results.csv, in pure numpy, offline.

Go deeper with the free textbook: Chapter 30: The Machine Learning Workflow at DataField.dev.

  1. Split the games into k folds

    Shuffle once with a fixed seed, then chop the shuffled indices into k roughly-equal folds. Each fold will take a turn as the test set.

    python
    import numpy as np
    # ... build x_all (net-rating gap) and y_all (home win) as in tutorial 71 ...
    
    K = 5
    rng = np.random.default_rng(73)
    idx = rng.permutation(len(x_all))      # shuffle once
    folds = np.array_split(idx, K)         # K roughly-equal index folds

    That's the entire setup. The trick is in how we loop over the folds next.
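As a quick sanity check on the split, the sketch below rebuilds the folds for 1,231 games (the count quoted in the data note) without needing the real features. The variable names fold_sizes and covers_everything are illustrative, not part of the tutorial script.

```python
import numpy as np

n_games, K = 1231, 5                      # game count from the data note; K as in the tutorial
rng = np.random.default_rng(73)
idx = rng.permutation(n_games)            # shuffle once
folds = np.array_split(idx, K)            # K roughly-equal index folds

# 1231 = 5 * 246 + 1, so the first fold receives the one leftover game.
fold_sizes = [len(f) for f in folds]
print(fold_sizes)                         # [247, 246, 246, 246, 246]

# Every game index should appear in exactly one fold.
covers_everything = np.array_equal(np.sort(np.concatenate(folds)), np.arange(n_games))
print(covers_everything)                  # True
```

These sizes match the test n values reported per fold in step 3.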

  2. Train k times, each holding out one fold

    For each fold i, the test set is fold i and the training set is everything else. We reuse the exact logistic-regression training from tutorials 71–72.

    python
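    def sigmoid(z):
        """Map a logit to a probability in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))
    
    def train_and_score(x_tr, y_tr, x_te, y_te):
        """Fit a one-feature logistic regression on the training folds; return test accuracy."""
        w, b, lr = 0.0, 0.0, 0.3           # fresh weights on every call
        for _ in range(400):               # plain batch gradient descent
            p = sigmoid(w*x_tr + b)
            err = p - y_tr                 # gradient of the log loss w.r.t. the logit
            w -= lr*np.mean(err*x_tr)
            b -= lr*np.mean(err)
        pred = sigmoid(w*x_te + b) >= 0.5  # predict a home win at probability >= 0.5
        return float((pred == y_te).mean())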
    
    scores = []
    for i in range(K):
        test_idx  = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        scores.append(train_and_score(x_all[train_idx], y_all[train_idx],
                                      x_all[test_idx],  y_all[test_idx]))

    The key line is train_idx: it stitches together the four folds that aren't the test fold. Five iterations yield five test scores on non-overlapping test sets, and every game has been a test game exactly once.
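If you want to convince yourself that the index bookkeeping is right, a small check like the one below can help. It is a sketch that only exercises the indices (no model is trained), and times_tested is a hypothetical counter added for illustration.

```python
import numpy as np

n_games, K = 1231, 5
rng = np.random.default_rng(73)
folds = np.array_split(rng.permutation(n_games), K)

times_tested = np.zeros(n_games, dtype=int)   # how often each game lands in a test fold
for i in range(K):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
    # No game may sit on both sides of the split.
    assert np.intersect1d(train_idx, test_idx).size == 0
    # Train and test together must account for every game.
    assert train_idx.size + test_idx.size == n_games
    times_tested[test_idx] += 1

print(times_tested.min(), times_tested.max())  # 1 1
```

A minimum and maximum of 1 confirms that each game is tested exactly once and never appears in its own training set.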

  3. Read the spread, not just the average

    5-fold cross-validation (each game tested exactly once):
      fold 1: 71.3%  (test n=247)
      fold 2: 61.8%  (test n=246)
      fold 3: 66.3%  (test n=246)
      fold 4: 72.4%  (test n=246)
      fold 5: 67.5%  (test n=246)
    
    Mean accuracy: 67.8%
    Std across folds: 3.8 points  ->  about 64.1% to 71.6%
    A single split could have reported anywhere in that range by luck of the draw.
    
    [Figure: bar chart of accuracy for each of the five folds, ranging from 61.8% to 72.4%, with a mean line at 67.8% and a shaded band of plus or minus one standard deviation.]

    This is the payoff. The five folds range from 61.8% to 72.4%, a spread of more than 10 points, and average 67.8%. The single split in tutorial 72 happened to land at 67.9%, right on the mean, but it could just as easily have reported 62% or 72% depending on the shuffle. Cross-validation gives you the stable middle and an honest measure of uncertainty: this model is about 68% accurate, give or take roughly 4 points.
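The summary numbers can be roughly reproduced from the five printed fold scores. This sketch uses the rounded values from the output, so the last digit may differ from the script, which presumably averages unrounded scores. Using np.std with its default ddof=0 (population standard deviation) is an assumption; it happens to match the reported 3.8, while the sample version (ddof=1) comes out somewhat larger.

```python
import numpy as np

# Per-fold accuracies as printed in the output (rounded to 0.1 points).
fold_acc = np.array([71.3, 61.8, 66.3, 72.4, 67.5])

mean_acc = fold_acc.mean()
std_pop = fold_acc.std()            # ddof=0: population standard deviation
std_sample = fold_acc.std(ddof=1)   # ddof=1: sample standard deviation

print(f"mean {mean_acc:.2f}, std(ddof=0) {std_pop:.2f}, std(ddof=1) {std_sample:.2f}")
# mean 67.86, std(ddof=0) 3.79, std(ddof=1) 4.23
```

With only five folds the choice of ddof visibly changes the reported spread, so it is worth stating which one you used.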

  4. Why this is the standard

    Reporting a single test number invites two mistakes: bragging about a lucky high split, or panicking over an unlucky low one. The mean of k folds is a far more stable (lower-variance) estimate of how the model will do on new data than any single split, and the spread across folds is a free uncertainty estimate, with no extra data required. That's why “5-fold CV accuracy = 67.8% ± 3.8 points” is how practitioners actually report a model, rather than “67.9% on my test set.” More folds (k=10) give a slightly more stable mean at the cost of more compute; k=5 is the common sweet spot.

Troubleshooting

My folds are wildly different sizes

If your data length isn't divisible by k, np.array_split handles it gracefully (some folds get one extra row). Avoid plain np.split, which errors on uneven splits. The tiny size difference doesn't matter.
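The difference between the two functions is easy to see on a toy array. This is a minimal illustration with 10 made-up rows and 3 folds, not the tutorial's data.

```python
import numpy as np

rows = np.arange(10)                  # 10 rows do not divide evenly into 3 folds

uneven = np.array_split(rows, 3)      # tolerates the remainder
sizes = [len(part) for part in uneven]
print(sizes)                          # [4, 3, 3]

try:
    np.split(rows, 3)                 # insists on an equal division
    split_failed = False
except ValueError as exc:
    split_failed = True
    print("np.split raised:", exc)
```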

One fold's score is far off the others

On small folds (~246 games) a few points of swing is normal sampling noise, and that is exactly what cross-validation is revealing. If one fold is wildly off, check that you shuffled before splitting; unshuffled data with any time or team ordering can create a lopsided fold.
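To gauge how much swing counts as normal, a back-of-the-envelope binomial standard error helps. This sketch assumes each game is an independent trial with a true accuracy near the 67.8% cross-validated mean; that is an idealization, not something the tutorial script computes.

```python
import math

p = 0.678   # assumed true accuracy (the cross-validated mean)
n = 246     # games in a typical test fold

# Standard error of an accuracy measured on n independent games, in percentage points.
se_points = 100 * math.sqrt(p * (1 - p) / n)
print(f"{se_points:.1f} points")   # 3.0 points
```

Roughly 3 points of fold-to-fold swing is expected from fold size alone, which is the same order as the 3.8-point standard deviation observed across the five folds.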

Should I retrain or reuse the model across folds?

Retrain from scratch every fold; that's the whole point. Weights carried over from an earlier fold were fit on data that includes the current fold's test games, so reusing them leaks test information into training and inflates the score. Each fold's model must be trained only on its own four training folds.

Challenge yourself

Try k = 10 and watch the mean barely move while the per-fold spread changes — then push to leave-one-out (k = number of games) and see how expensive it gets. Next, wrap the whole thing in a function and use it to compare two models (one feature vs. two) on the same folds — the fair way to decide if a feature actually helps. Finally, add stratification: make each fold preserve the overall home-win rate, which steadies the estimate further on imbalanced data.
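As a starting point for the stratification challenge, one possible approach is to shuffle and split each class separately, then glue the pieces together fold by fold. The function name stratified_folds and the 60/40 toy labels are hypothetical; this is a sketch of the idea, not the tutorial's own solution.

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Return k index arrays whose class mix roughly matches that of y."""
    rng = np.random.default_rng(seed)
    parts_per_fold = [[] for _ in range(k)]
    for cls in np.unique(y):
        # Shuffle the indices of this class, then deal them out across the k folds.
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        for i, part in enumerate(np.array_split(cls_idx, k)):
            parts_per_fold[i].append(part)
    return [np.concatenate(parts) for parts in parts_per_fold]

# Toy labels: 60 home wins and 40 home losses (made-up numbers).
y_demo = np.array([1] * 60 + [0] * 40)
folds = stratified_folds(y_demo, k=5)
rates = [float(y_demo[f].mean()) for f in folds]
print(rates)   # [0.6, 0.6, 0.6, 0.6, 0.6]
```

The returned list can stand in for the folds built in step 1, so the training loop from step 2 would not need to change.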

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

Download the finished script (73_kfold_cross_validation.py)

This script imports small shared helpers (and reads any bundled sample data) that live next to it in /downloads/. Save these into the same folder so it runs as-is: sdt_common.py, sdt_nba.py.
