In the train/test tutorial we held out 25% of the games, scored the home-win classifier on them, and got 67.9%. Honest — but which 25%? Shuffle differently and that number jumps around, because a single test set is itself a small, lucky (or unlucky) sample. K-fold cross-validation fixes that: split the games into k folds, train k times — each time holding out a different fold to test on — and average. Every game gets tested exactly once, so you get a more stable score and a measure of how much it wobbles.

We reuse the tutorial-71 logistic regression and the bundled nba_ratings.csv + nba_home_results.csv, in pure numpy, offline.

Go deeper with the free textbook: Chapter 30: The Machine Learning Workflow at DataField.dev.

Split the games into k folds

Shuffle once with a fixed seed, then chop the shuffled indices into k roughly-equal folds. Each fold will take a turn as the test set.

python

import numpy as np
# ... build x_all (net-rating gap) and y_all (home win) as in tutorial 71 ...

K = 5
rng = np.random.default_rng(73)
idx = rng.permutation(len(x_all))      # shuffle once
folds = np.array_split(idx, K)         # K roughly-equal index folds

That's the entire setup. The trick is in how we loop over the folds next.

Train k times, each holding out one fold

For each fold i, the test set is fold i and the training set is everything else. We reuse the exact logistic-regression training from tutorials 71–72.

python

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def train_and_score(x_tr, y_tr, x_te, y_te):
    w, b, lr = 0.0, 0.0, 0.3
    for _ in range(400):
        p = sigmoid(w*x_tr + b)
        err = p - y_tr
        w -= lr*np.mean(err*x_tr); b -= lr*np.mean(err)
    return float(((sigmoid(w*x_te + b) >= 0.5) == y_te).mean())

scores = []
for i in range(K):
    test_idx  = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
    scores.append(train_and_score(x_all[train_idx], y_all[train_idx],
                                  x_all[test_idx],  y_all[test_idx]))

The key line is the train_idx: it stitches together the four folds that aren't the test fold. Five iterations, five independent test scores, and every game has been a test game exactly once.

Read the spread, not just the average
5-fold cross-validation
```
5-fold cross-validation (each game tested exactly once):
  fold 1: 71.3%  (test n=247)
  fold 2: 61.8%  (test n=246)
  fold 3: 66.3%  (test n=246)
  fold 4: 72.4%  (test n=246)
  fold 5: 67.5%  (test n=246)

Mean accuracy: 67.8%
Std across folds: 3.8 points  ->  about 64.1% to 71.6%
A single split could have reported anywhere in that range by luck of the draw.
```
Data: Bundled Basketball-Reference net ratings + 1,231 game results, 5-fold cross-validation, retrieved June 2026

This is the payoff. The five folds range from 61.8% to 72.4% — a 10-point spread — and average 67.8%. The single split in tutorial 72 happened to land at 67.9%, right on the mean, but it just as easily could have reported 62% or 72% depending on the shuffle. Cross-validation gives you the stable middle and tells you the honest uncertainty: this model is about 68% accurate, give or take ~4 points.
Why this is the standard

Reporting a single test number invites two mistakes: bragging about a lucky high split, or panicking over an unlucky low one. The mean of k folds is a far less biased estimate of how the model will do on new data, and the spread across folds is a free uncertainty estimate — no extra data required. That's why “5-fold CV accuracy = 67.8% ± 3.8” is how practitioners actually report a model, rather than “67.9% on my test set.” More folds (k=10) give a slightly more stable mean at more compute; k=5 is the common sweet spot.

Troubleshooting

My folds are wildly different sizes

If your data length isn't divisible by k, np.array_split handles it gracefully (some folds get one extra row). Avoid plain np.split, which errors on uneven splits. The tiny size difference doesn't matter.

One fold's score is far off the others

On small folds (~245 games) a few points of swing is normal sampling noise — that's exactly what cross-validation is revealing. If one fold is wildly off, check that you shuffled before splitting; unshuffled data with any time or team ordering can create a lopsided fold.

Should I retrain or reuse the model across folds?

Retrain from scratch every fold — that's the whole point. Reusing weights leaks information from a fold's test games into its training and inflates the score. Each fold must be trained only on its own four training folds.

Challenge yourself

Try k = 10 and watch the mean barely move while the per-fold spread changes — then push to leave-one-out (k = number of games) and see how expensive it gets. Next, wrap the whole thing in a function and use it to compare two models (one feature vs. two) on the same folds — the fair way to decide if a feature actually helps. Finally, add stratification: make each fold preserve the overall home-win rate, which steadies the estimate further on imbalanced data.

Get the code

Here's the complete, working script for this tutorial. It runs exactly as shown.

This script imports a small shared helper (and reads any bundled sample data) that live next to it in /downloads/ — grab these into the same folder so it runs as-is: sdt_common.py, sdt_nba.py.

K-Fold Cross-Validation: A Score You Can Trust

What you'll build

Split the games into k folds

Train k times, each holding out one fold

Read the spread, not just the average

Why this is the standard

Troubleshooting

Challenge yourself

Get the code

More Basketball tutorials

Pull Your First NBA Data with nba_api

Build a Team Net-Rating Dashboard Table

Draw an NBA Shot Chart with matplotlib