The Predictive Playbook – Building the Dataset: From Raw Stats to Usable Data


🔗 See the code here: get_fantasy_football_data.ipynb on GitHub

📊 Every prediction starts with data.
Before we can talk machine learning models, feature engineering, or forecasts for the 2025 fantasy football season, we need one thing: a reliable dataset.

That’s what this post is all about — taking messy, scraped stats from Pro-Football-Reference and turning them into something structured, clean, and ready for analysis. This is the foundation of The Predictive Playbook.


🏗️ Why Build My Own Dataset?

Sure, I could download player projections from a site like ESPN or FantasyPros. But as a chemical engineer and data scientist-in-training, I wanted control:

  • Consistency – I know exactly where the numbers come from.
  • Flexibility – I can create custom features that most sites don’t track.
  • Repeatability – the code runs year after year, automatically updating the dataset.

In short: no black boxes, no hidden assumptions — just raw data transformed into something usable.


🔍 Step 1: Scraping the Stats

The notebook starts by pulling NFL player stats from 2010 through 2024. Each season, I collected:

  • Passing (yards, touchdowns, interceptions, completion %)
  • Rushing (attempts, yards, touchdowns, yards per carry)
  • Receiving (targets, catches, yards, touchdowns)
  • Miscellaneous (games played, games started, fumbles)

What you get at this stage is basically a giant wall of numbers — every player, every season, all in one place.
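Stacking the per-season pulls into that single wall of numbers is a short pandas pattern. Here's a minimal sketch — the `season_tables` dict stands in for the scraped Pro-Football-Reference pages, which the notebook pulls live:

```python
import pandas as pd

# Stand-in for the scraped tables: in the real notebook, each season's
# DataFrame comes from Pro-Football-Reference (2010 through 2024).
season_tables = {
    2023: pd.DataFrame({"Player": ["A. Back", "B. Wideout"],
                        "Rush_Yds": [1100, 45]}),
    2024: pd.DataFrame({"Player": ["A. Back", "B. Wideout"],
                        "Rush_Yds": [950, 60]}),
}

frames = []
for season, df in season_tables.items():
    df = df.copy()
    df["Season"] = season  # tag every row with the season it came from
    frames.append(df)

# Every player, every season, all in one place
all_stats = pd.concat(frames, ignore_index=True)
print(all_stats.shape)  # (4, 3)
```

Tagging each row with its season before concatenating is what makes the later lag features possible.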


🧹 Step 2: Cleaning the Mess

Scraped data is messy by nature. Think:

  • Numbers stored as text instead of integers
  • Dashes (“—”) where values should be zero
  • Duplicate entries when players switched teams mid-season

The notebook fixes all of this by:

  • Standardizing column names (e.g., Yds → Passing_Yards)
  • Converting data types so math actually works
  • Handling missing values consistently
  • Deduplicating players

It’s not glamorous, but it’s essential. Without clean data, even the best models fall apart.
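In pandas terms, those fixes look roughly like this — a sketch with made-up column names and values, not the notebook's exact code:

```python
import pandas as pd

raw = pd.DataFrame({
    "Player":   ["A. Back", "A. Back", "B. Wideout"],
    "Tm":       ["2TM", "2TM", "DAL"],        # multi-team rows from a mid-season switch
    "Rush_Yds": ["1100", "1100", "—"],        # numbers stored as text, dash instead of zero
})

clean = raw.copy()
# Dashes where values should be zero, then text -> numeric so math actually works
clean["Rush_Yds"] = (clean["Rush_Yds"]
                     .replace("—", "0")
                     .pipe(pd.to_numeric))
# Deduplicate players so a team switch doesn't count a season twice
clean = clean.drop_duplicates(subset=["Player"], keep="first")
print(clean["Rush_Yds"].sum())  # 1100
```

Without the type conversion, `"1100" + "1100"` silently becomes string concatenation instead of arithmetic — exactly the kind of bug that poisons a model downstream.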


⚙️ Step 3: Engineering Features

Raw stats are great, but they don’t tell the whole story. I created new columns like:

  • Efficiency metrics (yards per attempt, yards per catch)
  • Fantasy scoring based on standard league rules
  • Lag features like a player’s previous-year stats to capture trends

This is where the dataset shifts from “just stats” to something predictive.
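As a sketch, here's how those three kinds of features could be computed with pandas. The scoring weights are the common standard-league values (0.1 points per rushing yard, 6 per touchdown), and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Player":   ["A. Back", "A. Back", "B. Wideout", "B. Wideout"],
    "Season":   [2023, 2024, 2023, 2024],
    "Rush_Att": [250, 220, 5, 8],
    "Rush_Yds": [1100, 950, 45, 60],
    "Rush_TD":  [9, 7, 0, 1],
})

# Efficiency metric: yards per carry
df["YPC"] = df["Rush_Yds"] / df["Rush_Att"]

# Fantasy scoring (standard-league weights)
df["Fantasy_Pts"] = 0.1 * df["Rush_Yds"] + 6 * df["Rush_TD"]

# Lag feature: each player's previous-season fantasy points
df = df.sort_values(["Player", "Season"])
df["Prev_Fantasy_Pts"] = df.groupby("Player")["Fantasy_Pts"].shift(1)
```

The `groupby(...).shift(1)` idiom is what keeps one player's history from leaking into another's: the lag is computed within each player, not across the whole table.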


💾 Step 4: Saving for Later

At the end of the pipeline, the cleaned data is saved as an easy-to-open CSV file.

Why a CSV? Because it strikes the right balance:

  • Portable – can be opened in Excel, Python, R, or even Google Sheets.
  • Simple – no need to rerun the scraper every time I want to build a model.
  • Reusable – the same clean dataset can serve as the starting point for multiple projects.

In practice, this means I can skip the heavy lifting of scraping and cleaning next time. Instead, I just load the ready-to-go CSV and jump straight into modeling.
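The save-and-reload round trip is a one-liner each way. A minimal sketch (the filename is illustrative, not the notebook's actual output path):

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A. Back"], "Season": [2024], "Fantasy_Pts": [137.0]})

# End of the pipeline: persist the cleaned dataset once...
df.to_csv("player_stats_2010_2024.csv", index=False)

# ...then every later modeling session starts here, skipping the scraper entirely
loaded = pd.read_csv("player_stats_2010_2024.csv")
print(loaded.equals(df))  # True
```

`index=False` matters: without it, the row index is written as an extra unnamed column that reappears on every reload.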

This step may not sound exciting, but it’s one of the most important: it turns a messy scraping project into a repeatable workflow. The next time I run a model, I can load exactly what I need in seconds.


📝 What I Learned

  1. Scraping requires patience — a single website change can break the pipeline.
  2. Data cleaning is 80% of the work — modeling is the fun 20%.
  3. Building with reusability in mind pays off — this pipeline will work every season.

🚀 What’s Next

This was all about building the foundation: a dataset I can trust. In the next post, I’ll dive into how I start modeling — using regression and machine learning methods to actually predict fantasy points for 2025.

This is where things start to get really interesting.


➡️ Want to follow the code?
GitHub: Fantasy Football ML Project
