The Predictive Playbook – Building the Dataset: From Raw Stats to Usable Data


🔗 See the code here: get_fantasy_football_data.ipynb on GitHub

📊 Every prediction starts with data.
Before we can talk machine learning models, feature engineering, or forecasts for the 2025 fantasy football season, we need one thing: a reliable dataset.

That’s what this post is all about — taking messy, scraped stats from Pro-Football-Reference and turning them into something structured, clean, and ready for analysis. This is the foundation of The Predictive Playbook.


🏗️ Why Build My Own Dataset?

Sure, I could download player projections from a site like ESPN or FantasyPros. But as a chemical engineer and data scientist-in-training, I wanted control:

  • Consistency – I know exactly where the numbers come from.
  • Flexibility – I can create custom features that most sites don’t track.
  • Repeatability – the code runs year after year, automatically updating the dataset.

In short: no black boxes, no hidden assumptions — just raw data transformed into something usable.


🔍 Step 1: Scraping the Stats

The notebook starts by pulling NFL player stats from 2010 through 2024. Each season, I collected:

  • Passing (yards, touchdowns, interceptions, completion %)
  • Rushing (attempts, yards, touchdowns, yards per carry)
  • Receiving (targets, catches, yards, touchdowns)
  • Miscellaneous (games played, games started, fumbles)

What you get at this stage is basically a giant wall of numbers — every player, every season, all in one place.
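Stacking the per-season pulls into that single wall of numbers is a short pandas pattern. Here's a minimal sketch — the `season_tables` dict stands in for the scraped Pro-Football-Reference pages, which the notebook pulls live:

```python
import pandas as pd

# Stand-in for the scraped tables: in the real notebook, each season's
# DataFrame comes from Pro-Football-Reference (2010 through 2024).
season_tables = {
    2023: pd.DataFrame({"Player": ["A. Back", "B. Wideout"],
                        "Rush_Yds": [1100, 45]}),
    2024: pd.DataFrame({"Player": ["A. Back", "B. Wideout"],
                        "Rush_Yds": [950, 60]}),
}

frames = []
for season, df in season_tables.items():
    df = df.copy()
    df["Season"] = season  # tag every row with the season it came from
    frames.append(df)

# Every player, every season, all in one place
all_stats = pd.concat(frames, ignore_index=True)
print(all_stats.shape)  # (4, 3)
```

Tagging each row with its season before concatenating is what makes the later lag features possible.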


🧹 Step 2: Cleaning the Mess

Scraped data is messy by nature. Think:

  • Numbers stored as text instead of integers
  • Dashes (“—”) where values should be zero
  • Duplicate entries when players switched teams mid-season

The notebook fixes all of this by:

  • Standardizing column names (e.g., Yds → Passing_Yards)
  • Converting data types so math actually works
  • Handling missing values consistently
  • Deduplicating players

It’s not glamorous, but it’s essential. Without clean data, even the best models fall apart.
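In pandas terms, those fixes look roughly like this — a sketch with made-up column names and values, not the notebook's exact code:

```python
import pandas as pd

raw = pd.DataFrame({
    "Player":   ["A. Back", "A. Back", "B. Wideout"],
    "Tm":       ["2TM", "2TM", "DAL"],        # multi-team rows from a mid-season switch
    "Rush_Yds": ["1100", "1100", "—"],        # numbers stored as text, dash instead of zero
})

clean = raw.copy()
# Dashes where values should be zero, then text -> numeric so math actually works
clean["Rush_Yds"] = (clean["Rush_Yds"]
                     .replace("—", "0")
                     .pipe(pd.to_numeric))
# Deduplicate players so a team switch doesn't count a season twice
clean = clean.drop_duplicates(subset=["Player"], keep="first")
print(clean["Rush_Yds"].sum())  # 1100
```

Without the type conversion, `"1100" + "1100"` silently becomes string concatenation instead of arithmetic — exactly the kind of bug that poisons a model downstream.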


⚙️ Step 3: Engineering Features

Raw stats are great, but they don’t tell the whole story. I created new columns like:

  • Efficiency metrics (yards per attempt, yards per catch)
  • Fantasy scoring based on standard league rules
  • Lag features like a player’s previous-year stats to capture trends

This is where the dataset shifts from “just stats” to something predictive.
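As a sketch, here's how those three kinds of features could be computed with pandas. The scoring weights are the common standard-league values (0.1 points per rushing yard, 6 per touchdown), and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Player":   ["A. Back", "A. Back", "B. Wideout", "B. Wideout"],
    "Season":   [2023, 2024, 2023, 2024],
    "Rush_Att": [250, 220, 5, 8],
    "Rush_Yds": [1100, 950, 45, 60],
    "Rush_TD":  [9, 7, 0, 1],
})

# Efficiency metric: yards per carry
df["YPC"] = df["Rush_Yds"] / df["Rush_Att"]

# Fantasy scoring (standard-league weights)
df["Fantasy_Pts"] = 0.1 * df["Rush_Yds"] + 6 * df["Rush_TD"]

# Lag feature: each player's previous-season fantasy points
df = df.sort_values(["Player", "Season"])
df["Prev_Fantasy_Pts"] = df.groupby("Player")["Fantasy_Pts"].shift(1)
```

The `groupby(...).shift(1)` idiom is what keeps one player's history from leaking into another's: the lag is computed within each player, not across the whole table.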


💾 Step 4: Saving for Later

At the end of the pipeline, the cleaned data is saved as an easy-to-open CSV file.

Why a CSV? Because it strikes the right balance:

  • Portable – can be opened in Excel, Python, R, or even Google Sheets.
  • Simple – no need to rerun the scraper every time I want to build a model.
  • Reusable – the same clean dataset can serve as the starting point for multiple projects.

In practice, this means I can skip the heavy lifting of scraping and cleaning next time. Instead, I just load the ready-to-go CSV and jump straight into modeling.
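The save-and-reload round trip is a one-liner each way. A minimal sketch (the filename is illustrative, not the notebook's actual output path):

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A. Back"], "Season": [2024], "Fantasy_Pts": [137.0]})

# End of the pipeline: persist the cleaned dataset once...
df.to_csv("player_stats_2010_2024.csv", index=False)

# ...then every later modeling session starts here, skipping the scraper entirely
loaded = pd.read_csv("player_stats_2010_2024.csv")
print(loaded.equals(df))  # True
```

`index=False` matters: without it, the row index is written as an extra unnamed column that reappears on every reload.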

This step may not sound exciting, but it’s one of the most important: it turns a messy scraping project into a repeatable workflow. The next time I run a model, I can load exactly what I need in seconds.


📝 What I Learned

  1. Scraping requires patience — a single website change can break the pipeline.
  2. Data cleaning is 80% of the work — modeling is the fun 20%.
  3. Building with reusability in mind pays off — this pipeline will work every season.

🚀 What’s Next

This was all about building the foundation: a dataset I can trust. In the next post, I’ll dive into how I start modeling — using regression and machine learning methods to actually predict fantasy points for 2025.

This is where things start to get really interesting.


➡️ Want to follow the code?
GitHub: Fantasy Football ML Project
