🔗 See the code here: get_fantasy_football_data.ipynb on GitHub
📊 Every prediction starts with data.
Before we can talk machine learning models, feature engineering, or forecasts for the 2025 fantasy football season, we need one thing: a reliable dataset.
That’s what this post is all about — taking messy, scraped stats from Pro-Football-Reference and turning them into something structured, clean, and ready for analysis. This is the foundation of The Predictive Playbook.
🏗️ Why Build My Own Dataset?
Sure, I could download player projections from a site like ESPN or FantasyPros. But as a chemical engineer and data scientist-in-training, I wanted control:
- Consistency – I know exactly where the numbers come from.
- Flexibility – I can create custom features that most sites don’t track.
- Repeatability – the code runs year after year, automatically updating the dataset.
In short: no black boxes, no hidden assumptions — just raw data transformed into something usable.
🔍 Step 1: Scraping the Stats
The notebook starts by pulling NFL player stats from 2010 through 2024. For each season, I collected:
- Passing (yards, touchdowns, interceptions, completion %)
- Rushing (attempts, yards, touchdowns, yards per carry)
- Receiving (targets, catches, yards, touchdowns)
- Miscellaneous (games played, games started, fumbles)
What you get at this stage is basically a giant wall of numbers — every player, every season, all in one place.
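The scraping step can be sketched roughly like this. The URL pattern and the use of `pandas.read_html` are assumptions about how the notebook pulls tables from Pro-Football-Reference; the actual code may differ:

```python
import pandas as pd

BASE = "https://www.pro-football-reference.com"

def season_url(year: int, category: str) -> str:
    # Assumed URL pattern, e.g. /years/2024/passing.htm
    return f"{BASE}/years/{year}/{category}.htm"

def scrape_season(year: int, category: str) -> pd.DataFrame:
    # read_html parses every <table> on the page; the main stat table comes first
    df = pd.read_html(season_url(year, category))[0]
    df["Season"] = year  # tag each row with its season
    return df

# One frame per season (2010-2024), stacked into a single table:
# passing = pd.concat(
#     [scrape_season(y, "passing") for y in range(2010, 2025)],
#     ignore_index=True,
# )
```

Repeating the same loop for rushing, receiving, and miscellaneous tables produces the "giant wall of numbers" described above.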
🧹 Step 2: Cleaning the Mess
Scraped data is messy by nature. Think:
- Numbers stored as text instead of integers
- Dashes (“—”) where values should be zero
- Duplicate entries when players switched teams mid-season
The notebook fixes all of this by:
- Standardizing column names (Yds → Passing_Yards)
- Converting data types so math actually works
- Handling missing values consistently
- Deduplicating players
It’s not glamorous, but it’s essential. Without clean data, even the best models fall apart.
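A minimal sketch of that cleaning pass, with illustrative column names (the real scrape's headers and the notebook's exact logic may differ):

```python
import pandas as pd

def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Standardize column names (mapping here is illustrative)
    df = df.rename(columns={"Yds": "Passing_Yards", "TD": "Passing_TDs"})
    # Everything that isn't an identifier should be numeric
    stat_cols = [c for c in df.columns if c not in ("Player", "Tm", "Season")]
    for col in stat_cols:
        # "—" and other non-numeric strings become NaN instead of raising
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # A missing stat means the player recorded nothing: treat it as zero
    df[stat_cols] = df[stat_cols].fillna(0)
    # Players traded mid-season can appear more than once per year
    df = df.drop_duplicates(subset=["Player", "Season"], keep="first")
    return df
```

The key idea is that each fix (renaming, coercing types, filling gaps, deduplicating) is a single deterministic step, so the same function cleans every season identically.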
⚙️ Step 3: Engineering Features
Raw stats are great, but they don’t tell the whole story. I created new columns like:
- Efficiency metrics (yards per attempt, yards per catch)
- Fantasy scoring based on standard league rules
- Lag features like a player’s previous-year stats to capture trends
This is where the dataset shifts from “just stats” to something predictive.
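The feature-engineering step might look something like the sketch below. The scoring weights follow common standard (non-PPR) rules, but the exact league settings, column names, and lag construction are assumptions, not the notebook's verbatim code:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Efficiency metric: yards per carry, guarding against zero attempts
    attempts = df["Rushing_Att"].replace(0, np.nan)
    df["Yards_Per_Carry"] = (df["Rushing_Yards"] / attempts).fillna(0.0)
    # Standard scoring (assumed): 1 pt / 25 pass yds, 4 per pass TD,
    # 1 pt / 10 rush+rec yds, 6 per rush/rec TD, -2 per INT
    df["Fantasy_Points"] = (
        df["Passing_Yards"] / 25 + df["Passing_TDs"] * 4
        + (df["Rushing_Yards"] + df["Receiving_Yards"]) / 10
        + (df["Rushing_TDs"] + df["Receiving_TDs"]) * 6
        - df["Interceptions"] * 2
    )
    # Lag feature: each player's previous-season points, to capture trends
    df = df.sort_values(["Player", "Season"])
    df["Prev_Fantasy_Points"] = df.groupby("Player")["Fantasy_Points"].shift(1)
    return df
```

The lag column is what makes the dataset predictive: a model trained on season N features can learn from what each player did in season N-1.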
💾 Step 4: Saving for Later
At the end of the pipeline, the cleaned data is saved as an easy-to-open CSV file.
Why a CSV? Because it strikes the right balance:
- Portable – can be opened in Excel, Python, R, or even Google Sheets.
- Simple – no need to rerun the scraper every time I want to build a model.
- Reusable – the same clean dataset can serve as the starting point for multiple projects.
In practice, this means I can skip the heavy lifting of scraping and cleaning next time: I just load the ready-to-go CSV and jump straight into modeling.
This step may not sound exciting, but it's one of the most important: it turns a messy scraping project into a repeatable workflow that hands me exactly the data I need in seconds.
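The save-and-reload round trip is two lines of pandas. The filename here is an assumption; a toy frame stands in for the cleaned dataset:

```python
import pandas as pd

# Toy frame standing in for the full cleaned dataset
df = pd.DataFrame({"Player": ["A"], "Season": [2024], "Fantasy_Points": [27.0]})

# index=False keeps the row index out of the file (filename is assumed)
df.to_csv("fantasy_stats_2010_2024.csv", index=False)

# In a later session, skip scraping entirely and load directly:
loaded = pd.read_csv("fantasy_stats_2010_2024.csv")
```

Because CSV is plain text, the same file opens equally well in Excel, R, or Google Sheets, which is exactly the portability argument above.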
📝 What I Learned
- Scraping requires patience — a single website change can break the pipeline.
- Data cleaning is 80% of the work — modeling is the fun 20%.
- Building with reusability in mind pays off — this pipeline will work every season.
🚀 What’s Next
This was all about building the foundation: a dataset I can trust. In the next post, I’ll dive into how I start modeling — using regression and machine learning methods to actually predict fantasy points for 2025.
This is where things start to get really interesting.
➡️ Want to follow the code?
GitHub: Fantasy Football ML Project
