A data project driven by a love for golf, statistics, and a curiosity: can data outsmart the bookmakers?
From Tee Box to Text Editor
I used to play competitive golf, representing Kuala Lumpur, Malaysia’s capital, at the national level. I wasn’t the biggest hitter or the most naturally gifted - I was nearly a foot shorter than the average player in my age group (thanks to a late growth spurt and some unfortunate date cutoffs). So I leaned into what I could control: statistics-based preparation and deliberate course management.
That edge took me further than talent alone might have.
These days, I’m no longer grinding out 36-hole weekends, but my love for golf - and for data - hasn’t gone anywhere. Lately, that’s taken a new form: sports betting. Nothing serious - just small-stakes wagers with friends - but over the past year, I’ve seen a surprising 150% return.
I’ll be the first to admit: a good portion of that was luck. But it got me wondering…
What would it take to build a model that doesn’t just predict results - but consistently outperforms the bookmakers’ implied odds?
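A quick note on what "implied odds" means here: a decimal price of x implies a win probability of 1/x, and because a market's implied probabilities sum to more than 100%, a model has to beat the bookmaker's built-in margin, not just the fair probabilities. Here is a toy illustration with made-up prices for a two-player matchup:

```python
# Toy example with made-up prices for a two-player tournament matchup:
# convert decimal odds to implied probabilities and measure the
# bookmaker's margin (overround) that a model would need to overcome.
decimal_odds = {"Player A": 1.91, "Player B": 1.91}

implied = {player: 1 / odds for player, odds in decimal_odds.items()}
overround = sum(implied.values()) - 1  # margin baked into the prices

for player, prob in implied.items():
    print(f"{player}: {prob:.1%} implied win probability")
print(f"Bookmaker margin: {overround:.1%}")  # roughly 4.7% on these prices
```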
Where Do You Even Start?
Like any data scientist, I started with the data.
Sites like DataGolf and FantasyNational - along with others frequently cited on Reddit - were promising, but sat behind paywalls. That left me with one major free alternative: the official PGA Tour statistics site.
It's a goldmine, offering data back to 2004 - but it's also a logistical nightmare. Every CSV must be downloaded manually, one at a time, by selecting a specific stat for a specific tournament.
To put this in perspective:
- ~350 stats per tournament
- ~39 events per season
- That’s roughly 13,650 CSV files for a single season
There weren’t any public scrapers available. Most users either paid for third-party access or gave up.
So I Built One
To solve this, I built a web scraper that automates collection of all ‘Tournament Only’ statistics from the PGA Tour website - across any season from 2004 to the present.
It enables flexible, large-scale data collection without needing manual intervention.
Key Features
- Scrape across any date range or season between 2004 and the present
- Select specific metrics using the Stat Code Reference List (e.g. 02564 = Strokes Gained: Putting)
- Export clean .csv files, organised by:
  - Stat category
  - Season or date range
- Run as either (see the usage sketch after this list):
  - Python script (CLI)
  - Jupyter Notebook (interactive)
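To make that concrete, here is a minimal sketch of what an interactive run might look like. The module name, function name, and parameters below are hypothetical placeholders rather than the repository's actual interface - consult the README and the Stat Code Reference List for the real usage.

```python
# Hypothetical interactive usage - the module, function, and parameter
# names are illustrative placeholders, not the repository's actual API.
from pga_scraper import scrape_stats  # hypothetical import

# Pull Strokes Gained: Putting (stat code 02564) for the 2023 season and
# write one clean CSV per tournament, organised by stat and season.
scrape_stats(
    stat_codes=["02564"],   # codes from the Stat Code Reference List
    seasons=[2023],         # any season from 2004 to the present
    output_dir="data",      # hypothetical output layout: data/<stat>/<season>/
)
```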
In addition, the GitHub repository includes fully pre-scraped data covering 01/01/2004 – 10/05/2025, so you can start exploring immediately - no scraping required.
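Once the repository is cloned, exploring one of those pre-scraped files is a one-liner with pandas. The file path below is illustrative - the actual folder layout and file naming are described in the repository.

```python
import pandas as pd

# Illustrative path - see the repository for the real folder layout.
sg_putting = pd.read_csv("data/02564_strokes_gained_putting/2023.csv")
print(sg_putting.head())
```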
Why This Matters
There’s growing interest in using data to inform sports predictions - but too often, the data itself is locked away, incomplete, or poorly documented. By making this tool (and the data) openly available, I hope to support others who want to:
- Build predictive models
- Investigate performance trends
- Analyse course-specific player stats
- Create more transparent, hands-on sports analytics projects
In short, this is about lowering the barrier to entry - and encouraging exploration.
What’s Next?
This project lays the groundwork by solving the data access problem - but the real challenge (and excitement) lies ahead. My next step will focus on building predictive models using this dataset, exploring whether performance metrics can reliably anticipate outcomes - and how closely those outcomes align (or diverge) from bookmaker odds.
Stay tuned for the modelling phase.
Explore the GitHub Repository
(Includes full codebase, setup instructions, stat references, and a complete pre-scraped dataset)
Saved some money using this instead of buying data elsewhere?
Consider supporting via Buy Me a Coffee - thank you!