PGA Tour Stats Scraper

This project centres on building a fully automated web scraper that collects tournament-level statistics from the official PGA Tour website, covering data from 2004 to the present. It overcomes the platform’s manual, one-stat-at-a-time download limitation, letting users extract structured, high-quality .csv datasets for any year or date range, with support for all available stat codes. The tool is available as both a Python script and an interactive Jupyter Notebook, and the repository also includes a complete pre-scraped dataset (2004–2025) for immediate use.
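The scraping loop is straightforward in outline: fetch a stat page, parse its results table, and write the rows to CSV. A minimal standard-library sketch of the parsing step (the real scraper's endpoints, stat codes, and page markup will differ; the sample HTML here is purely illustrative):

```python
import csv
import io
from html.parser import HTMLParser

class StatTableParser(HTMLParser):
    """Collect cell text from table rows in a stats page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(" ".join(c for c in self._cell if c))
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html: str) -> str:
    """Parse the first stats table in `html` and return it as CSV text."""
    parser = StatTableParser()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()

# Illustrative fragment standing in for a fetched stats page.
sample = ("<table><tr><th>Player</th><th>Avg</th></tr>"
          "<tr><td>A. Golfer</td><td>299.1</td></tr></table>")
print(table_to_csv(sample))
```

In practice the HTML would be fetched per season and stat code (e.g. via `urllib.request`), with the CSV written to disk instead of a string buffer.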

11 May 2025 · Shaun Yap

Trump's Tariff Formula

On April 2, 2025, President Donald Trump declared ‘Liberation Day’, unveiling aggressive new tariffs designed to correct trade imbalances via a controversial formula from the U.S. Trade Representative (USTR). The equation, which aims to drive each bilateral trade balance to zero, sets rates from export-import gaps, the price elasticity of import demand, and tariff passthrough. This article explains the formula with worked examples and offers Python tools to replicate and validate the published rates.
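Stripped of notation, the USTR calculation is simple arithmetic: the bilateral deficit divided by imports (since the assumed elasticity of 4 and passthrough of 0.25 multiply to one), then halved and floored at a 10% baseline. A sketch of the formula as commonly reported; the trade figures below are approximate and purely illustrative:

```python
def reciprocal_tariff(imports: float, exports: float,
                      elasticity: float = 4.0, passthrough: float = 0.25) -> float:
    """Tariff rate the formula says would drive the bilateral deficit to zero.

    Because elasticity * passthrough = 1 under the USTR's assumed values,
    the computed rate reduces to deficit / imports.
    """
    computed = (imports - exports) / (elasticity * passthrough * imports)
    # The announced "discounted" reciprocal rate is half the computed
    # figure, floored at the 10% baseline applied to all countries.
    return max(0.10, computed / 2)

# Approximate 2024 US-China goods-trade figures (USD billions),
# illustrative values only.
rate = reciprocal_tariff(imports=438.9, exports=143.5)
print(f"{rate:.0%}")
```

With these inputs the function returns roughly 34%, matching the rate announced for China; a surplus or small-deficit partner falls back to the 10% floor.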

05 April 2025 · Shaun Yap

Evaluating Environment & Climate Truthfulness in Social Media using Deep Learning & Large Language Models (LLMs)

Awarded Best Dissertation in Cohort, this MSc project explores the detection of climate and environmental misinformation on social media using a comparative framework of traditional natural language processing techniques, deep learning, and Large Language Models (LLMs). Leveraging a web-scraped dataset from PolitiFact, the study highlights the superiority of CNNs trained on ordinal truthfulness labels, with accuracy boosted from 80.1% to 84.0% through GPT-4o-driven feature augmentation. While LLMs enhanced contextual understanding and sentiment analysis, their runtime cost posed practical limitations. The project contributes novel insights into model performance trade-offs, evaluation metrics tailored to ordinal classification, and the practical integration of LLMs for misinformation mitigation in climate discourse.
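One ingredient worth illustrating is the ordinal treatment of truthfulness labels. A common approach (a sketch of the general technique, not necessarily the dissertation's exact setup) encodes a K-level label as K-1 cumulative binary targets, so that confusing adjacent levels costs less than confusing distant ones:

```python
def ordinal_encode(label: int, n_classes: int) -> list:
    """Encode an ordinal label as K-1 cumulative binary targets.

    Target j answers "is the label greater than j?", which lets a model
    respect the ordering of the truthfulness scale.
    """
    return [1 if label > j else 0 for j in range(n_classes - 1)]

def ordinal_decode(probs: list, threshold: float = 0.5) -> int:
    """Recover a label by counting the cumulative thresholds passed."""
    return sum(1 for p in probs if p > threshold)

# Six PolitiFact-style levels: 0 = "pants on fire" ... 5 = "true"
print(ordinal_encode(3, 6))                       # [1, 1, 1, 0, 0]
print(ordinal_decode([0.9, 0.8, 0.6, 0.2, 0.1]))  # 3
```

A CNN trained on these targets with K-1 sigmoid outputs penalises distant misclassifications more than near misses, unlike plain softmax over unordered classes.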

01 September 2024 · Shaun Yap

Kaggle Competition - Flood Prediction EDA

This project presents a rigorous exploratory analysis of a large-scale (>1,000,000 training datapoints) Kaggle flood prediction dataset. It highlights strong skills in handling high-dimensional structured data, performing scalable EDA with efficient visualisations, and applying both statistical and machine learning techniques for insight generation. Key competencies include dimensionality reduction (PCA, UMAP, t-SNE), correlation analysis, feature distribution comparison, and model interpretation using scikit-learn and statsmodels. The project also showcases custom cross-validation tooling, effective use of pipelines for reproducible modelling, and the derivation of a simplified additive model, reflecting a deep understanding of linear structures in high-volume data.
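Custom cross-validation tooling of the kind described can be sketched in a few lines of standard-library Python (a generic K-fold splitter for illustration, not the project's actual implementation):

```python
import random

def k_fold_indices(n_samples: int, k: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    A shuffled copy of the indices is sliced into k near-equal folds;
    each fold serves exactly once as the validation set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder across the first few folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        val = idx[start:stop]
        train = idx[:start] + idx[stop:]
        yield train, val
        start = stop

for train, val in k_fold_indices(10, k=3):
    print(len(train), len(val))
```

At this dataset's scale (>1M rows) the same generator works over row positions, so each fold can be materialised lazily rather than copying the data k times.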

03 May 2024 · Shaun Yap

Clustering Countries

This project showcases advanced data science and statistical analysis skills through a comprehensive clustering analysis of country-level socio-economic and health data using both hierarchical and k-means clustering methods. Key skills demonstrated include robust data preprocessing, detailed exploratory data analysis with insightful visualisations, Z-score standardisation, PCA for dimensionality reduction, and effective interpretation of cluster structures. The analysis incorporates evaluation metrics such as silhouette and Calinski-Harabasz scores for optimal cluster selection, uses distance metrics (Manhattan and Euclidean), and applies cluster-based inference to identify countries in need of development aid. Additionally, the fitted models are applied to new data to produce a prioritised aid strategy using PCA projections and quantitative scoring.
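For readers unfamiliar with the method, k-means reduces to a short loop of assign-and-update steps. A minimal standard-library sketch on toy 2-D points (illustrative only; the project standardises features first and selects k via silhouette and Calinski-Harabasz scores):

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: assign points to the nearest centroid,
    then move each centroid to its cluster mean, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster goes empty.
        centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated blobs: the assignment should recover them.
pts = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On country-level data the same loop runs over standardised indicator vectors, with Manhattan distance as an alternative metric and the resulting clusters ranked for aid prioritisation.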

10 January 2024 · Shaun Yap