Link to the Kaggle Competition
https://www.kaggle.com/competitions/playground-series-s4e5/overview
# Standard library
import datetime
from collections import defaultdict

# Data manipulation
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from colorama import Fore, Style  # coloured console output in cross_validate

# Machine learning and modelling
from sklearn.base import clone
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor  # Optional: if used later
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, SplineTransformer
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# UMAP for dimensionality reduction
import umap

# Statistical modelling
import statsmodels.api as sm
Loading the Data
We begin by loading the dataset and observing its structure. The training data consists of over one million rows and twenty features.
Key Insights:
- Given the dataset’s size, it’s important to select algorithms that scale efficiently with large volumes of data. Methods reliant on distance computations (e.g., k-nearest neighbors) or those involving kernel matrices may not be computationally feasible.
- To optimise training time, especially during experimentation, it may be practical to use a simple train–test split instead of full cross-validation; a sketch follows the data preview below. While five-fold cross-validation offers robust performance estimates, it can be prohibitively time-consuming at this scale.
train = pd.read_csv('train.csv', index_col='id')
test = pd.read_csv('test.csv', index_col='id')
initial_features = list(test.columns)
train
id | MonsoonIntensity | TopographyDrainage | RiverManagement | Deforestation | Urbanization | ClimateChange | DamsQuality | Siltation | AgriculturalPractices | Encroachments | ... | DrainageSystems | CoastalVulnerability | Landslides | Watersheds | DeterioratingInfrastructure | PopulationScore | WetlandLoss | InadequatePlanning | PoliticalFactors | FloodProbability
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 5 | 8 | 5 | 8 | 6 | 4 | 4 | 3 | 3 | 4 | ... | 5 | 3 | 3 | 5 | 4 | 7 | 5 | 7 | 3 | 0.445 |
1 | 6 | 7 | 4 | 4 | 8 | 8 | 3 | 5 | 4 | 6 | ... | 7 | 2 | 0 | 3 | 5 | 3 | 3 | 4 | 3 | 0.450 |
2 | 6 | 5 | 6 | 7 | 3 | 7 | 1 | 5 | 4 | 5 | ... | 7 | 3 | 7 | 5 | 6 | 8 | 2 | 3 | 3 | 0.530 |
3 | 3 | 4 | 6 | 5 | 4 | 8 | 4 | 7 | 6 | 8 | ... | 2 | 4 | 7 | 4 | 4 | 6 | 5 | 7 | 5 | 0.535 |
4 | 5 | 3 | 2 | 6 | 4 | 4 | 3 | 3 | 3 | 3 | ... | 2 | 2 | 6 | 6 | 4 | 1 | 2 | 3 | 5 | 0.415 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1117952 | 3 | 3 | 4 | 10 | 4 | 5 | 5 | 7 | 10 | 4 | ... | 7 | 8 | 7 | 2 | 2 | 1 | 4 | 6 | 4 | 0.495 |
1117953 | 2 | 2 | 4 | 3 | 9 | 5 | 8 | 1 | 3 | 5 | ... | 9 | 4 | 4 | 3 | 7 | 4 | 9 | 4 | 5 | 0.480 |
1117954 | 7 | 3 | 9 | 4 | 6 | 5 | 9 | 1 | 3 | 4 | ... | 5 | 5 | 5 | 5 | 6 | 5 | 5 | 2 | 4 | 0.485 |
1117955 | 7 | 3 | 3 | 7 | 5 | 2 | 3 | 4 | 6 | 4 | ... | 6 | 8 | 5 | 3 | 4 | 6 | 7 | 6 | 4 | 0.495 |
1117956 | 4 | 5 | 6 | 9 | 5 | 5 | 2 | 8 | 4 | 5 | ... | 4 | 8 | 6 | 5 | 5 | 6 | 7 | 7 | 8 | 0.560 |
1117957 rows × 21 columns
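As noted in the key insights above, a single hold-out split is a cheap stand-in for full cross-validation while experimenting. A minimal sketch of that approach, reusing train and initial_features as defined above (the names tr, va, and baseline are introduced here for illustration):
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for a quick validation estimate
tr, va = train_test_split(train, test_size=0.2, random_state=0)
baseline = LinearRegression().fit(tr[initial_features], tr['FloodProbability'])
print(f"Hold-out R²: {r2_score(va['FloodProbability'], baseline.predict(va[initial_features])):.5f}")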
Target Variable: FloodProbability
Distribution
The target variable, FloodProbability, ranges from 0.285 to 0.725. All values are discrete and occur in increments of 0.005, suggesting that the target has been finely quantised, likely as a result of post-processing or model calibration.
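A quick check confirms the quantisation (a verification aid reusing the loaded train frame, not part of the original pipeline):
# Verify that every target value is a multiple of 0.005
vals = np.sort(train['FloodProbability'].unique())
print(f"min: {vals.min()}, max: {vals.max()}, distinct values: {len(vals)}")
print("All multiples of 0.005:", np.allclose(vals / 0.005, np.round(vals / 0.005)))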
To visualise the distribution, we plot a histogram with a bin width of 0.005, ensuring that each unique target value falls in its own bin. The resulting distribution is symmetric and closely resembles a normal distribution centred around 0.5, indicating a well-balanced target variable.
bin_edges = np.arange(0.2825, 0.7300, 0.005)
plt.figure(figsize=(8, 3))
plt.hist(train['FloodProbability'], bins=bin_edges, density=True, edgecolor='black', linewidth=0.05)
plt.title('Distribution of FloodProbability')
plt.xlabel('FloodProbability')
plt.ylabel('Density')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
Feature Distributions
All features in the dataset are integer-valued and exhibit right-skewed distributions, with the majority of values concentrated near the lower end of the range. Visual comparison of the training and test sets reveals that their feature distributions are virtually identical: the overlapping histograms show no meaningful divergence. If discrepancies existed, the bars representing each set (plotted in different colours) would be clearly distinguishable.
# Plot feature distributions for train vs test
fig, axs = plt.subplots(5, 4, figsize=(14, 12))  # Slightly wider for better spacing
for col, ax in zip(initial_features, axs.ravel()):
    # Relative frequency in training set
    train_dist = train[col].value_counts(normalize=True).sort_index()
    ax.bar(train_dist.index, train_dist.values, label='Train', alpha=1.0)
    # Relative frequency in test set
    test_dist = test[col].value_counts(normalize=True).sort_index()
    ax.bar(test_dist.index, test_dist.values, label='Test', alpha=0.6)
    ax.set_title(col)
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax.set_ylabel('Proportion')
# Add legend only once, outside the grid
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, loc='upper right', fontsize='medium')
plt.tight_layout()
plt.show()
Correlation
The correlation matrix reveals several interesting patterns:
- There is virtually no linear correlation among the features: the pairwise correlation coefficients are all extremely close to zero.
- However, each feature exhibits a noticeable correlation with the target variable FloodProbability, suggesting that while the features may be mutually uncorrelated, they individually contribute predictive information.

A subtle and intriguing detail emerges from the correlation heatmap: none of the feature-feature correlations display as exactly 0.0; instead, all appear as -0.0. While this might seem negligible, the consistent sign indicates that every pair of features is negatively correlated, albeit at an imperceptibly small scale. This observation hints at a form of weak statistical dependence across the feature set, despite the absence of meaningful linear correlation.
# Prepare the list of features including the target
corr_features = initial_features + ['FloodProbability']
# Compute correlation matrix
correlation_matrix = np.corrcoef(train[corr_features], rowvar=False)
# Plot heatmap
plt.figure(figsize=(11, 11))
sns.heatmap(
    correlation_matrix,
    center=0,
    cmap='coolwarm',
    annot=True,
    fmt='.1f',
    xticklabels=corr_features,
    yticklabels=corr_features,
    square=True,
    linewidths=0.5,
    cbar_kws={'shrink': 0.8}
)
plt.title('Correlation Matrix of Features and Target', fontsize=14)
plt.tight_layout()
plt.show()
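To confirm that the -0.0 entries reflect genuinely negative values rather than a rounding artifact of the heatmap's fmt='.1f', we can inspect the off-diagonal feature-feature correlations directly (a quick check reusing correlation_matrix from above):
# Off-diagonal feature-feature correlations (the target occupies the last row/column)
n = len(initial_features)
feat_corr = correlation_matrix[:n, :n]
off_diag = feat_corr[~np.eye(n, dtype=bool)]
print(f"min: {off_diag.min():.5f}, max: {off_diag.max():.5f}")
print("All negative:", (off_diag < 0).all())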
Dimensionality Reduction
Principal Component Analysis (PCA)
To assess the intrinsic dimensionality of the dataset, we apply Principal Component Analysis (PCA). The cumulative explained variance curve shows a gradual increase across components, indicating that the variance is distributed relatively evenly across all dimensions. This suggests that the dataset is effectively full-rank, and no small subset of principal components captures a dominant share of the variance.
In other words, linear dimensionality reduction would not be beneficial, as each feature contributes unique information that cannot be easily compressed without significant loss of signal.
# Fit PCA on the feature set
pca = PCA()
pca.fit(train[initial_features])
# Plot cumulative explained variance
plt.figure(figsize=(4, 3))
plt.plot(np.arange(1, len(initial_features) + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.title('Principal Component Analysis', fontsize=12)
plt.xlabel('Principal Component', fontsize=10)
plt.ylabel('Cumulative Explained Variance', fontsize=10)
plt.xticks(fontsize=9)
plt.yticks([0, 0.5, 1.0], fontsize=9)
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
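The near-linear cumulative curve can also be checked numerically: with 20 features, a perfectly uniform spread would give each component a share of 0.05 (a quick check reusing the fitted pca object):
# Per-component explained variance is close to the uniform share 1/20
evr = pca.explained_variance_ratio_
print(f"min: {evr.min():.4f}, max: {evr.max():.4f}, uniform share: {1 / len(evr):.4f}")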
UMAP
We apply Uniform Manifold Approximation and Projection (UMAP), first in its unsupervised form, to reduce the high-dimensional feature space to two dimensions. UMAP is particularly useful for visualising structure, identifying clusters, or uncovering non-linear patterns that may not be evident in the raw feature space.
However, in this case the 2D projection does not reveal any clear structure or separation with respect to the target variable, FloodProbability. The data appears uniformly scattered, suggesting that UMAP does not expose any obvious low-dimensional manifold in this dataset.
# Function to plot the embedding
def plot_embedding(embedding, target, title):
    plt.figure(figsize=(6, 5))
    plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        s=2,
        c=target,
        cmap='coolwarm',
        alpha=0.6
    )
    plt.gca().set_aspect('equal', 'datalim')
    plt.title(title, fontsize=14)
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    plt.colorbar(label='FloodProbability')
    plt.tight_layout()
    plt.show()
# Sample a subset for faster projection
train_sample = train.sample(10_000)
# Fit UMAP
reducer = umap.UMAP()
embedding = reducer.fit_transform(train_sample[initial_features])
# Plot the result
plot_embedding(
    embedding,
    train_sample['FloodProbability'],
    'Unsupervised UMAP Projection of the Training Dataset'
)
Supervised UMAP Projection
We also experiment with a supervised version of UMAP, which incorporates the target variable (FloodProbability) during the embedding process. This often enhances the separation of data based on the target and can reveal useful structure when relationships exist.
In this case, the resulting 2D projection does appear more organised visually. However, the structure largely reflects the quantised nature of the target, which consists of 83 discrete values. The apparent clustering is therefore an artifact of the target discretisation rather than any meaningful data-driven separation. As a result, supervised UMAP does not offer any additional insight beyond what was already known.
# Sample a smaller subset for faster computation
train_sample = train.sample(10_000, random_state=42)
# Fit Supervised UMAP using the regression target
reducer = umap.UMAP(
    n_neighbors=50,   # larger neighbourhood than the default (15), emphasising global structure
    min_dist=0.5,     # moderate compression
    target_metric='manhattan',
    target_weight=0.6,
)
# Perform the supervised projection
embedding = reducer.fit_transform(
    train_sample[initial_features],
    y=train_sample['FloodProbability']
)
# Plotting function
def plot_embedding(embedding, target, title):
    plt.figure(figsize=(6, 5))
    scatter = plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        s=2,
        c=target,
        cmap='coolwarm',
        alpha=0.7
    )
    plt.gca().set_aspect('equal', 'datalim')
    plt.title(title, fontsize=14)
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    cbar = plt.colorbar(scatter)
    cbar.set_label('FloodProbability')
    plt.tight_layout()
    plt.show()
# Visualise the result
plot_embedding(
    embedding,
    train_sample['FloodProbability'],
    'Supervised UMAP Projection of the Training Dataset'
)
t-distributed Stochastic Neighbor Embedding (t-SNE)
We also apply t-distributed Stochastic Neighbor Embedding (t-SNE) to project the high-dimensional feature space into two dimensions. t-SNE is a popular non-linear technique that is particularly effective at capturing local structure and revealing clusters.
In this case, however, the t-SNE projection does not uncover any meaningful patterns or structure in relation to the target variable, FloodProbability. The visualisation resembles random noise, indicating that t-SNE, like UMAP, fails to reveal a low-dimensional manifold in this dataset. Thus, t-SNE offers no additional insight for exploratory purposes.
# Sample a subset for computational efficiency
train_sample = train.sample(20_000, random_state=42)
# Fit and transform with t-SNE
reducer = TSNE(random_state=42, perplexity=30, max_iter=1000, learning_rate='auto')
embedding = reducer.fit_transform(train_sample[initial_features])
# Plot the result
plot_embedding(
    embedding,
    train_sample['FloodProbability'],
    '(Unsupervised) t-SNE Projection of the Training Dataset'
)
Cross-Validation Strategy
To ensure consistency and reproducibility across all model evaluations, we define a unified cross-validation function that standardises how models are trained and validated.
For efficiency during early experimentation, we evaluate performance using only the first fold of the five-fold cross-validation split. This approach significantly reduces computation time while providing a reasonable estimate of model performance. If higher precision is required later in the modelling pipeline, we can easily enable all five folds by adjusting a single flag.
# Configuration
kf = KFold(n_splits=5, shuffle=True, random_state=1)
SINGLE_FOLD = True          # Toggle for full vs. single-fold CV
COMPUTE_TEST_PRED = True    # Toggle to compute test set predictions

oof = defaultdict(lambda: np.full_like(train.FloodProbability, np.nan, dtype=float))
test_pred = {}

def cross_validate(model, label, features=initial_features, n_repeats=1):
    """
    Evaluate a model using cross-validation.

    Parameters:
    - model: scikit-learn compatible regressor
    - label: string identifier for the model
    - features: list of feature names to use
    - n_repeats: number of training repeats with different random seeds

    Outputs:
    - Stores out-of-fold predictions in `oof[label]`
    - Stores averaged test predictions in `test_pred[label]` if enabled
    """
    start_time = datetime.datetime.now()
    scores = []
    oof_preds = np.full_like(train.FloodProbability, np.nan, dtype=float)

    for fold, (idx_tr, idx_va) in enumerate(kf.split(train)):
        X_tr = train.iloc[idx_tr][features]
        X_va = train.iloc[idx_va][features]
        y_tr = train.iloc[idx_tr].FloodProbability
        y_va = train.iloc[idx_va].FloodProbability

        y_pred = np.zeros_like(y_va, dtype=float)
        for i in range(n_repeats):
            m = clone(model)
            if n_repeats > 1:
                # Re-seed the final estimator so each repeat differs
                mm = m
                if isinstance(mm, Pipeline):
                    mm = mm[-1]
                if isinstance(mm, TransformedTargetRegressor):
                    mm = mm.regressor
                mm.set_params(random_state=i)
            m.fit(X_tr, y_tr)
            y_pred += m.predict(X_va)
        y_pred /= n_repeats

        score = r2_score(y_va, y_pred)
        print(f"# Fold {fold}: R² = {score:.5f}")
        scores.append(score)
        oof_preds[idx_va] = y_pred
        if SINGLE_FOLD:
            break

    elapsed_time = datetime.datetime.now() - start_time
    avg_score = np.mean(scores)
    print(f"{Fore.GREEN}# Overall R²: {avg_score:.5f} | Model: {label}"
          f"{' (single fold)' if SINGLE_FOLD else ''} "
          f"| Time: {int(np.round(elapsed_time.total_seconds() / 60))} min{Style.RESET_ALL}")
    oof[label] = oof_preds

    # Optional: generate test set predictions, refitting on the full training set
    if COMPUTE_TEST_PRED:
        X_tr = train[features]
        y_tr = train.FloodProbability
        y_pred = np.zeros(len(test), dtype=float)
        for i in range(n_repeats):
            m = clone(model)
            if n_repeats > 1:
                mm = m
                if isinstance(mm, Pipeline):
                    mm = mm[-1]
                if isinstance(mm, TransformedTargetRegressor):
                    mm = mm.regressor
                mm.set_params(random_state=i)
            m.fit(X_tr, y_tr)
            y_pred += m.predict(test[features])
        y_pred /= n_repeats
        test_pred[label] = y_pred
Linear Models
We begin our modelling process with linear regression approaches, using various transformations to capture potential non-linearities in the data.
- A basic linear regression with standardised features provides a strong baseline.
- Adding polynomial features improves performance slightly, suggesting that some quadratic interactions among features are predictive.
- Applying spline transformations offers a more flexible form of non-linearity, but the performance gain is still modest.
These results indicate that while the relationship between features and the target is not purely linear, simple transformations are not sufficient to fully capture the underlying patterns.
# Standard Linear Regression
model = make_pipeline(
    StandardScaler(),
    LinearRegression()
)
cross_validate(model, 'LinearRegression')

# Polynomial Features + Ridge Regression
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge()
)
cross_validate(model, 'Poly-Ridge')

# Spline Transformation + Ridge Regression
model = make_pipeline(
    StandardScaler(),
    SplineTransformer(n_knots=5, degree=3),
    Ridge()
)
cross_validate(model, 'Spline-Ridge')
# Fold 0: R² = 0.84589
# Overall R²: 0.84589 | Model: LinearRegression (single fold) | Time: 0 min
# Fold 0: R² = 0.84642
# Overall R²: 0.84642 | Model: Poly-Ridge (single fold) | Time: 0 min
# Fold 0: R² = 0.84627
# Overall R²: 0.84627 | Model: Spline-Ridge (single fold) | Time: 0 min
Linear Regression with Statistical Inference
As an alternative to scikit-learn's LinearRegression, we use statsmodels, which provides detailed statistical diagnostics for linear models. This implementation allows us to inspect regression coefficients, along with their associated standard errors, t-statistics, and p-values.
This is particularly useful during exploratory analysis to:
- Assess the statistical significance of each feature,
- Interpret the magnitude and direction of feature effects,
- Check for potential multicollinearity or model instability.
While the model’s predictive power is not the focus here, the regression summary offers valuable insights into how individual features relate to the target variable.
# Add intercept term
X = sm.add_constant(train[initial_features])
# Fit OLS model and display summary
model = sm.OLS(train['FloodProbability'], X, missing='error')
results = model.fit()
results.summary()
Dep. Variable: | FloodProbability | R-squared: | 0.845 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.845 |
Method: | Least Squares | F-statistic: | 3.046e+05 |
Date: | Wed, 02 Apr 2025 | Prob (F-statistic): | 0.00 |
Time: | 00:32:02 | Log-Likelihood: | 2.7820e+06 |
No. Observations: | 1117957 | AIC: | -5.564e+06 |
Df Residuals: | 1117936 | BIC: | -5.564e+06 |
Df Model: | 20 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -0.0533 | 0.000 | -234.995 | 0.000 | -0.054 | -0.053 |
MonsoonIntensity | 0.0056 | 9.25e-06 | 606.734 | 0.000 | 0.006 | 0.006 |
TopographyDrainage | 0.0056 | 9.09e-06 | 621.525 | 0.000 | 0.006 | 0.006 |
RiverManagement | 0.0057 | 9.18e-06 | 617.178 | 0.000 | 0.006 | 0.006 |
Deforestation | 0.0057 | 9.27e-06 | 612.404 | 0.000 | 0.006 | 0.006 |
Urbanization | 0.0057 | 9.14e-06 | 619.319 | 0.000 | 0.006 | 0.006 |
ClimateChange | 0.0057 | 9.25e-06 | 612.437 | 0.000 | 0.006 | 0.006 |
DamsQuality | 0.0057 | 9.13e-06 | 619.170 | 0.000 | 0.006 | 0.006 |
Siltation | 0.0056 | 9.21e-06 | 612.284 | 0.000 | 0.006 | 0.006 |
AgriculturalPractices | 0.0056 | 9.2e-06 | 612.643 | 0.000 | 0.006 | 0.006 |
Encroachments | 0.0056 | 9.14e-06 | 618.374 | 0.000 | 0.006 | 0.006 |
IneffectiveDisasterPreparedness | 0.0056 | 9.16e-06 | 615.995 | 0.000 | 0.006 | 0.006 |
DrainageSystems | 0.0056 | 9.18e-06 | 613.641 | 0.000 | 0.006 | 0.006 |
CoastalVulnerability | 0.0057 | 9.11e-06 | 622.228 | 0.000 | 0.006 | 0.006 |
Landslides | 0.0056 | 9.15e-06 | 616.245 | 0.000 | 0.006 | 0.006 |
Watersheds | 0.0056 | 9.14e-06 | 617.853 | 0.000 | 0.006 | 0.006 |
DeterioratingInfrastructure | 0.0056 | 9.21e-06 | 609.647 | 0.000 | 0.006 | 0.006 |
PopulationScore | 0.0057 | 9.17e-06 | 618.914 | 0.000 | 0.006 | 0.006 |
WetlandLoss | 0.0056 | 9.2e-06 | 612.654 | 0.000 | 0.006 | 0.006 |
InadequatePlanning | 0.0056 | 9.14e-06 | 613.363 | 0.000 | 0.006 | 0.006 |
PoliticalFactors | 0.0056 | 9.1e-06 | 620.512 | 0.000 | 0.006 | 0.006 |
Omnibus: | 100155.250 | Durbin-Watson: | 2.000 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 148528.907 |
Skew: | 0.703 | Prob(JB): | 0.00 |
Kurtosis: | 4.100 | Cond. No. | 265. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Additive Feature Structure
An inspection of the linear regression coefficients reveals that all features have approximately the same coefficient, estimated at around 0.00565. This uniformity suggests that the model effectively reduces to a weighted sum of the features with a common weight, plus an intercept term.
The model can be approximated as:
$$ \hat{y} = \beta \cdot \sum_{i=1}^{d} x_i + \alpha $$
where:
- $\beta \approx 0.00565$,
- $\alpha \approx -0.053$,
- $x_i$ represents each feature,
- $d$ is the number of features.
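These constants can be read straight off the fitted statsmodels results, confirming how tightly the coefficients cluster around a common value:
# Confirm that all feature coefficients are nearly identical
coefs = results.params.drop('const')
print(f"coefficient mean: {coefs.mean():.5f}, std: {coefs.std():.6f}")
print(f"intercept: {results.params['const']:.4f}")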
This implies that the sum of all feature values is a strong linear predictor of the target. Using this simplified form, we can compute an approximate prediction for FloodProbability:
Insight:
This surprisingly effective approximation highlights a latent structure in the dataset: the total feature magnitude strongly correlates with flood probability, even without modeling feature-specific effects.
# Compute approximation using summed features
approx_prediction = train[initial_features].sum(axis=1) * 0.00565 - 0.053
# Evaluate goodness of fit
score = r2_score(train['FloodProbability'], approx_prediction)
print(f"R² Score (approximate model): {score:.5f}")
R² Score (approximate model): 0.84476
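Rather than hard-coding β and α, we can also estimate them by regressing the target on the row sum alone; a minimal sketch (the name fsum is introduced here for illustration):
# One-feature regression on the summed features recovers beta and alpha
fsum = train[initial_features].sum(axis=1).to_frame('fsum')
lr = LinearRegression().fit(fsum, train['FloodProbability'])
print(f"beta = {lr.coef_[0]:.5f}, alpha = {lr.intercept_:.4f}")
print(f"R² = {lr.score(fsum, train['FloodProbability']):.5f}")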