Case Study: Forecasting Streamflow (CAMELS)

In this section, I will present a forecasting example using a data set from CAMELS.

from pywddff.datasets import get_camels_subset

camels_subset = get_camels_subset()
camels_ids = list(camels_subset.keys())
df = camels_subset[camels_ids[0]]
df

	Q(ft3/s)	dayl(s)	prcp(mm/day)	srad(W/m2)	tmax(C)	tmin(C)	vp(Pa)
0	18.00	34905.61	0.00	175.32	9.00	2.46	720.00
1	15.00	34905.61	0.00	289.07	11.50	-2.51	519.54
2	13.00	34905.61	0.00	297.24	11.99	-3.00	480.00
3	13.00	34905.61	2.26	190.21	8.55	-0.50	600.00
4	33.00	34905.61	5.49	151.74	5.50	-2.00	520.00
...	...	...	...	...	...	...	...
11226	0.02	42508.80	0.00	302.88	29.16	18.00	2080.05
11227	0.47	42163.21	26.81	120.50	22.49	17.21	1953.55
11228	0.50	42163.21	14.10	207.48	23.97	14.50	1640.00
11229	0.69	42163.21	12.19	161.60	22.21	15.13	1730.51
11230	15.00	41817.61	40.76	164.12	23.21	16.00	1800.00

11231 rows × 7 columns

Let’s plot the target variable using an arbitrary time index.

import matplotlib.pyplot as plt

# Let's plot the data
df['Q(ft3/s)'].plot(kind='line')
plt.xlabel('Index')
plt.ylabel('Q (ft3/s)')
plt.title('Plot of Q (ft3/s)')
plt.show()

Matplotlib is building the font cache; this may take a moment.

_images/713202991ac0e3b18bed8b0fe430135450b5cbf5ad33316cf66c521d11f8851d.png

Add coefficient features

The choice of filter and J (decomposition level) is not trivial, and ideally we should use grid search over multiple filters and decomposition levels. Here, we will choose filter = "la8" and a decomposition level of J = 2.

from pywddff.pywddff import multi_stationary_dwt

X = df.iloc[:, 1:]
y = df.iloc[:, 0]

X_new, y_new = multi_stationary_dwt(X, y, 
                                    transform = 'atrousdwt', 
                                    filter = 'la8', 
                                    J = 2, 
                                    remove_bc = True, 
                                    approach = "single hybrid")
list(X_new)

['dayl(s)',
 'prcp(mm/day)',
 'srad(W/m2)',
 'tmax(C)',
 'tmin(C)',
 'vp(Pa)',
 'dayl(s)_W1',
 'dayl(s)_W2',
 'dayl(s)_V2',
 'prcp(mm/day)_W1',
 'prcp(mm/day)_W2',
 'prcp(mm/day)_V2',
 'srad(W/m2)_W1',
 'srad(W/m2)_W2',
 'srad(W/m2)_V2',
 'tmax(C)_W1',
 'tmax(C)_W2',
 'tmax(C)_V2',
 'tmin(C)_W1',
 'tmin(C)_W2',
 'tmin(C)_V2',
 'vp(Pa)_W1',
 'vp(Pa)_W2',
 'vp(Pa)_V2']

Leading the target variable

For the purpose of this case study, I will do 2 days ahead forecasting.

from pywddff.utils import prep_forecast_data

h = 2 # forecast horizon (2 days ahead in this case)

X_new, y_new = prep_forecast_data(X_new, y_new, h = h)

Data splitting

Here, I will split our data set into train, validation and test sets.

from pywddff.utils import absolute_split_3

X_train, X_val, X_test, y_train, y_val, y_test = absolute_split_3(X_new.to_numpy(), 
                                                                  y_new.to_numpy(), 
                                                                  nval = 2*365, 
                                                                  ntest = 365)

Correlation matrix

import seaborn as sns
import numpy as np
import pandas as pd

arr = np.column_stack((y_train, X_train))

# Create a pandas DataFrame from the array
arr_df = pd.DataFrame(arr, columns=['target'] + list(X_new))

# Compute the correlation matrix
corr = arr_df.corr()

# Generate a mask for the upper triangle (optional)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(9, 6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

_images/fbed4d7db0581443b65d64e5d35d2dec15bb84a4a1686b3a8caac8244859a0ff.png

Model comparison

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Assume that your arrays are loaded correctly

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "Support Vector Regression": SVR()
}

# Function to calculate Nash Sutcliffe Efficiency Coefficient
def nse(y_obs, y_pred):
    return 1 - sum((y_obs - y_pred) ** 2) / sum((y_obs - np.mean(y_obs)) ** 2)

# Apply feature scaling to the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Dictionary to store the validation scores of the models
validation_scores = {}

for name in models:
    model = models[name]
    model.fit(X_train_scaled, y_train)  # Train the model
    predictions = model.predict(X_val_scaled)  # Generate predictions on the validation set
    nse_score = nse(y_val, predictions)  # Compute NSE
    validation_scores[name] = nse_score

# Select the best model based on validation NSE
best_model_name = max(validation_scores, key=validation_scores.get)
best_model = models[best_model_name]

print(f"Best model based on validation NSE is: {best_model_name}")

# Compute the test NSE for the best model
test_predictions = best_model.predict(X_test_scaled)
nse_test = nse(y_test, test_predictions)

print(f"Test NSE for the best model is: {nse_test}")

# Create a bar plot of the validation NSEs
plt.figure(figsize=(7, 4))
plt.bar(validation_scores.keys(), validation_scores.values())
plt.title('Validation NSE for different models')
plt.xlabel('Model')
plt.ylabel('NSE')
plt.xticks(rotation=45)  # Rotate the x-axis labels for better readability
plt.tight_layout()  # Adjust subplot parameters to give specified padding
plt.show()

Best model based on validation NSE is: Support Vector Regression

Test NSE for the best model is: -0.014246825269703445

_images/bde110b175cc56ed7f0f58792cb1323df27e91a9e41c77f90ff83077a1f6af8e.png

All in all, terrible results, but that was not the point. Here you saw a basic ML forecasting example involving wavelet feature engineering using the pywddff package.