pywddff.utils
Module Contents
Functions
|
Load a pickle file. |
|
Inserts a specified number of zeros between each element in a 1D numpy array. |
|
Perform circular convolution. Note that signal and ker must have same shape. |
|
Creates a DataFrame (or a NumPy array) where each column is a lagged version of the input series. |
|
Creates a list of string names for original and lagged inputs. |
|
Creates a list of string names for original and lagged inputs, based on the original input names provided. |
|
Add lagged variables to a given input dataset X, and optionally adjust the target variable y to match the new structure. |
|
Determines the size of the test set based on the provided fraction. |
|
Determines the size of the validation and test sets based on the provided fractions. |
|
Splits the input dataset (X, y) into training and test sets based on an absolute number. |
|
Splits the input dataset (X, y) into training, validation, and test sets based on absolute numbers. |
|
Prepare an input feature set X and target y for forecasting by specifying the forecast horizon h. |
- pywddff.utils.load_pickle(filepath)[source]
Load a pickle file.
- Parameters:
filepath (str) – A string that indicates the path to the pickle file (including the pickle file itself).
- Returns:
Object that was stored in filepath.
- pywddff.utils.insert_zeros_between(x, j)[source]
Inserts a specified number of zeros between each element in a 1D numpy array. The first set of zeros are inserted between the first and second elements in x. No zeros are inserted after the last element in x.
- Parameters:
x (np.ndarray) – A 1D numpy array.
j (int) – Number of zeros to insert between elements of x.
- Returns:
A 1D numpy array.
- Return type:
np.ndarray
- pywddff.utils.circ_conv(signal, ker)[source]
Perform circular convolution. Note that signal and ker must have same shape. Reference: https://stackoverflow.com/questions/35474078/python-1d-array-circular-convolution
- Parameters:
signal (np.ndarray) – A 1D numpy array.
ker (np.ndarray) – A 1D numpy array.
- Returns:
A 1D numpy array.
- Return type:
np.ndarray
- pywddff.utils.add_lags(x, n_lags, pandas_output=False)[source]
Creates a DataFrame (or a NumPy array) where each column is a lagged version of the input series.
- Parameters:
x (array-like) – Input sequence of data points.
n_lags (int) – Number of lags to include in the output.
pandas_output (bool, optional) – If True, the output will be a pandas DataFrame. If False, the output will be a NumPy array. Defaults to False.
- Returns:
output – DataFrame (or NumPy array) with original series and its lagged versions. Each column corresponds to a lag (from 0 to n_lags). The output excludes rows where lagged data is not available due to shifting (NA values).
- Return type:
pandas.DataFrame or numpy.ndarray
Example
>>> add_lags([1, 2, 3, 4, 5], 2, True) 0 1 2 2 3 2 1 3 4 3 2 4 5 4 3
- pywddff.utils.make_lag_names(n_inputs, n_lags)[source]
Creates a list of string names for original and lagged inputs.
- Parameters:
n_inputs (int) – Number of original input variables.
n_lags (int) – Number of lags for each input variable.
- Returns:
out – List of names for the original and lagged input variables. Each original input variable is named as ‘Xn’, where n is the input number (1-indexed). Each lagged version of an input variable is named as ‘Xn_lag_m’, where n is the input number (1-indexed) and m is the lag number. The lag number for the original (unlagged) variables is dropped, so they are named just ‘Xn’.
- Return type:
list of str
Example
>>> make_lag_names(2, 3) ['X1', 'X1_lag_1', 'X1_lag_2', 'X1_lag_3', 'X2', 'X2_lag_1', 'X2_lag_2', 'X2_lag_3']
- pywddff.utils.make_lag_names_from_list(orig_input_names, n_lags)[source]
Creates a list of string names for original and lagged inputs, based on the original input names provided.
- Parameters:
orig_input_names (list of str) – List of original input variable names.
n_lags (int) – Number of lags for each input variable.
- Returns:
out – List of names for the original and lagged input variables. Each original input variable name is appended with ‘_lag_m’, where m is the lag number. The lag number for the original (unlagged) variables is dropped.
- Return type:
list of str
Example
>>> make_lag_names_from_list(['temp', 'humidity'], 3) ['temp', 'temp_lag_1', 'temp_lag_2', 'temp_lag_3', 'humidity', 'humidity_lag_1', 'humidity_lag_2', 'humidity_lag_3']
- pywddff.utils.add_lagged_variables(X, y=None, n_lags=1)[source]
Add lagged variables to a given input dataset X, and optionally adjust the target variable y to match the new structure.
- Parameters:
X (numpy.ndarray or pandas.DataFrame) – Input dataset with shape (n_samples, n_features). Each feature is transformed to include its lags.
y (numpy.ndarray or pandas.Series or pandas.DataFrame, optional) – Target variable with shape (n_samples,). If provided, it is adjusted to match the new structure of X. The first n_lags samples are dropped to match the size of X after adding the lagged variables. Defaults to None, in which case only X is processed and returned.
n_lags (int, optional) – Number of lags to add for each feature in X. Defaults to 1.
- Returns:
out (numpy.ndarray or pandas.DataFrame) – Transformed input dataset with added lagged variables. If X was a pandas DataFrame, out is also a pandas DataFrame, with column names adjusted to reflect the lags. If X was a numpy array, out is also a numpy array.
y (numpy.ndarray or pandas.Series, optional) – Adjusted target variable. Only returned if y was provided as input. If y was a pandas Series or DataFrame, the output y is a pandas Series. If y was a numpy array, the output y is also a numpy array.
Notes
This function requires the add_lags and make_lag_names_from_list functions to work.
- Raises:
AssertionError – If the shape of X is not (n_samples, n_features) with n_samples > n_features. If y is provided and its adjusted shape does not match the number of samples in the transformed X.
Example
>>> X = pd.DataFrame({'temp': [1, 2, 3, 4], 'humidity': [30, 40, 50, 60]}) >>> y = pd.Series([0, 1, 0, 1]) >>> add_lagged_variables(X, y, 2) ( temp temp_lag_1 temp_lag_2 humidity humidity_lag_1 humidity_lag_2 0 3.0 2.0 1.0 50.0 40.0 30.0 1 4.0 3.0 2.0 60.0 50.0 40.0, 0 0 1 1 Name: y, dtype: int64)
- pywddff.utils.test_size(X, test_frac=0.2)[source]
Determines the size of the test set based on the provided fraction.
- Parameters:
X (numpy.ndarray) – Input dataset with shape (n_samples, n_features).
test_frac (float, optional) – Fraction of the total samples to be used for the test set. Default is 0.2 (20% of total samples).
- Returns:
test_size – Number of samples in the test set.
- Return type:
int
- Raises:
AssertionError – If test_frac is greater or equal to 1, raising a ValueError.
Example
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]) >>> test_size(X, test_frac=0.3) 1
- pywddff.utils.val_test_sizes(X, val_frac=0.1, test_frac=0.2)[source]
Determines the size of the validation and test sets based on the provided fractions.
- Parameters:
X (numpy.ndarray) – Input dataset with shape (n_samples, n_features).
val_frac (float, optional) – Fraction of the total samples to be used for the validation set. Default is 0.1 (10% of total samples).
test_frac (float, optional) – Fraction of the total samples to be used for the test set. Default is 0.2 (20% of total samples).
- Returns:
val_size (int) – Number of samples in the validation set.
test_size (int) – Number of samples in the test set.
- Raises:
AssertionError – If the sum of test_frac and val_frac is greater or equal to 1, raising a ValueError.
Example
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]) >>> val_test_sizes(X, val_frac=0.2, test_frac=0.3) (1, 1)
- pywddff.utils.absolute_split_2(X, y, ntest)[source]
Splits the input dataset (X, y) into training and test sets based on an absolute number.
- Parameters:
X (numpy.ndarray) – Input dataset with shape (n_samples, n_features).
y (numpy.ndarray) – Target variable with shape (n_samples,).
ntest (int) – Number of samples to include in the test set.
- Returns:
X (numpy.ndarray) – Training input dataset.
X_test (numpy.ndarray) – Test input dataset.
y (numpy.ndarray) – Training target variable.
y_test (numpy.ndarray) – Test target variable.
- Raises:
AssertionError – If the input dimensions do not match the requirements. If ntest is not a positive integer. If the total number of samples is less than ntest. If the final training or test set do not match the expected sizes.
Example
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]]) >>> y = np.array([1, 0, 1, 0]) >>> absolute_split_2(X, y, 1) (array([[1, 2], [3, 4], [5, 6]]), array([[7, 8]]), array([1, 0, 1]), array([0]))
- pywddff.utils.absolute_split_3(X, y, nval, ntest)[source]
Splits the input dataset (X, y) into training, validation, and test sets based on absolute numbers.
- Parameters:
X (numpy.ndarray) – Input dataset with shape (n_samples, n_features).
y (numpy.ndarray) – Target variable with shape (n_samples,).
nval (int) – Number of samples to include in the validation set.
ntest (int) – Number of samples to include in the test set.
- Returns:
X (numpy.ndarray) – Training input dataset.
X_val (numpy.ndarray) – Validation input dataset.
X_test (numpy.ndarray) – Test input dataset.
y (numpy.ndarray) – Training target variable.
y_val (numpy.ndarray) – Validation target variable.
y_test (numpy.ndarray) – Test target variable.
- Raises:
AssertionError – If the input dimensions do not match the requirements. If nval or ntest are not positive integers. If the total number of samples is less than nval + ntest. If the final training, validation, or test sets do not match the expected sizes.
Example
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]) >>> y = np.array([1, 0, 1, 0, 1]) >>> absolute_split_3(X, y, 2, 1) (array([[1, 2], [3, 4]]), array([[5, 6], [7, 8]]), array([[ 9, 10]]), array([1, 0]), array([1, 0]), array([1]))
- pywddff.utils.prep_forecast_data(X, y, h, auto_regress_y=False)[source]
Prepare an input feature set X and target y for forecasting by specifying the forecast horizon h. The output of this function is a tuple with input features and target such that each row of input features maps to a future observation of the target. This setup allows cross validation to be used when evaluating machine learning models.
- Parameters:
X (np.ndarray or pd.DataFrame) – A 2D numpy array or pandas data frame.
y (np.ndarray, pd.Series, or pd.DataFrame) – A 1D numpy array, pandas series or pandas data frame.
h (int) – Forecast horizon.
auto_regress_y (bool) – Whether the target should be included as an auto-regressive feature (to exploit autocorrelations present in the target variable).
- Returns:
- if auto_regress = False (the default):
First element is a 2D numpy array with X.shape[1] columns. If X was given as a pandas data frame, the output will be a pandas data frame. The number of rows will be h less than X.shape[0] of the originally provided X.
Second element is a 1D array corresponding to the target y provided by the user. If X was given as a pandas data frame, the output will be a pandas series with name “y”. The number of values will be h less than y.shape[0] of the originally provided y.
- if auto_regress = True:
First element is a 2D numpy array with X.shape[1]+1 columns. The first column will contain the auto-regressive target feature (essentially a lagged version of the target). If X was given as a pandas data frame, the output will be a pandas data frame. The number of rows will be h less than X.shape[0] of the originally provided X.
Second element is a 1D array corresponding to the target y provided by the user. If X was given as a pandas data frame, the output will be a pandas series with name “y”. The number of values will be h less than y.shape[0] of the originally provided y.
- Return type:
tuple