Training

class training.validator.StructureValidation(parameters)

Parameters structure validation

Validate the input parameters and raise an Exception if the structure of the parameters is invalid

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
Raises: ValueError or TypeError – if the structure of the parameters is invalid.
__init__(parameters)
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
features_validator()

Features validator

All datasets given inside the parameters object should have the same features.

validate_data()

Data key validator

The following assumptions have to be met (see the sketch after this list):

  1. There is at least one dataset inside the data, and it should have the key name: train
  2. All provided datasets should have two elements: features and target
  3. The value of the key features is a pandas dataframe. The target/label should not be among the features
  4. The target is a numpy array
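
For illustration only, a minimal sketch of a data entry that satisfies these assumptions (the values are made-up placeholders):
>>> import numpy as np
>>> import pandas as pd
>>> train_dataframe = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.1, 0.9]})  # features only, no target column
>>> train_target = np.array([10.0, 20.0, 30.0])  # the target is a numpy array
>>> data = {
>>>      "train": {"features": train_dataframe, "target": train_target},  # the train key is mandatory
>>>  }
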
validate_main_keys()

Main keys validator

The user has to provide at least these four keys inside the parameters dictionary: data, split, model and metrics.

validate_metrics()

Metrics key validator

The following assumptions should be met:

  1. The value of the metrics should be a list.
  2. Currently, there are only two regression metrics: r2_score and mean_squared_error, and two classification metrics: accuracy_score and roc_auc_score
validate_model()

Model key validator

The following assumptions should be met:

  1. The elements type and hyperparameters should be found inside the values of the key model.
  2. The value of hyperparameters should be a dictionary.
validate_predict()

Predict key validator

The following assumptions should be met:

  1. If the predict key exists, all of the datasets should have the key features
  2. The value of the features is a pandas dataframe
  3. The datasets inside the key predict have no target or labels; the target will be predicted for those datasets.

validate_split()

Split key validator

The following assumptions should be met (see the sketch after this list):

  1. There are two elements inside the split key: method and split_ratios or fold_nr
  2. If the value of the method element is split, the second element should be split_ratios
  3. If the value of the method element is kfold, the second element should be fold_nr
  4. The split_ratios can be either a float or a set/list of two floats.
  5. The split_ratios values should be in ]0, 1[
  6. The value of the fold_nr should be an integer larger than 1
  7. The method can take only two values: split or kfold
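
For illustration, a hedged sketch of split entries that satisfy these assumptions (the concrete numbers are arbitrary):
>>> split_train_test = {"method": "split", "split_ratios": 0.2}               # one split: train and test
>>> split_train_valid_test = {"method": "split", "split_ratios": [0.2, 0.2]}  # two splits: train, valid and test
>>> split_kfold = {"method": "kfold", "fold_nr": 5}                           # K-Fold cross-validation
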
training.validator.parameters_validator(parameters)

Parameters structure validator

Apply all validation methods defined inside the class StructureValidation

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
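
A minimal usage sketch, assuming a parameters dictionary built as in the model_training examples below; per the class documentation, an invalid structure raises a ValueError or TypeError:
>>> from training.validator import parameters_validator
>>> try:
>>>     parameters_validator(parameters)  # parameters assembled as in the examples below
>>> except (ValueError, TypeError) as error:
>>>     print(f"invalid parameters structure: {error}")
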
training.training.train_with_n_split(test_split_ratios: list, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type: str, required_metrics: list)

n split training

This function trains a model to fit the data using n-split cross-validation, e.g. train and test, or train, valid and test.

Parameters:
  • test_split_ratios (list) – A list that contains the test split ratio e.g. [0.2] for testing size/training size or [0.2, 0.2] for validation size/training size and testing size/(training size - validation size)
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into multiple sub-datasets based on the split values.
  • target (np.array) – The values of the target that will be split into multiple sub-datasets based on the split values.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • model_type (str) – The type of model that will be used to fit the data. Currently there are two values: Ridge linear regression and lightgbm.
  • required_metrics (list) – A list of the metrics that will be used to evaluate the trained model.
Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • save_models_dir - The name of the directory where the trained models are saved locally.

training.training.train_with_kfold_cross_validation(split: dict, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type, required_metrics: list)

K-Fold cross-validation training

This function trains a model to fit the data using K-Fold cross-validation.

Parameters:
  • split (dict) – A dictionary that contains information about the K-Fold variables
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • model_type (str) – The type of model that will be used to fit the data.
  • required_metrics (list) – A list of the metrics that will be used to evaluate the trained model.
Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • save_models_dir - The name of the directory where the trained models are saved locally.

training.training.model_training(parameters: dict)

Model training

This function trains a model to fit the data using the Scikit-Learn implementation of the Ridge linear model.

Parameters:

parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.

Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training.
  • save_models_dir - The name of the directory where the trained models are saved locally.

Example:
One split: train and test
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "split",
>>>          "split_ratios": 0.2,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
Two splits: train, valid and test
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "split",
>>>          "split_ratios": (0.2, 0.2), # or [0.2, 0.2]
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
KFold cross-validation:
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "kfold",
>>>          "fold_nr": 5,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
KFold cross-validation with alpha optimization:
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "kfold",
>>>          "fold_nr": 5,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": "optimize",
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
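
A hedged call sketch using one of the parameters dictionaries above; the unpacking order is assumed to follow the Returns list:
>>> from training.training import model_training
>>> models_nr, save_models_dir = model_training(parameters)
>>> print(models_nr)        # indexes that point to the locally saved models
>>> print(save_models_dir)  # directory where the trained models were saved
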
training.optimizer.training_for_optimizing(alpha_i: float, x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array, help_text: str) → float

Trainer

This function trains the ridge linear regression model given a certain alpha.

Parameters:
  • alpha_i (float) – A hyperparameter which is used by the ridge linear regression to avoid over-fitting.
  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
  • help_text (str) – A string to show useful information about the training cross-validation method
Returns:

  • r2_linear - Coefficient of determination for a given alpha and testing dataset

training.optimizer.get_best_alpha_split(x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array) → float

alpha optimizer for two datasets split

This function finds the best alpha value based on the coefficient of determination. This function will be replaced by a native optimization method from other packages

Parameters:
  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
Returns:

best_alpha: The alpha value that maximizes the coefficient of determination R2
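
The documentation does not describe the search strategy. Purely as an illustration of the idea, and not the package's actual implementation, a simple grid search over candidate alphas that maximizes R2 could look like this (grid_search_best_alpha and the alpha grid are hypothetical):
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import r2_score
>>> def grid_search_best_alpha(x_train, y_train, x_test, y_test, alphas=(0.01, 0.1, 1.0, 10.0, 100.0)):
>>>     # hypothetical helper: fit a Ridge model per candidate alpha and keep the one with the best R2
>>>     scores = {}
>>>     for alpha_i in alphas:
>>>         model = Ridge(alpha=alpha_i).fit(x_train, y_train)
>>>         scores[alpha_i] = r2_score(y_test, model.predict(x_test))
>>>     return max(scores, key=scores.get)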

training.optimizer.get_best_alpha_kfold(kfold, train_array: numpy.array, target: numpy.array)

alpha optimizer for K-Fold cross validation

Parameters:
  • kfold – The K-Fold cross-validation object used to split the data.
  • train_array – The values of the features which are used to train the model to predict the target target
  • target – The target values which are used to train the model.
Returns:

best_alpha: The alpha value that maximizes the coefficient of determination R2

training.xgboost_train.xgboost_data_preparation(validation_list: list, dataframe: pandas.core.frame.DataFrame, target: numpy.array, key: str)

xgboost data preparation for training

The function transforms the data from a Pandas dataframe format to a xgboost-compatible format.

Parameters:
  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
  • target (np.array) – An array that contains the target that should be predicted by the xgboost model
  • key (str) – A label that is used to name the dataset in the validation_list
Returns:

  • The updated validation_list
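
The xgboost-compatible format is typically an xgboost DMatrix; a hedged sketch of the conversion step with placeholder data (the exact layout of validation_list inside the package is an assumption):
>>> import numpy as np
>>> import pandas as pd
>>> import xgboost as xgb
>>> dataframe = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.2, 0.4, 0.6]})
>>> target = np.array([0.5, 1.5, 2.5])
>>> dmatrix = xgb.DMatrix(dataframe, label=target)  # xgboost-compatible data container
>>> validation_list = [(dmatrix, "train")]          # (dataset, label) pairs, here labeled by key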

training.xgboost_train.xgboost_regression_train(validation_list: list, hyperparameters: dict, num_round: int = 10)

xgboost trainer

The function uses the xgboost framework to train the model

Parameters:
  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • xgboost model
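
A hedged sketch of boosting with an evaluation list using the xgboost API; the hyperparameters shown are arbitrary, and this is not necessarily how the package wires the call internally:
>>> import xgboost as xgb
>>> hyperparameters = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 4}
>>> booster = xgb.train(
>>>     params=hyperparameters,
>>>     dtrain=dmatrix,          # DMatrix from the preparation step above
>>>     num_boost_round=10,      # corresponds to num_round
>>>     evals=validation_list,   # list of (DMatrix, name) pairs to monitor during training
>>> )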

training.xgboost_train.xgboost_data_preparation_to_predict(dataframe: pandas.core.frame.DataFrame)

xgboost data preparation for prediction

The function transforms the data from a Pandas dataframe format to a xgboost-compatible format

Parameters: dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
Returns:
  • The dataset in xgboost-compatible format
training.xgboost_train.training_xgboost_n_split(sub_datasets: dict, hyperparameters: dict, num_round: int = 10)

XGboost training with n-split

This function trains a model to fit the data using n-split cross-validation, e.g. train and test, or train, valid and test.

Parameters:
  • sub_datasets (dict) – A dictionary that contains the train and test sub-datasets (see split_dataset below).
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • model: xgboost model.
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.

training.xgboost_train.training_xgboost_kfold(train_array, target, train: list, test: list, hyperparameters: dict, num_round: int = 10)

XGboost training with kfold

This function trains a model to fit the data using K-Fold cross-validation.

Parameters:
  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • train (list) – A list of integers that define the training dataset
  • test (list) – A list of integers that define the testing dataset
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • kfold_model: xgboost model
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.

training.xgboost_train.get_num_round(hyperparameters) → int

num_round getter

Get the value of num_round that will be used to train the xgboost model

Parameters: hyperparameters – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
Returns:
  • num_round: The number of rounds for boosting
training.xgboost_train.xgboost_data_preparation_for_evaluation(data: dict)

Data preparation for evaluation

Prepare the data in a form that could be used for model evaluation.

Parameters: data (dict) – A dictionary that contains the datasets that should be prepared for model evaluation.
training.model_evaluator.load_all_models(save_models_dir: str, model_type: str, model_i: int)

Model loader

Load saved models of a given type.

Parameters:
  • save_models_dir (str) – directory where the model is saved
  • model_type (str) – regression or classification
  • model_i (int) – index used to distinguish models of the same type trained on different datasets.

training.model_evaluator.evaluate_model(model, xs: list, ys: list, labels: list, metrics: list)

Model evaluator

This function shows the values of the metrics R2 and MSE for different datasets when evaluating the trained model.

Parameters:
  • model – An object created by the training package e.g. Scikit Learn.
  • xs (list) – Every element is a np.array of the features that are used to predict the target variable.
  • ys (list) – Every element is a np.array of the target variable.
  • labels (list) – Every element is a string that is used to label every (x,y) pair and refers to their origin.
  • metrics (list) – A list of metrics used to evaluate the model.
Returns:

metrics_summary (dict): all metrics from metrics applied to all (y, y_pred=model(x)) pairs.
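
A hedged usage sketch with a toy Scikit-Learn model and synthetic data; the metric names follow the metrics key documented above, and whether the function expects strings or callables is an assumption:
>>> import numpy as np
>>> from sklearn.linear_model import Ridge
>>> from training.model_evaluator import evaluate_model
>>> rng = np.random.default_rng(0)
>>> x_train, x_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
>>> coefficients = np.array([1.0, -2.0, 0.5])
>>> y_train, y_test = x_train @ coefficients, x_test @ coefficients
>>> model = Ridge(alpha=1.0).fit(x_train, y_train)
>>> metrics_summary = evaluate_model(
>>>     model,
>>>     xs=[x_train, x_test],
>>>     ys=[y_train, y_test],
>>>     labels=["train", "test"],
>>>     metrics=["r2_score", "mean_squared_error"],
>>> )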

training.utils.read_kfold_config(split: dict)

KFold values reader

This function ensures that the parameters of the KFold splitting method are defined.

Parameters: split (dict) – A dictionary that contains the parameters of the KFold splitting method.
Returns:
  • n_fold - An integer that refers to the number of folds which will be used for cross-validation.
  • shuffle - A boolean. If true, data will be shuffled before splitting it to multiple folds.
  • random_state - An integer which helps to reproduce the results.
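
A hedged sketch of how the returned values could drive a Scikit-Learn splitter; the shuffle and random_state keys inside split, and the unpacking order, are assumptions based on the Returns list above:
>>> from sklearn.model_selection import KFold
>>> from training.utils import read_kfold_config
>>> split = {"method": "kfold", "fold_nr": 5, "shuffle": True, "random_state": 0}  # shuffle/random_state keys assumed
>>> n_fold, shuffle, random_state = read_kfold_config(split)
>>> kfold = KFold(n_splits=n_fold, shuffle=shuffle, random_state=random_state)
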
training.utils.create_model_directory(path: str)

Model directory creator

This function creates a directory where the model will be saved during and after training.

Parameters: path (str) – It refers to the location where the models should be saved.
training.utils.save_model_locally(path: str, model: object)

Model saver

This function saves the model locally in pickle format.

Parameters:
  • path (str) – It refers to the location where the models should be saved.
  • model (object) – An object created by the training package e.g. Scikit Learn.
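
Since the model is stored in pickle format, an equivalent hedged sketch of the idea (the directory and file name are hypothetical, not the package's naming scheme):
>>> import os
>>> import pickle
>>> from sklearn.linear_model import Ridge
>>> model = Ridge(alpha=1.0)
>>> os.makedirs("models", exist_ok=True)  # mirrors create_model_directory above
>>> with open(os.path.join("models", "ridge_model.pkl"), "wb") as handle:
>>>     pickle.dump(model, handle)        # serialize the model object to disk
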
training.utils.input_parameters_extraction(parameters: dict)

Input data parsing

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. See the examples under model_training above.
Returns:
  • data - A dictionary that contains pandas dataframes as datasets.
  • split - A dictionary that contains information about the cross-validation method.
  • train_array - A numpy array that is used to train the model and predict the target.
  • target - A numpy array that is used to train the model.
  • predict - If provided, a pandas dataframe that contains the features without the labels (target); otherwise the boolean False
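
A hedged call sketch, assuming the unpacking order follows the Returns list above and a parameters dictionary built as in the model_training examples:
>>> from training.utils import input_parameters_extraction
>>> data, split, train_array, target, predict = input_parameters_extraction(parameters)
>>> if predict is not False:
>>>     print("datasets to predict were provided")
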
training.utils.split_dataset(features: numpy.array, target: numpy.array, tests_split_ratios: Union[list, set], stratify: bool = False) → dict

Dataset splitter

This function splits a dataset into multiple datasets such as train, valid and test.

Parameters:
  • features (np.array) – The original dataset that should be split into subsets
  • target (np.array) – The original target/labels dataset that should be predicted
  • tests_split_ratios (Union[list, set]) – A list or set of floats that represent the ratio of the size of the test dataset to the train dataset. The values should be in the range ]0, 1[ e.g. tests_split_ratios = [0.2, 0.2]
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
Returns:

sub_datasets: A dictionary that contains the test and train dataset

Return type:

dict
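
A hedged usage sketch with synthetic data, following the documented signature; the exact keys of the returned dictionary beyond train and test are not spelled out here:
>>> import numpy as np
>>> from training.utils import split_dataset
>>> features = np.random.rand(100, 4)
>>> target = np.random.rand(100)
>>> sub_datasets = split_dataset(features, target, tests_split_ratios=[0.2, 0.2], stratify=False)
>>> print(sub_datasets.keys())  # expected to contain at least the train and test subsets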