Training

class training.validator.StructureValidation(parameters)

Parameters structure validation

Validate the input parameters and raise an Exception if the structure of the parameters is invalid

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
Raises: ValueError or TypeError – if the structure of the parameters is invalid.
__init__(parameters)
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
features_validator()

Features validator

All datasets given inside the parameters object should have the same features.

validate_data()

Data key validator

The following assumptions have to be met (see the sketch after this list):

  1. There is at least one dataset inside the data, and it should have the key name: train
  2. All provided datasets should have two elements: features and target
  3. The value of the key features is a pandas dataframe. The target/label should not be among the features
  4. The target is a numpy array
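
For illustration only, a minimal sketch of a data entry that satisfies these assumptions (the values are made-up placeholders):
>>> import numpy as np
>>> import pandas as pd
>>> train_dataframe = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.1, 0.9]})  # features only, no target column
>>> train_target = np.array([10.0, 20.0, 30.0])  # the target is a numpy array
>>> data = {
>>>      "train": {"features": train_dataframe, "target": train_target},  # the train key is mandatory
>>>  }
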
validate_main_keys()

Main keys validator

The user has to provide at least these four keys inside the parameters dictionary: data, split, model and metrics.

validate_metrics()

Metrics key validator

The following assumptions should be met:

  1. The value of the metrics should be a list.
  2. Currently, there are only two regression metrics: r2_score and mean_squared_error, and two classification metrics: accuracy_score and roc_auc_score
validate_model()

Model key validator

The following assumptions should be met:

  1. The elements type and hyperparameters should be found inside the values of the key model.
  2. The value of hyperparameters should be a dictionary.
validate_predict()

Predict key validator

The following assumptions should be met:

  1. If the predict key exists, all of the datasets should have the key features
  2. The value of the features is a pandas dataframe
  3. The datasets inside the key predict have no target or labels; the target will be predicted for those datasets.

validate_split()

Split key validator

The following assumptions should be met (see the sketch after this list):

  1. There are two elements inside the split key: method and split_ratios or fold_nr
  2. If the value of the method element is split, the second element should be split_ratios
  3. If the value of the method element is kfold, the second element should be fold_nr
  4. The split_ratios can be either a float or a set/list of two floats.
  5. The split_ratios values should be in ]0, 1[
  6. The value of the fold_nr should be an integer larger than 1
  7. The method can take only two values: split or kfold
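
For illustration, a hedged sketch of split entries that satisfy these assumptions (the concrete numbers are arbitrary):
>>> split_train_test = {"method": "split", "split_ratios": 0.2}               # one split: train and test
>>> split_train_valid_test = {"method": "split", "split_ratios": [0.2, 0.2]}  # two splits: train, valid and test
>>> split_kfold = {"method": "kfold", "fold_nr": 5}                           # K-Fold cross-validation
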
training.validator.parameters_validator(parameters)

Parameters structure validator

Apply all validation methods defined inside the class StructureValidation

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
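
A minimal usage sketch, assuming a parameters dictionary built as in the model_training examples below; per the class documentation, an invalid structure raises a ValueError or TypeError:
>>> from training.validator import parameters_validator
>>> try:
>>>     parameters_validator(parameters)  # parameters assembled as in the examples below
>>> except (ValueError, TypeError) as error:
>>>     print(f"invalid parameters structure: {error}")
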
training.training.train_with_n_split(test_split_ratios: list, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type: str, required_metrics: list)

n split training

This function trains a model to fit the data using n-split cross-validation, e.g. train and test, or train, valid and test.

Parameters:
  • test_split_ratios (list) – A list that contains the test split ratio e.g. [0.2] for testing size/training size or [0.2, 0.2] for validation size/training size and testing size/(training size - validation size)
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into multiple sub-datasets based on the split values.
  • target (np.array) – The values of the target that will be split into multiple sub-datasets based on the split values.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • model_type (str) – The type of model that will be used to fit the data. Currently there are two values: Ridge linear regression and lightgbm.
  • required_metrics (list) – A list of the metrics that will be used to evaluate the trained model.
Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • save_models_dir - The name of the directory where the trained models are saved locally.

training.training.train_with_kfold_cross_validation(split: dict, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type, required_metrics: list)

K-Fold cross-validation training

This function trains a model to fit the data using K-Fold cross-validation.

Parameters:
  • split (dict) – A dictionary that contains information about the K-Fold variables
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • model_type (str) – The type of model that will be used to fit the data.
  • required_metrics (list) – A list of the metrics that will be used to evaluate the trained model.
Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • save_models_dir - The name of the directory where the trained models are saved locally.

training.training.model_training(parameters: dict)

Model training

This function trains a model to fit the data using the Scikit-Learn implementation of the Ridge linear model.

Parameters:

parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.

Returns:

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training.
  • save_models_dir - The name of the directory where the trained models are saved locally.

Example:
One split: train and test
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "split",
>>>          "split_ratios": 0.2,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
Two splits: train, valid and test
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "split",
>>>          "split_ratios": (0.2, 0.2), # or [0.2, 0.2]
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
KFold cross-validation:
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "kfold",
>>>          "fold_nr": 5,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
KFold cross-validation with alpha optimization:
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "kfold",
>>>          "fold_nr": 5,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": "optimize",
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
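
A hedged call sketch using one of the parameters dictionaries above; the unpacking order is assumed to follow the Returns list:
>>> from training.training import model_training
>>> models_nr, save_models_dir = model_training(parameters)
>>> print(models_nr)        # indexes that point to the locally saved models
>>> print(save_models_dir)  # directory where the trained models were saved
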
training.optimizer.training_for_optimizing(alpha_i: float, x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array, help_text: str) → float

Trainer

This function trains the ridge linear regression model given a certain alpha.

Parameters:
  • alpha_i (float) – A hyperparameter which is used by the ridge linear regression to avoid over-fitting.
  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
  • help_text (str) – A string to show useful information about the training cross-validation method
Returns:

  • r2_linear - Coefficient of determination for a given alpha and testing dataset

training.optimizer.get_best_alpha_split(x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array) → float

alpha optimizer for two datasets split

This function finds the best alpha value based on the coefficient of determination. This function will be replaced by a native optimization method from other packages

Parameters:
  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
Returns:

best_alpha: The alpha value that maximizes the coefficient of determination R2
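
The documentation does not describe the search strategy. Purely as an illustration of the idea, and not the package's actual implementation, a simple grid search over candidate alphas that maximizes R2 could look like this (grid_search_best_alpha and the alpha grid are hypothetical):
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import r2_score
>>> def grid_search_best_alpha(x_train, y_train, x_test, y_test, alphas=(0.01, 0.1, 1.0, 10.0, 100.0)):
>>>     # hypothetical helper: fit a Ridge model per candidate alpha and keep the one with the best R2
>>>     scores = {}
>>>     for alpha_i in alphas:
>>>         model = Ridge(alpha=alpha_i).fit(x_train, y_train)
>>>         scores[alpha_i] = r2_score(y_test, model.predict(x_test))
>>>     return max(scores, key=scores.get)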

training.optimizer.get_best_alpha_kfold(kfold, train_array: numpy.array, target: numpy.array)

alpha optimizer for K-Fold cross validation

Parameters:
  • kfold – The K-Fold cross-validation object used to split the data.
  • train_array – The values of the features which are used to train the model to predict the target target
  • target – The target values which are used to train the model.
Returns:

best_alpha: The alpha value that maximizes the coefficient of determination R2

training.xgboost_train.xgboost_data_preparation(validation_list: list, dataframe: pandas.core.frame.DataFrame, target: numpy.array, key: str)

xgboost data preparation for training

The function transforms the data from a Pandas dataframe format to a xgboost-compatible format.

Parameters:
  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
  • target (np.array) – An array that contains the target that should be predicted by the xgboost model
  • key (str) – A label that is used to name the dataset in the validation_list
Returns:

  • The updated validation_list
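
The xgboost-compatible format is typically an xgboost DMatrix; a hedged sketch of the conversion step with placeholder data (the exact layout of validation_list inside the package is an assumption):
>>> import numpy as np
>>> import pandas as pd
>>> import xgboost as xgb
>>> dataframe = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.2, 0.4, 0.6]})
>>> target = np.array([0.5, 1.5, 2.5])
>>> dmatrix = xgb.DMatrix(dataframe, label=target)  # xgboost-compatible data container
>>> validation_list = [(dmatrix, "train")]          # (dataset, label) pairs, here labeled by key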

training.xgboost_train.xgboost_regression_train(validation_list: list, hyperparameters: dict, num_round: int = 10)

xgboost trainer

The function uses the xgboost framework to train the model

Parameters:
  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • xgboost model
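
A hedged sketch of boosting with an evaluation list using the xgboost API; the hyperparameters shown are arbitrary, and this is not necessarily how the package wires the call internally:
>>> import xgboost as xgb
>>> hyperparameters = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 4}
>>> booster = xgb.train(
>>>     params=hyperparameters,
>>>     dtrain=dmatrix,          # DMatrix from the preparation step above
>>>     num_boost_round=10,      # corresponds to num_round
>>>     evals=validation_list,   # list of (DMatrix, name) pairs to monitor during training
>>> )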

training.xgboost_train.xgboost_data_preparation_to_predict(dataframe: pandas.core.frame.DataFrame)

xgboost data preparation for prediction

The function transforms the data from a Pandas dataframe format to a xgboost-compatible format

Parameters: dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
Returns:
  • The dataset in xgboost-compatible format
training.xgboost_train.training_xgboost_n_split(sub_datasets: dict, hyperparameters: dict, num_round: int = 10)

XGboost training with n-split

This function trains a model to fit the data using n-split cross-validation, e.g. train and test, or train, valid and test.

Parameters:
  • sub_datasets (dict) – A dictionary that contains the train and test sub-datasets (see split_dataset below).
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • model: xgboost model.
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.

training.xgboost_train.training_xgboost_kfold(train_array, target, train: list, test: list, hyperparameters: dict, num_round: int = 10)

XGboost training with kfold

This function trains a model to fit the data using K-Fold cross-validation.

Parameters:
  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • train (list) – A list of integers that define the training dataset
  • test (list) – A list of integers that define the testing dataset
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting
Returns:

  • kfold_model: xgboost model
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.

training.xgboost_train.get_num_round(hyperparameters) → int

num_round getter

Get the value of num_round that will be used to train the xgboost model

Parameters: hyperparameters – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
Returns:
  • num_round: The number of rounds for boosting
training.xgboost_train.xgboost_data_preparation_for_evaluation(data: dict)

Data preparation for evaluation

Prepare the data in a form that could be used for model evaluation.

Parameters: data (dict) – A dictionary that contains the datasets that should be prepared for model evaluation.
training.model_evaluator.load_all_models(save_models_dir: str, model_type: str, model_i: int)

Model loader

Load saved models of a given type.

Parameters:
  • save_models_dir (str) – directory where the model is saved
  • model_type (str) – regression or classification
  • model_i (int) – index used to distinguish models of the same type trained on different datasets.

training.model_evaluator.evaluate_model(model, xs: list, ys: list, labels: list, metrics: list)

Model evaluator

This function shows the values of the metrics R2 and MSE for different datasets when evaluating the trained model.

Parameters:
  • model – An object created by the training package e.g. Scikit Learn.
  • xs (list) – Every element is a np.array of the features that are used to predict the target variable.
  • ys (list) – Every element is a np.array of the target variable.
  • labels (list) – Every element is a string that is used to label every (x,y) pair and refers to their origin.
  • metrics (list) – A list of metrics used to evaluate the model.
Returns:

metrics_summary (dict): all metrics from metrics applied to all (y, y_pred=model(x)) pairs.
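
A hedged usage sketch with a toy Scikit-Learn model and synthetic data; the metric names follow the metrics key documented above, and whether the function expects strings or callables is an assumption:
>>> import numpy as np
>>> from sklearn.linear_model import Ridge
>>> from training.model_evaluator import evaluate_model
>>> rng = np.random.default_rng(0)
>>> x_train, x_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
>>> coefficients = np.array([1.0, -2.0, 0.5])
>>> y_train, y_test = x_train @ coefficients, x_test @ coefficients
>>> model = Ridge(alpha=1.0).fit(x_train, y_train)
>>> metrics_summary = evaluate_model(
>>>     model,
>>>     xs=[x_train, x_test],
>>>     ys=[y_train, y_test],
>>>     labels=["train", "test"],
>>>     metrics=["r2_score", "mean_squared_error"],
>>> )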

training.utils.read_kfold_config(split: dict)

KFold values reader

This function ensures that the parameters of the KFold splitting method are defined.

Parameters: split (dict) – A dictionary that contains the parameters of the KFold splitting method.
Returns:
  • n_fold - An integer that refers to the number of folds which will be used for cross-validation.
  • shuffle - A boolean. If true, data will be shuffled before splitting it to multiple folds.
  • random_state - An integer which helps to reproduce the results.
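
A hedged sketch of how the returned values could drive a Scikit-Learn splitter; the shuffle and random_state keys inside split, and the unpacking order, are assumptions based on the Returns list above:
>>> from sklearn.model_selection import KFold
>>> from training.utils import read_kfold_config
>>> split = {"method": "kfold", "fold_nr": 5, "shuffle": True, "random_state": 0}  # shuffle/random_state keys assumed
>>> n_fold, shuffle, random_state = read_kfold_config(split)
>>> kfold = KFold(n_splits=n_fold, shuffle=shuffle, random_state=random_state)
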
training.utils.create_model_directory(path: str)

Model directory creator

This function creates a directory where the model will be saved during and after training.

Parameters: path (str) – It refers to the location where the models should be saved.
training.utils.save_model_locally(path: str, model: object)

Model saver

This function saves the model locally in pickle format.

Parameters:
  • path (str) – It refers to the location where the models should be saved.
  • model (object) – An object created by the training package e.g. Scikit Learn.
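
Since the model is stored in pickle format, an equivalent hedged sketch of the idea (the directory and file name are hypothetical, not the package's naming scheme):
>>> import os
>>> import pickle
>>> from sklearn.linear_model import Ridge
>>> model = Ridge(alpha=1.0)
>>> os.makedirs("models", exist_ok=True)  # mirrors create_model_directory above
>>> with open(os.path.join("models", "ridge_model.pkl"), "wb") as handle:
>>>     pickle.dump(model, handle)        # serialize the model object to disk
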
training.utils.input_parameters_extraction(parameters: dict)

Input data parsing

Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. See the examples under model_training above.
Returns:
  • data - A dictionary that contains pandas dataframes as datasets.
  • split - A dictionary that contains information about the cross-validation method.
  • train_array - A numpy array that is used to train the model and predict the target.
  • target - A numpy array that is used to train the model.
  • predict - If provided, a pandas dataframe that contains the features without the labels (target); otherwise the boolean False
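
A hedged call sketch, assuming the unpacking order follows the Returns list above and a parameters dictionary built as in the model_training examples:
>>> from training.utils import input_parameters_extraction
>>> data, split, train_array, target, predict = input_parameters_extraction(parameters)
>>> if predict is not False:
>>>     print("datasets to predict were provided")
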
training.utils.split_dataset(features: numpy.array, target: numpy.array, tests_split_ratios: Union[list, set], stratify: bool = False) → dict

Dataset splitter

This function splits a dataset into multiple datasets such as train, valid and test.

Parameters:
  • features (np.array) – The original dataset that should be split into subsets
  • target (np.array) – The original target/labels dataset that should be predicted
  • tests_split_ratios (Union[list, set]) – A list or set of floats that represent the ratio of the size of the test dataset to the train dataset. The values should be in the range ]0, 1[ e.g. tests_split_ratios = [0.2, 0.2]
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
Returns:

sub_datasets: A dictionary that contains the test and train dataset

Return type:

dict
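
A hedged usage sketch with synthetic data, following the documented signature; the exact keys of the returned dictionary beyond train and test are not spelled out here:
>>> import numpy as np
>>> from training.utils import split_dataset
>>> features = np.random.rand(100, 4)
>>> target = np.random.rand(100)
>>> sub_datasets = split_dataset(features, target, tests_split_ratios=[0.2, 0.2], stratify=False)
>>> print(sub_datasets.keys())  # expected to contain at least the train and test subsets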