
class training.validator.StructureValidation(parameters)

Parameters structure validation

Validate the input parameters and raise an Exception if the structure of the parameters is invalid

Parameters:parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.

Features validator

All datasets given inside the parameters object should have the same features.


data key validator

The following assumptions have to be met:

  1. There is at least one dataset inside data, and it should have the key name: train
  2. All provided datasets should have two elements: features and target
  3. The value of the key features is a pandas dataframe. The target/label should not be among the features
  4. The target is a numpy array

Main keys validator

The user has to provide at least these four keys inside the parameters dictionary: data, split, model and metrics.
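
A minimal sketch of such a check, not the class's actual code. It assumes the four required keys are data, split, model and metrics (inferred from the examples, where only predict is marked optional); the function name and the exception type are hypothetical.

```python
# Assumed required keys, inferred from the parameters examples in this reference.
REQUIRED_KEYS = ("data", "split", "model", "metrics")


def check_main_keys(parameters: dict) -> None:
    """Raise KeyError if any assumed required top-level key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in parameters]
    if missing:
        raise KeyError(f"Missing required keys: {missing}")
```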


metrics key validator

The following assumptions should be met:

  1. The value of the metrics should be a list.
  2. Currently, there are only two regression metrics: r2_score and mean_squared_error, and two classification metrics: accuracy_score and roc_auc_score
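
The two assumptions above can be sketched as follows. The metric names come from the list above; the function name and the exception types are illustrative assumptions, since this reference only states that an Exception is raised.

```python
REGRESSION_METRICS = {"r2_score", "mean_squared_error"}
CLASSIFICATION_METRICS = {"accuracy_score", "roc_auc_score"}


def check_metrics(metrics) -> None:
    """Raise if metrics is not a list or contains an unsupported name."""
    if not isinstance(metrics, list):
        raise TypeError("The value of metrics should be a list")
    supported = REGRESSION_METRICS | CLASSIFICATION_METRICS
    unknown = [name for name in metrics if name not in supported]
    if unknown:
        raise ValueError(f"Unsupported metrics: {unknown}")
```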

model key validator

The following assumptions should be met:

  1. The elements type and hyperparameters should be found inside the values of the key model.
  2. The type of the value of the hyperparameters should be a dictionary.
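
A minimal sketch of these two checks, assuming the value of the model key is passed in as a plain dictionary. The function name and exception types are hypothetical.

```python
def check_model(model: dict) -> None:
    """Raise if type/hyperparameters are missing or hyperparameters is not a dict."""
    for key in ("type", "hyperparameters"):
        if key not in model:
            raise KeyError(f"The model key should contain the element: {key}")
    if not isinstance(model["hyperparameters"], dict):
        raise TypeError("The value of hyperparameters should be a dictionary")
```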

predict key validator

The following assumptions should be met:

  1. If the predict key exists, all of the datasets should have the key features
  2. The value of the features is a pandas dataframe

  3. The datasets inside the key predict have no target or labels. It is required to predict the target for those datasets.


split key validator

The following assumptions should be met:

  1. There are two elements inside the split key: method and split_ratios or fold_nr
  2. If the value of the method element is split, the second element should be split_ratios
  3. If the value of the method element is kfold, the second element should be fold_nr
  4. The split_ratios can be either a float or set/list of two floats.
  5. The split_ratios values should be in ]0, 1[
  6. The value of the fold_nr should be an integer larger than 1
  7. The method can take only two values: split or kfold
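
The seven assumptions above can be sketched as one function. This is an illustration under stated assumptions, not the validator's actual code: the function name and exception types are hypothetical, and tuples are accepted because the examples in this reference use (0.2, 0.2).

```python
def check_split(split: dict) -> None:
    """Raise if the split configuration violates the assumptions listed above."""
    method = split.get("method")
    if method not in ("split", "kfold"):
        raise ValueError("method can take only two values: split or kfold")

    if method == "split":
        if "split_ratios" not in split:
            raise KeyError("method 'split' requires the element split_ratios")
        ratios = split["split_ratios"]
        if isinstance(ratios, float):
            ratios = [ratios]
        elif isinstance(ratios, (list, tuple, set)) and len(ratios) == 2:
            ratios = list(ratios)
        else:
            raise TypeError("split_ratios should be a float or two floats")
        # Open interval ]0, 1[: both bounds are excluded.
        if not all(isinstance(r, float) and 0 < r < 1 for r in ratios):
            raise ValueError("split_ratios values should be floats in ]0, 1[")
    else:
        if "fold_nr" not in split:
            raise KeyError("method 'kfold' requires the element fold_nr")
        fold_nr = split["fold_nr"]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(fold_nr, bool) or not isinstance(fold_nr, int) or fold_nr <= 1:
            raise ValueError("fold_nr should be an integer larger than 1")
```

Note that a Python set such as {0.2, 0.2} collapses to a single element, so a list or tuple is the safer container when both ratios are equal.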

Parameters structure validator

Apply all validation methods defined inside the class StructureValidation

Parameters:parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.

n split training

This function trains a model to fit the data using n split cross-validation, e.g. train and test, or train, valid and test.

  • test_split_ratios (list) – A list that contains the test split ratio e.g. [0.2] for testing size/training size or [0.2, 0.2] for validation size/training size and testing size/(training size - validation size)
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into multiple sub-datasets based on the split ratios.
  • target (np.array) – The values of the target that will be split into multiple sub-datasets based on the split ratios.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • model_type (str) – The type of model that will be used to fit the data. Currently there are two values: Ridge linear regression and lightgbm.
  • required_metrics (list) –

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
  • save_models_dir - The name of the directory where the trained models are saved locally.

K-Fold cross-validation training

This function trains a model to fit the data using K-Fold cross-validation.

  • split (dict) – A dictionary that contains information about the K-Fold variables
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • model_type (str) – The type of model that will be used to fit the data.
  • required_metrics (list) –

  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
  • save_models_dir - The name of the directory where the trained models are saved locally.

Model training

This function trains a model to fit the data using the scikit-learn implementation of the Ridge linear model.


parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.


  • models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training.
  • save_models_dir - The name of the directory where the trained models are saved locally.

One split: train and test
>>> parameters = {
...      "data": {
...          "train": {"features": train_dataframe, "target": train_target},
...          "valid": {"features": valid_dataframe, "target": valid_target},  # optional
...          "test": {"features": test_dataframe, "target": test_target},  # optional
...      },
...      "split": {
...          "method": "split",
...          "split_ratios": 0.2,
...      },
...      "model": {"type": "Ridge linear regression",
...                "hyperparameters": {"alpha": 1,
...                                    },
...                },
...      "metrics": ["r2_score", "mean_squared_error"],
...      "predict": {  # optional
...          "test": {"features": test_dataframe}
...      }
...  }
Two splits: train, valid and test
>>> parameters = {
...      "data": {
...          "train": {"features": train_dataframe, "target": train_target},
...          "valid": {"features": valid_dataframe, "target": valid_target},  # optional
...          "test": {"features": test_dataframe, "target": test_target},  # optional
...      },
...      "split": {
...          "method": "split",
...          "split_ratios": (0.2, 0.2),  # or [0.2, 0.2]
...      },
...      "model": {"type": "Ridge linear regression",
...                "hyperparameters": {"alpha": 1,
...                                    },
...                },
...      "metrics": ["r2_score", "mean_squared_error"],
...      "predict": {  # optional
...          "test": {"features": test_dataframe}
...      }
...  }
KFold cross-validation:
>>> parameters = {
...      "data": {
...          "train": {"features": train_dataframe, "target": train_target},
...          "valid": {"features": valid_dataframe, "target": valid_target},  # optional
...          "test": {"features": test_dataframe, "target": test_target},  # optional
...      },
...      "split": {
...          "method": "kfold",
...          "fold_nr": 5,
...      },
...      "model": {"type": "Ridge linear regression",
...                "hyperparameters": {"alpha": 1,
...                                    },
...                },
...      "metrics": ["r2_score", "mean_squared_error"],
...      "predict": {  # optional
...          "test": {"features": test_dataframe}
...      }
...  }
KFold cross-validation with alpha optimization:
>>> parameters = {
...      "data": {
...          "train": {"features": train_dataframe, "target": train_target},
...          "valid": {"features": valid_dataframe, "target": valid_target},  # optional
...          "test": {"features": test_dataframe, "target": test_target},  # optional
...      },
...      "split": {
...          "method": "kfold",
...          "fold_nr": 5,
...      },
...      "model": {"type": "Ridge linear regression",
...                "hyperparameters": {"alpha": "optimize",
...                                    },
...                },
...      "metrics": ["r2_score", "mean_squared_error"],
...      "predict": {  # optional
...          "test": {"features": test_dataframe}
...      }
...  }
training.optimizer.training_for_optimizing(alpha_i: float, x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array, help_text: str) → float


This function trains the ridge linear regression model given a certain alpha.

  • alpha_i (float) – The regularization hyperparameter used by the ridge linear regression to reduce over-fitting.
  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
  • help_text (str) – A string to show useful information about the training cross-validation method

  • r2_linear - Coefficient of determination for a given alpha and testing dataset

training.optimizer.get_best_alpha_split(x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array) → float

alpha optimizer for two datasets split

This function finds the best alpha value based on the coefficient of determination. It will be replaced by a native optimization method from another package.

  • x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
  • y_train (np.array) – The target values which are used to train the model.
  • x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
  • y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2

best_alpha: The alpha value that maximizes the coefficient of determination R2
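
The idea of picking the alpha with the highest R2 on the test split can be sketched in pure Python. This is not the package's implementation: it assumes a single feature with no intercept (so the ridge weight has a closed form), and the candidate grid, function names and argument types are hypothetical.

```python
def ridge_weight_1d(x, y, alpha):
    """Closed-form ridge weight for one feature and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def best_alpha_split_sketch(x_train, y_train, x_test, y_test, candidates):
    """Return the candidate alpha with the highest R2 on the test split."""
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        weight = ridge_weight_1d(x_train, y_train, alpha)
        score = r2(y_test, [weight * value for value in x_test])
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha


# Noise-free data with y = 2x: no regularization fits best.
chosen = best_alpha_split_sketch([1, 2, 3], [2, 4, 6], [4, 5], [8, 10], [0.0, 1.0, 10.0])
```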

training.optimizer.get_best_alpha_kfold(kfold, train_array: numpy.array, target: numpy.array)

alpha optimizer for K-Fold cross validation

  • kfold
  • train_array – The values of the features which are used to train the model to predict the target
  • target – The target values which are used to train the model.

best_alpha: The alpha value that maximizes the coefficient of determination R2

training.xgboost_train.xgboost_data_preparation(validation_list: list, dataframe: pandas.core.frame.DataFrame, target: numpy.array, key: str)

xgboost data preparing for training

The function transforms the data from a Pandas dataframe format to an xgboost-compatible format.

  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
  • target (np.array) – An array that contains the target that should be predicted by the xgboost model
  • key (str) – A label that is used to name the dataset in the validation_list

  • The updated validation_list

training.xgboost_train.xgboost_regression_train(validation_list: list, hyperparameters: dict, num_round: int = 10)

xgboost trainer

The function uses the xgboost framework to train the model

  • validation_list (list) – The list that contains the data that should be used to train and validate the model.
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting

  • xgboost model

training.xgboost_train.xgboost_data_preparation_to_predict(dataframe: pandas.core.frame.DataFrame)

xgboost data preparing for prediction

The function transforms the data from a Pandas dataframe format to an xgboost-compatible format.

Parameters:dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
  • The dataset in xgboost-compatible format
training.xgboost_train.training_xgboost_n_split(sub_datasets: dict, hyperparameters: dict, num_round: int = 10)

XGboost training with n-split

This function trains a model to fit the data using n split cross-validation, e.g. train and test, or train, valid and test.

  • sub_datasets (dict) –
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting

  • model: xgboost model.
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.

training.xgboost_train.training_xgboost_kfold(train_array, target, train: list, test: list, hyperparameters: dict, num_round: int = 10)

XGboost training with kfold

This function trains a model to fit the data using K-Fold cross-validation.

  • train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
  • target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
  • train (list) – A list of integers that define the training dataset
  • test (list) – A list of integers that define the testing dataset
  • hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round (int) – The number of rounds for boosting

  • kfold_model: xgboost model
  • problem_to_solve: string that defines the problem to solve: regression or classification.
  • validation_list: The list that contains the data that should be used to train and validate the model.
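
To illustrate what the train and test index lists might look like, here is a pure-Python sketch that produces contiguous, unshuffled folds. How the package actually generates these indices is not stated in this reference; the function name is hypothetical.

```python
def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("n_folds should be in the range [2, n_samples]")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(5, 2))
```

Each pair can then select rows from train_array and target, so that every sample is used for testing exactly once.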

training.xgboost_train.get_num_round(hyperparameters) → int

num_round getter

Get the value of num_round that will be used to train the xgboost model

Parameters:hyperparameters – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
  • num_round: The number of rounds for boosting
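
A minimal sketch of such a getter, assuming the value is stored under a num_round key inside hyperparameters and falls back to 10, the default shown in the training signatures above. Both the key name and the fallback behavior are assumptions.

```python
def get_num_round_sketch(hyperparameters: dict, default: int = 10) -> int:
    """Return the number of boosting rounds, falling back to a default."""
    # "num_round" as a hyperparameters key is an assumption, not a documented fact.
    return int(hyperparameters.get("num_round", default))
```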
training.xgboost_train.xgboost_data_preparation_for_evaluation(data: dict)

Data preparation for evaluation

Prepare the data in a form that could be used for model evaluation.

training.model_evaluator.load_all_models(save_models_dir: str, model_type: str, model_i: int)

Model loader

Load saved models of a given type.

  • save_models_dir (str) – directory where the model is saved
  • model_type (str) – regression or classification
  • model_i (int) – index used to distinguish models of the same type trained on different datasets.

training.model_evaluator.evaluate_model(model, xs: list, ys: list, labels: list, metrics: list)

Model evaluator

This function reports the values of the metrics R2 and MSE for different datasets when evaluating the trained model.

  • model – An object created by the training package e.g. Scikit Learn.
  • xs (list) – Every element is a np.array of the features that are used to predict the target variable.
  • ys (list) – Every element is a np.array of the target variable.
  • labels (list) – Every element is a string that is used to label every (x,y) pair and refers to their origin.
  • metrics (list) – list of metrics used to evaluate model.

metrics_summary (dict): all metrics from metrics applied to all (y, y_pred=model(x)) pairs.
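
One possible shape of this evaluation loop, sketched in pure Python. The layout of metrics_summary (one inner dictionary per label) is an assumption, the model is replaced by a plain predict callable, and the metric helpers are simplified stand-ins rather than the scikit-learn functions.

```python
def _mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def _r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


METRIC_FUNCTIONS = {"mean_squared_error": _mse, "r2_score": _r2}


def evaluate_model_sketch(predict, xs, ys, labels, metrics):
    """Apply every requested metric to every labelled (x, y) pair."""
    metrics_summary = {}
    for x, y, label in zip(xs, ys, labels):
        y_pred = predict(x)
        metrics_summary[label] = {
            name: METRIC_FUNCTIONS[name](y, y_pred) for name in metrics
        }
    return metrics_summary


summary = evaluate_model_sketch(
    lambda x: [2 * value for value in x],  # toy stand-in for a trained model
    xs=[[1, 2], [3, 4]],
    ys=[[2, 4], [6, 9]],
    labels=["train", "test"],
    metrics=["mean_squared_error", "r2_score"],
)
```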

training.utils.read_kfold_config(split: dict)

KFold values reader

This function ensures that the parameters of the KFold splitting method are defined.

Parameters:split (dict) – A dictionary that contains the parameters about the KFold splitting method.
  • n_fold - An integer that refers to the number of folds which will be used for cross-validation.
  • shuffle - A boolean. If true, data will be shuffled before splitting it to multiple folds.
  • random_state - An integer which helps to reproduce the results.
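
A sketch of how the three values might be read with fallbacks. The fold_nr key comes from the split examples in this reference; the shuffle and random_state key names and all default values are assumptions.

```python
def read_kfold_config_sketch(split: dict):
    """Return (n_fold, shuffle, random_state), filling in assumed defaults."""
    n_fold = split.get("fold_nr", 5)              # assumed default
    shuffle = split.get("shuffle", True)          # assumed key and default
    random_state = split.get("random_state", 1)   # assumed key and default
    return n_fold, shuffle, random_state
```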
training.utils.create_model_directory(path: str)

Model directory creator

This function creates a directory where the model will be saved during and after training.

Parameters:path (str) – It refers to the location where the models should be saved.
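
A minimal sketch of this behavior using the standard library. Whether the actual function tolerates an existing directory is not stated here; this version does, via exist_ok=True, and the function name is hypothetical.

```python
import os
import tempfile


def create_model_directory_sketch(path: str) -> None:
    """Create the directory, including parents, if it does not exist yet."""
    os.makedirs(path, exist_ok=True)


# Usage inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "models", "ridge")
    create_model_directory_sketch(target)
    create_model_directory_sketch(target)  # calling it again does not fail
    created = os.path.isdir(target)
```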
training.utils.save_model_locally(path: str, model: object)

Model saver

This function saves the model locally in pickle format.

  • path (str) – It refers to the location where the models should be saved.
  • model (object) – An object created by the training package e.g. Scikit Learn.
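
Saving in pickle format can be sketched with the standard library as follows. The function names and the file name are hypothetical, and a plain dictionary stands in for a trained model.

```python
import os
import pickle
import tempfile


def save_model_locally_sketch(path: str, model: object) -> None:
    """Serialize the model to the given file path in pickle format."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model_sketch(path: str) -> object:
    """Load a pickled model back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a stand-in object instead of a trained model.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    save_model_locally_sketch(model_path, {"alpha": 1.0})
    restored = load_model_sketch(model_path)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.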
training.utils.input_parameters_extraction(parameters: dict)

Input data parsing

Parameters:parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.
  • data - A dictionary that contains pandas dataframes as datasets.
  • split - A dictionary that contains information about the cross-validation method.
  • train_array - A numpy array that is used to train the model and predict the target.
  • target - A numpy array that is used to train the model.
  • predict - If provided, a pandas dataframe that contains the features without the labels (target). Otherwise bool: False
training.utils.split_dataset(features: numpy.array, target: numpy.array, tests_split_ratios: Union[list, set], stratify: bool = False) → dict

Dataset splitter

This function splits a dataset into multiple datasets such as train, valid and test.

  • features (np.array) – The original dataset that should be split to subsets
  • target (np.array) – The original target/labels dataset that should be predicted
  • tests_split_ratios (Union[list, set]) – A list or set of floats that represent the ratio of the size of the test dataset to the train dataset. The values should be in the range ]0, 1[, e.g. tests_split_ratios = [0.2, 0.2]
  • stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.

sub_datasets: A dictionary that contains the test and train dataset

Return type:dict
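
The size arithmetic behind the split ratios can be sketched as follows, under one reading of the ratio description: each ratio is applied in turn to whatever remains after the previous split. The function name is hypothetical, and the rounding rule is an assumption.

```python
def split_sizes(n_samples: int, tests_split_ratios):
    """Return (train_size, held_out_sizes) after applying each ratio in turn."""
    remaining = n_samples
    held_out = []
    for ratio in tests_split_ratios:
        if not 0 < ratio < 1:
            raise ValueError("split ratios should be in ]0, 1[")
        size = round(remaining * ratio)  # rounding rule is an assumption
        held_out.append(size)
        remaining -= size
    return remaining, held_out


one_split = split_sizes(100, [0.2])
two_splits = split_sizes(100, [0.2, 0.2])
```

With 100 samples, [0.2] leaves 80 samples for training and holds out 20, while [0.2, 0.2] holds out 20 and then 16 of the remaining 80, leaving 64 for training.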
