Training¶
-
class
training.validator.
StructureValidation
(parameters)¶ Parameters structure validation
Validate the input parameters and raise an Exception if the structure of the parameters is invalid
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
Raises: ValueError, TypeError
-
__init__
(parameters)¶ Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
-
features_validator
()¶ Features validator
All datasets given inside the parameters object should have the same features.
Returns:
-
validate_data
()¶ data key validator
The following assumptions have to be met (see the sketch after this list):
- There is at least one dataset inside data, and it must have the key name train
- All provided datasets should have two elements: features and target
- The value of the key features is a pandas dataframe. The target/label should not be among the features
- The target is a numpy array
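For illustration, a minimal sketch of a data entry that satisfies these assumptions (the column names and values are hypothetical):
>>> import numpy as np
>>> import pandas as pd
>>> # Features as a pandas dataframe that does not contain the target column
>>> train_dataframe = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.1, 0.2, 0.3]})
>>> # Target as a numpy array
>>> train_target = np.array([10.0, 20.0, 30.0])
>>> data = {
>>>     "train": {"features": train_dataframe, "target": train_target},
>>>     # further datasets (e.g. "valid", "test") must follow the same structure
>>> }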
-
validate_main_keys
()¶ Main keys validator
The user has to provide at least these four keys inside the parameters dictionary: data, split, model and metrics.
-
validate_metrics
()¶ metrics key validator
The following assumptions should be met:
- The value of the metrics should be a list.
- Currently, there are only two regression metrics: r2_score and mean_squared_error, and two classification metrics: accuracy_score and roc_auc_score
-
validate_model
()¶ model key validator
The following assumptions should be met:
- The elements type and hyperparameters should be found inside the values of the key model.
- The type of the value of the hyperparameters should be a dictionary.
-
validate_predict
()¶ predict key validator
The following assumptions should be met:
- If the predict key exists, all of the datasets should have the key features
- The value of the features is a pandas dataframe
- The datasets inside the key predict have no target or labels. It is required to predict the target for those datasets.
-
validate_split
()¶ split key validator
The following assumptions should be met (see the examples after this list):
- There are two elements inside the split key: method and either split_ratios or fold_nr
- If the value of the method element is split, the second element should be split_ratios
- If the value of the method element is kfold, the second element should be fold_nr
- The split_ratios can be either a float or a set/list of two floats.
- The split_ratios values should be in the open interval (0, 1)
- The value of fold_nr should be an integer larger than 1
- The method can take only two values: split or kfold
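As an illustration, three split configurations consistent with these assumptions (the concrete values are hypothetical):
>>> # Single hold-out split: 20% of the data is used for testing
>>> split_once = {"method": "split", "split_ratios": 0.2}
>>> # Two splits: validation and test sets
>>> split_twice = {"method": "split", "split_ratios": (0.2, 0.2)}
>>> # 5-fold cross-validation
>>> split_kfold = {"method": "kfold", "fold_nr": 5}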
-
-
training.validator.
parameters_validator
(parameters)¶ Parameters structure validator
Apply all validation methods defined inside the class StructureValidation
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations.
-
training.training.
train_with_n_split
(test_split_ratios: list, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type: str, required_metrics: list)¶ n split training
This function trains a model to fit the data using n-split cross-validation, e.g. train/test or train/valid/test (see the illustrative call after this entry)
Parameters: - test_split_ratios (list) – A list that contains the test split ratio(s), e.g. [0.2] for testing size/training size, or [0.2, 0.2] for validation size/training size and testing size/(training size - validation size)
- stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
- hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
- train_array (np.array) – The values of the features that will be split into multiple sub-datasets based on the split values.
- target (np.array) – The values of the target that will be split into multiple sub-datasets based on the split values.
- models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
- model_type (str) – The type of model that will be used to fit the data. Currently there are two values: Ridge linear regression and lightgbm.
- required_metrics (list) –
Returns: - models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there is only one model.
- save_models_dir - The name of the directory where the trained models are saved locally.
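An illustrative call based on the documented signature; the argument values are hypothetical, and the returned pair is the one described above:
>>> import numpy as np
>>> from training.training import train_with_n_split
>>> # Hypothetical feature and target arrays
>>> train_array = np.random.rand(100, 5)
>>> target = np.random.rand(100)
>>> models_nr, save_models_dir = train_with_n_split(
>>>     test_split_ratios=[0.2],  # single train/test split
>>>     stratify=False,
>>>     hyperparameters={"alpha": 1},
>>>     train_array=train_array,
>>>     target=target,
>>>     models_nr=[],  # indexes of the trained models are collected here
>>>     model_type="Ridge linear regression",
>>>     required_metrics=["r2_score", "mean_squared_error"],
>>> )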
-
training.training.
train_with_kfold_cross_validation
(split: dict, stratify: bool, hyperparameters: dict, train_array: numpy.array, target: numpy.array, models_nr: list, model_type, required_metrics: list)¶ K-Fold cross-validation training
This function trains a model to fit the data using K-Fold cross-validation.
Parameters: - split (dict) – A dictionary that contains information about the K-Fold variables
- stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
- hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
- train_array (np.array) – The values of the features that will be split into K-Folds and used to train the model to predict the target
- target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
- models_nr (list) – A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
- model_type (str) – The type of model that will be used to fit the data.
- required_metrics (list) –
Returns: - models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training. In this case there are n_fold models.
- save_models_dir - The name of the directory where the trained models are saved locally.
-
training.training.
model_training
(parameters: dict)¶ Model training
This function trains a model to fit the data using the Scikit Learn implementation of the Ridge linear model
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.
Returns: - models_nr - A list of indexes that will be used to point to the trained models which will be saved locally after training.
- save_models_dir - The name of the directory where the trained models are saved locally.
Example: - One split: train and test
>>> parameters = {
>>>     "data": {
>>>         "train": {"features": train_dataframe, "target": train_target},
>>>         "valid": {"features": valid_dataframe, "target": valid_target},  # optional
>>>         "test": {"features": test_dataframe, "target": test_target},  # optional
>>>     },
>>>     "split": {
>>>         "method": "split",
>>>         "split_ratios": 0.2,
>>>     },
>>>     "model": {
>>>         "type": "Ridge linear regression",
>>>         "hyperparameters": {"alpha": 1},
>>>     },
>>>     "metrics": ["r2_score", "mean_squared_error"],
>>>     "predict": {  # optional
>>>         "test": {"features": test_dataframe}
>>>     }
>>> }
- Two splits: train, valid and test
>>> parameters = {
>>>     "data": {
>>>         "train": {"features": train_dataframe, "target": train_target},
>>>         "valid": {"features": valid_dataframe, "target": valid_target},  # optional
>>>         "test": {"features": test_dataframe, "target": test_target},  # optional
>>>     },
>>>     "split": {
>>>         "method": "split",
>>>         "split_ratios": (0.2, 0.2),  # or [0.2, 0.2]
>>>     },
>>>     "model": {
>>>         "type": "Ridge linear regression",
>>>         "hyperparameters": {"alpha": 1},
>>>     },
>>>     "metrics": ["r2_score", "mean_squared_error"],
>>>     "predict": {  # optional
>>>         "test": {"features": test_dataframe}
>>>     }
>>> }
- KFold cross-validation:
>>> parameters = {
>>>     "data": {
>>>         "train": {"features": train_dataframe, "target": train_target},
>>>         "valid": {"features": valid_dataframe, "target": valid_target},  # optional
>>>         "test": {"features": test_dataframe, "target": test_target},  # optional
>>>     },
>>>     "split": {
>>>         "method": "kfold",
>>>         "fold_nr": 5,
>>>     },
>>>     "model": {
>>>         "type": "Ridge linear regression",
>>>         "hyperparameters": {"alpha": 1},
>>>     },
>>>     "metrics": ["r2_score", "mean_squared_error"],
>>>     "predict": {  # optional
>>>         "test": {"features": test_dataframe}
>>>     }
>>> }
- KFold cross-validation with alpha optimization:
>>> parameters = {
>>>     "data": {
>>>         "train": {"features": train_dataframe, "target": train_target},
>>>         "valid": {"features": valid_dataframe, "target": valid_target},  # optional
>>>         "test": {"features": test_dataframe, "target": test_target},  # optional
>>>     },
>>>     "split": {
>>>         "method": "kfold",
>>>         "fold_nr": 5,
>>>     },
>>>     "model": {
>>>         "type": "Ridge linear regression",
>>>         "hyperparameters": {"alpha": "optimize"},
>>>     },
>>>     "metrics": ["r2_score", "mean_squared_error"],
>>>     "predict": {  # optional
>>>         "test": {"features": test_dataframe}
>>>     }
>>> }
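With any of the parameter dictionaries above, the training entry point is then invoked as follows (the returned pair is the one described above):
>>> from training.training import model_training
>>> models_nr, save_models_dir = model_training(parameters)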
-
training.optimizer.
training_for_optimizing
(alpha_i: float, x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array, help_text: str) → float¶ Trainer
This function trains the ridge linear regression model given a certain alpha.
Parameters: - alpha_i (float) – A hyperparameter which used by the ridge linear regression to avoid over-fitting.
- x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
- y_train (np.array) – The target values which are used to train the model.
- x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
- y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
- help_text (str) – A string to show useful information about the training cross-validation method
Returns: - r2_linear - Coefficient of determination for a given alpha and testing dataset
-
training.optimizer.
get_best_alpha_split
(x_train: numpy.array, y_train: numpy.array, x_test: numpy.array, y_test: numpy.array) → float¶ alpha optimizer for a two-dataset split
This function finds the best alpha value based on the coefficient of determination. This function will be replaced by a native optimization method from other packages
Parameters: - x_train (np.array) – The values of the features which are used to train the model to predict the target y_train
- y_train (np.array) – The target values which are used to train the model.
- x_test (np.array) – The values of the features which are used to evaluate the model by predicting the target y_test
- y_test (np.array) – The target values which are used to evaluate the performance of the model based on the coefficient of determination R2
Returns: best_alpha: The alpha value that maximizes the coefficient of determination R2
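Conceptually, the optimization is a search over candidate alpha values that keeps the one with the highest R2 on the test data. The following is a minimal sketch of that idea written directly against scikit-learn; the candidate grid and the helper name are hypothetical, and this is not the package's implementation:
>>> import numpy as np
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import r2_score
>>> def sketch_best_alpha(x_train, y_train, x_test, y_test):
>>>     """Return the alpha from a hypothetical grid that maximizes R2 on the test data."""
>>>     best_alpha, best_r2 = None, -np.inf
>>>     for alpha_i in [0.001, 0.01, 0.1, 1, 10, 100]:
>>>         model = Ridge(alpha=alpha_i).fit(x_train, y_train)
>>>         r2_linear = r2_score(y_test, model.predict(x_test))
>>>         if r2_linear > best_r2:
>>>             best_alpha, best_r2 = alpha_i, r2_linear
>>>     return best_alpha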
-
training.optimizer.
get_best_alpha_kfold
(kfold, train_array: numpy.array, target: numpy.array)¶ alpha optimizer for K-Fold cross validation
Parameters: - kfold –
- train_array – The values of the features which are used to train the model to predict the target target
- target – The target values which are used to train the model.
Returns: best_alpha: The alpha value that maximizes the coefficient of determination R2
-
training.xgboost_train.
xgboost_data_preparation
(validation_list: list, dataframe: pandas.core.frame.DataFrame, target: numpy.array, key: str)¶ xgboost data preparation for training
The function transforms the data from a Pandas dataframe format to an xgboost-compatible format.
Parameters: - validation_list (list) – The list that contains the data that should be used to train and validate the model.
- dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
- target (np.array) – An array that contains the target that should be predicted by the xgboost model
- key (str) – A label that is used to name the dataset in the validation_list
Returns: - The updated validation_list
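A hedged sketch of what such a preparation step can look like with the xgboost API (the helper name is hypothetical and this is not necessarily the function's exact implementation):
>>> import xgboost as xgb
>>> def sketch_xgboost_data_preparation(validation_list, dataframe, target, key):
>>>     """Convert a pandas dataframe and target array into a DMatrix and register it."""
>>>     dmatrix = xgb.DMatrix(dataframe, label=target)
>>>     validation_list.append((dmatrix, key))  # (dataset, name) pairs as consumed by xgb.train evals
>>>     return validation_list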
-
training.xgboost_train.
xgboost_regression_train
(validation_list: list, hyperparameters: dict, num_round: int = 10)¶ xgboost trainer
The function uses the xgboost framework to train the model
Parameters: - validation_list (list) – The list that contains the data that should be used to train and validate the model.
- hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
- num_round (int) – The number of rounds for boosting
Returns: - xgboost model
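A hedged sketch of such a training call using the xgboost API; the hyperparameter values are hypothetical and the helper name is illustrative:
>>> import xgboost as xgb
>>> def sketch_xgboost_regression_train(validation_list, hyperparameters, num_round=10):
>>>     """Train a booster; validation_list holds (DMatrix, name) pairs, the first being the training set."""
>>>     dtrain = validation_list[0][0]
>>>     params = dict(hyperparameters)  # e.g. {"eta": 0.1, "max_depth": 4, "objective": "reg:squarederror"}
>>>     return xgb.train(params, dtrain, num_boost_round=num_round, evals=validation_list)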
-
training.xgboost_train.
xgboost_data_preparation_to_predict
(dataframe: pandas.core.frame.DataFrame)¶ xgboost data preparation for prediction
The function transforms the data from a Pandas dataframe format to a xgboost-compatible format
Parameters: dataframe (pd.DataFrame) – Pandas dataframe that contains the data which will be transformed to xgboost format.
Returns: - The dataset in xgboost-compatible format
-
training.xgboost_train.
training_xgboost_n_split
(sub_datasets: dict, hyperparameters: dict, num_round: int = 10)¶ XGboost training with n-split
This function trains a model to fit the data using n-split cross-validation, e.g. train/test or train/valid/test
Parameters: - sub_datasets (dict) –
- hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
- num_round (int) – The number of rounds for boosting
Returns: - model: xgboost model.
- problem_to_solve: string that defines the problem to solve: regression or classification.
- validation_list: The list that contains the data that should be used to train and validate the model.
-
training.xgboost_train.
training_xgboost_kfold
(train_array, target, train: list, test: list, hyperparameters: dict, num_round: int = 10)¶ XGboost training with kfold
This function trains a model to fit the data using K-Fold cross-validation.
Parameters: - train_array (np.array) – The values of the target that will be split into K-Folds and used to train the model to predict the target
- target (np.array) – The values of the target that will be split into K-Folds and used to train the model.
- train (list) – A list of integers that define the training dataset
- test (list) – A list of integers that define the testing dataset
- hyperparameters (dict) – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
- num_round (int) – The number of rounds for boosting
Returns: - kfold_model: xgboost model
- problem_to_solve: string that defines the problem to solve: regression or classification.
- validation_list: The list that contains the data that should be used to train and validate the model.
-
training.xgboost_train.
get_num_round
(hyperparameters) → int¶ num_round getter
Get the value of num_round that will be used to train the xgboost model
Parameters: hyperparameters – A dictionary that contains the hyperparameters which the selected training method needs to train the model.
Returns: - num_round: The number of rounds for boosting
-
training.xgboost_train.
xgboost_data_preparation_for_evaluation
(data: dict)¶ Data preparation for evaluation
Prepare the data in a form that could be used for model evaluation.
Parameters: data – Returns:
-
training.model_evaluator.
load_all_models
(save_models_dir: str, model_type: str, model_i: int)¶ Model loader
Load saved models of a given type.
Parameters: - save_models_dir (str) – directory where the model is saved
- model_type (str) – regression or classification
- model_i (int) – index used to distinguish models of the same type trained on different datasets.
Returns:
-
training.model_evaluator.
evaluate_model
(model, xs: list, ys: list, labels: list, metrics: list)¶ Model evaluator
This function shows the values of the metrics R2 and MSE for different datasets when evaluating the trained model.
Parameters: - model – An object created by the training package e.g. Scikit Learn.
- xs (list) – Every element is a np.array of the features that are used to predict the target variable.
- ys (list) – Every element is a np.array of the target variable.
- labels (list) – Every element is a string that is used to label every (x,y) pair and refers to their origin.
- metrics (list) – list of metrics used to evaluate model.
Returns: metrics_summary (dict): all metrics from metrics applied to all (y, y_pred=model(x)) pairs.
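A minimal sketch of such an evaluation loop, assuming the metric names map to the scikit-learn functions listed under validate_metrics (the helper name is hypothetical):
>>> from sklearn.metrics import r2_score, mean_squared_error
>>> METRIC_FUNCTIONS = {"r2_score": r2_score, "mean_squared_error": mean_squared_error}
>>> def sketch_evaluate_model(model, xs, ys, labels, metrics):
>>>     """Apply every requested metric to every (y, y_pred=model.predict(x)) pair."""
>>>     metrics_summary = {}
>>>     for x, y, label in zip(xs, ys, labels):
>>>         y_pred = model.predict(x)
>>>         metrics_summary[label] = {m: METRIC_FUNCTIONS[m](y, y_pred) for m in metrics}
>>>     return metrics_summary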
-
training.utils.
read_kfold_config
(split: dict)¶ KFold values reader
This function ensures that the parameters of the KFold splitting method are defined.
Parameters: split (dict) – A dictionary that contains the parameters about the KFold splitting method.
Returns: - n_fold - An integer that refers to the number of folds which will be used for cross-validation.
- shuffle - A boolean. If true, data will be shuffled before splitting it to multiple folds.
- random_state - An integer which helps to reproduce the results.
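A hedged sketch of what such a reader might look like; whether these keys are read from split and which defaults apply are assumptions:
>>> def sketch_read_kfold_config(split):
>>>     """Extract K-Fold parameters from the split dictionary, falling back to assumed defaults."""
>>>     n_fold = split.get("fold_nr", 5)  # assumed default
>>>     shuffle = split.get("shuffle", True)  # assumed key and default
>>>     random_state = split.get("random_state", 0)  # assumed key and default
>>>     return n_fold, shuffle, random_state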
-
training.utils.
create_model_directory
(path: str)¶ Model directory creator
This function creates a directory where the model will be saved during and after training.
Parameters: path (str) – It refers to the location where the models should be saved.
-
training.utils.
save_model_locally
(path: str, model: object)¶ Model saver
This function saves the model locally in pickle format.
Parameters: - path (str) – It refers to the location where the models should be saved.
- model (object) – An object created by the training package e.g. Scikit Learn.
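A minimal sketch of these two helpers using the standard library and pickle (not necessarily the exact implementation):
>>> import os
>>> import pickle
>>> def sketch_create_model_directory(path):
>>>     """Create the directory for saved models if it does not exist yet."""
>>>     os.makedirs(path, exist_ok=True)
>>> def sketch_save_model_locally(path, model):
>>>     """Serialize the trained model to the given path in pickle format."""
>>>     with open(path, "wb") as model_file:
>>>         pickle.dump(model, model_file)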
-
training.utils.
input_parameters_extraction
(parameters: dict)¶ Input data parsing
Parameters: parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the examples under model_training.
Returns: - data - A dictionary that contains pandas dataframes as datasets.
- split - A dictionary that contains information about the cross-validation method.
- train_array - A numpy array that is used to train the model and predict the target.
- target - A numpy array that is used to train the model.
- predict - If provided, a pandas dataframe that contains the features without the labels (target). Otherwise, the boolean False.
-
training.utils.
split_dataset
(features: numpy.array, target: numpy.array, tests_split_ratios: Union[list, set], stratify: bool = False) → dict¶ Dataset splitter
This function splits a dataset into multiple datasets such as train, valid and test.
Parameters: - features (np.array) – The original dataset that should be split to subsets
- target (np.array) – The original target/labels dataset that should be predicted
- tests_split_ratios (Union[list, set]) – A list or set of floats that represent the ratio of the size of the test dataset to the train dataset. The values should be in the open interval (0, 1), e.g. tests_split_ratios = [0.2, 0.2]
- stratify (bool) – If set to True, the ratios of the labels are kept the same in the split datasets.
Returns: sub_datasets: A dictionary that contains the test and train dataset
Return type: dict
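A hedged sketch of one way to implement such a splitter with scikit-learn; the helper name and the key names inside the returned dictionary are assumptions:
>>> from sklearn.model_selection import train_test_split
>>> def sketch_split_dataset(features, target, tests_split_ratios, stratify=False):
>>>     """Split (features, target) once per ratio, e.g. [0.2, 0.2] -> train/valid/test."""
>>>     sub_datasets, x_rest, y_rest = {}, features, target
>>>     for i, ratio in enumerate(tests_split_ratios):
>>>         strat = y_rest if stratify else None
>>>         x_rest, x_split, y_rest, y_split = train_test_split(
>>>             x_rest, y_rest, test_size=ratio, stratify=strat)
>>>         sub_datasets[f"split_{i}"] = (x_split, y_split)  # hypothetical key naming
>>>     sub_datasets["train"] = (x_rest, y_rest)
>>>     return sub_datasets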