Flows

Flows pieces collector

This module contains the Flows class, which is a container of the methods that construct the flows.

The module depends on classes and methods that are written in other packages of this project.

class flows.flows.Flows(flow_id: int, categorical_threshold: int = 50, kl_div_threshold: float = 0.05)

Flows methods container

A class which is meant to be a container for all flow elements.

Parameters:
  • flow_id (int) – An integer which points to the flow that the user wants to follow.
  • categorical_threshold (int) – The maximum number of categories that a categorical feature may have before it is considered a continuous numeric feature.
  • commands (object) – It contains the list of instructions that are loaded from the YAML file.
  • columns_set (dict) – A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
Methods:
  • guidance - Evaluate YAML commands
  • load_data - Read CSV data
  • encode_categorical_feature - Encode the categorical features by changing string values to numeric values
  • scale_data - Scale the numerical values (Feature Standardization - mean = 0, STD = 1)
  • one_hot_encoding - Encode categorical features using one-hot encoding method
  • training_ridge_linear_model - Train a model using the regression Scikit Learn Ridge linear model implementation
  • training_lightgbm - Train a tree-based regression model using the LightGBM implementation
__init__(flow_id: int, categorical_threshold: int = 50, kl_div_threshold: float = 0.05)
Parameters:
  • flow_id (int) – An integer which points to the flow that the user wants to follow.
  • categorical_threshold (int) – The maximum number of categories that a categorical feature may have before it is considered a continuous numeric feature.
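
Example (a minimal sketch; it assumes a YAML flow definition with id 0 exists in the project):
>>> from flows.flows import Flows
>>> flow = Flows(flow_id=0)
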
static comparing_statistics(dataframe_dict: dict)

Datasets statistics visualizer

This function visualizes the statistical properties of the given datasets. It plots those properties in a single graph, which helps the user obtain an overview of the data distribution across the different datasets. It is an interactive function and was therefore designed to run in a Jupyter notebook.

Parameters:dataframe_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
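
Example (a sketch; it assumes flow is a Flows instance, dataframes_dict comes from load_data, and the code runs in a Jupyter notebook):
>>> flow.comparing_statistics(dataframes_dict)
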
drop_columns_constant_values(dataframes_dict: dict, ignore_columns: list, drop_columns: bool = True, print_columns: bool = True, _reference: Union[bool, str] = False)

Constant value features eliminator

Parameters:
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • ignore_columns (list) – It contains the columns that should be ignored e.g. the id and the target.
  • drop_columns (bool) – If true, the columns that contain constant values along all the rows will be dropped.
  • print_columns (bool) – If true, information about the columns that contain constant values will be printed to the console
  • _reference (Union[bool, str]) – The reference dataframe which is used when applying functions to other dataframes. The default value is the first dataframe inside the dataframes_dict dictionary. Usually it is the train dataframe.
Returns:

  • dataframes_dict - A dictionary that contains Pandas dataframes after dropping features with constant values e.g. dataframes_dict={ “train”: train_dataframe, “test”: test_dataframe}
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
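
Example (a sketch; the "id" and "target" column names are placeholders):
>>> dataframes_dict, columns_set = flow.drop_columns_constant_values(dataframes_dict, ignore_columns=["id", "target"])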

drop_correlated_columns(dataframes_dict: dict, ignore_columns: list, drop_columns: bool = True, print_columns: bool = True, threshold: float = 0.98, _reference: Union[bool, str] = False)

Correlation eliminator

The function drops correlated columns and keeps only one of them.

Parameters:
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • ignore_columns (list) – It contains the columns that should be ignored e.g. the id and the target.
  • drop_columns (bool) – If true, all correlated columns will be dropped but one.
  • print_columns (bool) – If True, information about the correlated columns will be printed to the console.
  • threshold (float) – A value between 0 and 1. If the correlation between two columns is larger than this value, they are considered highly correlated. If drop_columns is True, one of those columns will be dropped. The recommended value of the threshold is in [0.7 … 1].
  • _reference (Union[bool, str]) – The reference dataframe which is used when applying functions to other dataframes. The default value is the first dataframe inside the dataframes_dict dictionary. Usually it is the train dataframe.
Returns:

  • dataframes_dict - A dictionary that contains Pandas dataframes after dropping correlated columns e.g. dataframes_dict={ “train”: train_dataframe, “test”: test_dataframe}
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
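
Example (a sketch; the column names and the threshold value are placeholders):
>>> dataframes_dict, columns_set = flow.drop_correlated_columns(dataframes_dict, ignore_columns=["id", "target"], threshold=0.98)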

encode_categorical_feature(dataframes_dict: dict, print_results: Union[bool, int] = False, _reference: Union[bool, str] = False)

Categorical features encoder

This function encodes the categorical features by replacing the strings with integers

Parameters:
  • print_results (Union[bool, int]) – If False, no data is printed to the console. If True, all data is printed to the console. If an integer n, only the data for n features is printed to the console.
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes before encoding the features e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • _reference (Union[bool, str]) – The reference dataframe which is used when applying functions to other dataframes. The default value is the first dataframe inside the dataframes_dict dictionary. Usually it is the train dataframe.
Returns:

  • dataframes_dict_encoded - A dictionary that contains Pandas dataframes after encoding the features e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
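
Example (a sketch; print_results=10 prints the data for 10 features):
>>> dataframes_dict_encoded, columns_set = flow.encode_categorical_feature(dataframes_dict, print_results=10)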

static exploring_data(dataframe_dict: dict, key_i: str)

Datasets explorer

This function explores a given dataset by showing information about the most and the least repeated values, the number of unique values and the distribution of each feature. It is an interactive function and was therefore designed to run in a Jupyter notebook.

Parameters:
  • dataframe_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • key_i (str) – It points to the dataset which the user wants to explore e.g. “train”.
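
Example (a sketch; it assumes a "train" dataset exists in dataframes_dict and the code runs in a Jupyter notebook):
>>> flow.exploring_data(dataframes_dict, "train")
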
features_encoding(encoding_type: str, dataframe_dict: dict, reference: str, ignore_columns: list, class_number_range=[3, 50], target_name: str = None)

The encoder

This function encodes the categorical features using different encoding methods. It assumes that the user has already encoded the categorical features that contained string values by replacing those string values with integers.

Parameters:
  • encoding_type (str) –

    The type of the encoding method that will be applied. For example: one-hot, target

    For more information please check the following reference:

    https://contrib.scikit-learn.org/categorical-encoding/index.html

  • dataframe_dict (dict) – A dictionary that contains Pandas dataframes before the encoding features e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}.
  • reference (str) – The key in the dataframes dictionary that points to the dataframe whose inputs are taken as a reference to encode the data of the other dataframes e.g. “train”.
  • ignore_columns (list) – It is a list of strings that contains the name of the columns which should be ignored when applying the encoding method.
  • class_number_range (list) – A list of two elements which define the minimum and maximum number of classes (unique values) that a feature should contain in order to apply the one-hot encoding to this feature.
  • target_name (str) – The name of the column that contains the labels that should be predicted by the model. If the encoding method doesn’t require that target, it can be ignored.
Returns:

  • dataframe_dict_encoded - A dictionary that contains Pandas dataframes after the encoding features e.g. dataframe_dict_encoded={“train”: train_dataframe, “test”: test_dataframe}.
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
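
Example (a sketch; one-hot encoding does not require target_name, and the column names are placeholders):
>>> dataframe_dict_encoded, columns_set = flow.features_encoding("one-hot", dataframes_dict, "train", ["id", "target"], class_number_range=[3, 50])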

flatten_json_data(dataframes_dict: dict, _reference: Union[bool, str] = False)

JSON data normalizer

This function normalizes the nested JSON data inside the Pandas dataframes’ columns. Each new column gets the name of its parent column plus a predefined suffix to ensure unique column names.

Parameters:
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes with nested JSON data type e.g. dataframes_dict={ “train”: train_dataframe, “test”: test_dataframe}
  • _reference (Union[bool, str]) – The reference dataframe which is used when applying functions to other dataframes. The default value is the first dataframe inside the dataframes_dict dictionary. Usually it is the train dataframe.
Returns:

  • dataframes_dict - A dictionary that contains Pandas dataframes after flatting the nested JSON data type e.g. dataframes_dict={ “train”: train_dataframe, “test”: test_dataframe}
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
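
Example (a sketch; it assumes some columns of the dataframes contain nested JSON data):
>>> dataframes_dict, columns_set = flow.flatten_json_data(dataframes_dict)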

guidance(step_ext: object)

YAML evaluator

This function executes the command that is written in the YAML file under a certain step.

Parameters:step_ext (object) – It can be an integer that points to a certain step e.g. 1 or a combination of an integer and a letter that points to a sub-step e.g. 1_a
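
Example (a sketch; it assumes the YAML file defines a step 1 and a sub-step 1_a):
>>> flow.guidance(1)
>>> flow.guidance("1_a")
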
load_data(path: str, files_list: list, rows_amount: int = 0)

Data reader

This function reads data from CSV files and returns a dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}

After reading the data, the function provides a summary of each dataset.

After presenting the summary, the function tries to detect which column may contain the ids and which column may be the target (labels). Based on the values of the target, the function can tell whether the problem to be solved is a regression or a classification problem.

Parameters:
  • path (str) – The path to the data.
  • files_list (list) – A list of strings which are the names of the files
  • rows_amount (int) – The number of rows that should be read from the CSV file. If 0, all rows will be read.
Returns:

  • dataframes_dict - A dictionary that contains Pandas dataframes.
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.

Example:
>>> path = "./data"
>>> files_list = ["train.csv", "test.csv"]
>>> dataframes_dict, columns_set = flow.load_data(path, files_list)
>>> # dataframes_dict is e.g. {"train": train_dataframe, "test": test_dataframe}
scale_data(dataframes_dict: dict, ignore_columns: list, _reference: Union[bool, str] = False)

Feature scaling

This function scales the features that contain continuous numeric values.

Parameters:
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes before scaling features e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • ignore_columns (list) – It contains the columns that should be ignored when applying scaling e.g. the id and the target.
  • _reference (Union[bool, str]) – The reference dataframe which is used when applying functions to other dataframes. The default value is the first dataframe inside the dataframes_dict dictionary. Usually it is the train dataframe.
Returns:

  • dataframes_dict_scaled - A dictionary that contains Pandas dataframes after scaling features e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
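
Example (a sketch; the column names are placeholders):
>>> dataframes_dict_scaled, columns_set = flow.scale_data(dataframes_dict, ignore_columns=["id", "target"])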

training(parameters: dict)

Ridge linear model

This function fits the data using the Ridge linear model. It uses the implementation from scikit-learn. The user can train a model with a specific configuration using the parameters variable.

Parameters:

parameters (dict) – A dictionary that contains information about the datasets, model type, model configurations and training configurations. Check the example below.

Returns:

  • model_index_list - A list of indexes that will be used to point to the trained models, which will be saved locally after training.
  • save_models_dir - The path where the models will be saved.
  • y_predict - A numpy array. If the predict key is given, the trained model predicts the labels of the parameters[“predict”][“test”] dataset and returns y_predict.

Example:
>>> parameters = {
>>>      "data": {
>>>          "train": {"features": train_dataframe, "target": train_target},
>>>          "valid": {"features": valid_dataframe, "target": valid_target}, # optional
>>>          "test": {"features": test_dataframe, "target": test_target}, # optional
>>>      },
>>>      "split": {
>>>          "method": "split",
>>>          "split_ratios": 0.2,
>>>      },
>>>      "model": {"type": "Ridge linear regression",
>>>                "hyperparameters": {"alpha": 1,
>>>                                    },
>>>                },
>>>      "metrics": ["r2_score", "mean_squared_error"],
>>>      "predict": { # optional
>>>          "test": {"features": test_dataframe}
>>>      }
>>>  }
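>>> # a sketch: y_predict is returned here only because the optional predict key is present
>>> model_index_list, save_models_dir, y_predict = flow.training(parameters)
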
update_data_summary(dataframes_dict: dict) → dict

Data type updater

This function updates the list of the features in the columns types dictionary. It should be used after modifying the features of a dataset manually, for example after dropping some features or after joining two datasets.
Parameters:dataframes_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={“train”: train_dataframe, “test”: test_dataframe}
Returns:
  • columns_set - A dictionary that contains the features’ names sorted in multiple lists based on the type of the data for each given dataset.
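
Example (a sketch; useful after manually dropping features from the dataframes):
>>> columns_set = flow.update_data_summary(dataframes_dict)
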
flows.utils.unify_dataframes(dataframes_dict: dict, _reference: str, ignore_columns: list) → dict

Dataframes unifier

This function ensures that all datasets have the same features after dropping highly correlated features or columns that have constant values.

Parameters:
  • dataframes_dict (dict) – A dictionary that contains Pandas dataframes e.g. dataframes_dict={ “train”: train_dataframe, “test”: test_dataframe}
  • _reference (str) – The name of the Pandas dataframe that will be used as a reference to adjust the features of other dataframes. Usually it is the train dataframe
  • ignore_columns (list) – It contains the columns that should be ignored when unifying the dataframes e.g. the id and the target.
Returns:

  • dataframes_dict - A dictionary that contains Pandas dataframes.
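
Example (a sketch; "train" is the reference dataframe and the column names are placeholders):
>>> from flows.utils import unify_dataframes
>>> dataframes_dict = unify_dataframes(dataframes_dict, "train", ["id", "target"])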