Feature Engineering

feature_engineering.feature_generator.valid_features_detector(dataframe: dict, categorical_features: list, class_number_range: list) → list

Feature validator

The function checks whether one-hot encoding should be applied to each of the given features.

Parameters:
  • dataframe (dict) – A pandas dataframe that contains the dataset.
  • categorical_features (list) – A list of strings naming the columns (features) that contain categorical data.
  • class_number_range (list) – A list of two elements that define the minimum and the maximum number of classes (unique values) a feature must contain for one-hot encoding to be applied to it.
Returns:

valid_features: A list of the features which the encoding will be applied to.

Return type:

list
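
The check described above can be sketched as follows. This is a minimal illustration and not the library's own code: the name sketch_valid_features is hypothetical, the sample data is made up, and both bounds of class_number_range are assumed to be inclusive.

```python
import pandas as pd


def sketch_valid_features(dataframe, categorical_features, class_number_range):
    """Hypothetical re-implementation for illustration only."""
    minimum, maximum = class_number_range
    valid_features = []
    for feature in categorical_features:
        # Count the distinct non-null values (the "classes") of the column.
        class_count = dataframe[feature].nunique()
        # Assumption: both ends of the range are inclusive.
        if minimum <= class_count <= maximum:
            valid_features.append(feature)
    return valid_features


# Made-up data: "color" has 3 classes, "user_id" has 4.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "user_id": ["u1", "u2", "u3", "u4"],
})
print(sketch_valid_features(df, ["color", "user_id"], [0, 3]))  # ['color']
```

A range filter like this keeps high-cardinality columns (identifiers, free text) out of one-hot encoding, where every distinct value would otherwise become its own column.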

feature_engineering.feature_generator.encoding_features(encoding_type: str, dataframes_dict: dict, reference: str, categorical_features: list, ignore_columns: list, class_number_range: list = [0, 50], target_name: str = None) → dict

One-hot encoder

The function applies the encoding selected by encoding_type (for example, one-hot) to the categorical features, using the scikit-learn framework implementation.

Parameters:
  • encoding_type (str) –

    The encoding method to apply, for example one-hot or target.

    For more information, see the following reference:

    https://contrib.scikit-learn.org/categorical-encoding/index.html

  • dataframes_dict (dict) – A dictionary that contains the dataframes before the encoding is applied, e.g. dataframes_dict={ ‘train’: train_dataframe, ‘test’: test_dataframe }
  • reference (str) – The name of the dataframe that is used as the reference when validating the data types.
  • categorical_features (list) – A list of strings naming the columns (features) that contain categorical data.
  • class_number_range (list) – A list of two integers giving the minimum and the maximum number of labels/classes/categories. If the number of categories of a feature is outside this range, one-hot encoding is not applied to that feature. It can be ignored if the encoding type is not one-hot.
  • ignore_columns (list) – A list of column names. The encoding is not applied to those columns.
  • target_name (str) – The name of the column that contains the labels to be predicted by the model. It can be ignored if the encoding method does not require the target.
Returns:

dataframes_dict_encoded: A dictionary that contains the dataframes after feature encoding is applied, e.g. dataframes_dict_encoded={ ‘train’: train_dataframe, ‘test’: test_dataframe }

Return type:

dict
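
The one-hot case of the behaviour documented above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it swaps in pandas.get_dummies in place of the scikit-learn based encoder, the name sketch_one_hot and the sample data are hypothetical, the range bounds are assumed to be inclusive, and every dataframe is assumed to be aligned to the columns of the reference dataframe.

```python
import pandas as pd


def sketch_one_hot(dataframes_dict, reference, categorical_features,
                   ignore_columns, class_number_range=(0, 50)):
    """Hypothetical stand-in built on pandas.get_dummies."""
    minimum, maximum = class_number_range
    reference_df = dataframes_dict[reference]
    # Encode only features that are not ignored and whose class count,
    # measured on the reference dataframe, lies inside the range.
    to_encode = [
        name for name in categorical_features
        if name not in ignore_columns
        and minimum <= reference_df[name].nunique() <= maximum
    ]
    if not to_encode:
        # Nothing to encode: return copies so the inputs stay untouched.
        return {key: df.copy() for key, df in dataframes_dict.items()}
    encoded = {
        key: pd.get_dummies(df, columns=to_encode)
        for key, df in dataframes_dict.items()
    }
    # Align every dataframe to the reference columns: categories missing
    # from a dataframe become all-zero columns, unseen ones are dropped.
    reference_columns = encoded[reference].columns
    return {
        key: df.reindex(columns=reference_columns, fill_value=0)
        for key, df in encoded.items()
    }


# Made-up data for illustration.
train_dataframe = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test_dataframe = pd.DataFrame({"color": ["red", "red"], "size": [4, 5]})
dataframes_dict_encoded = sketch_one_hot(
    {"train": train_dataframe, "test": test_dataframe},
    reference="train",
    categorical_features=["color"],
    ignore_columns=[],
)
print(list(dataframes_dict_encoded["test"].columns))
```

Aligning to the reference dataframe gives the train and test dataframes the same columns after encoding, even when some categories appear in only one of them.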