Feature Engineering
- feature_engineering.feature_generator.valid_features_detector(dataframe: dict, categorical_features: list, class_number_range: list) → list
Feature validator
The function checks whether one-hot encoding should be applied to each of the given features.
Parameters: - dataframe (dict) – A pandas dataframe that contains the dataset.
- categorical_features (list) – A list of strings naming the columns (features) that hold categorical data.
- class_number_range (list) – A list of two elements that define the minimum and the maximum number of classes (unique values) a feature must contain for one-hot encoding to be applied to it.
Returns: valid_features: A list of the features to which the encoding will be applied.
Return type: list
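The check described above can be sketched as follows. This is only an illustration, not the library's implementation: the helper name detect_valid_features is hypothetical, a plain dict of column lists stands in for the pandas dataframe so the sketch has no dependencies, and treating both bounds of class_number_range as inclusive is an assumption that the description does not confirm.

```python
def detect_valid_features(columns, categorical_features, class_number_range):
    """Return the categorical features whose unique-value count is in range.

    `columns` is a hypothetical stand-in for the dataframe: a dict that maps
    each column name to a list of values.
    """
    min_classes, max_classes = class_number_range
    valid_features = []
    for feature in categorical_features:
        # With a pandas dataframe this count would be dataframe[feature].nunique().
        n_classes = len(set(columns[feature]))
        # Assumption: both bounds of the range are inclusive.
        if min_classes <= n_classes <= max_classes:
            valid_features.append(feature)
    return valid_features


columns = {
    "color": ["red", "green", "red", "blue"],   # 3 unique values
    "user_id": ["u1", "u2", "u3", "u4"],        # 4 unique values
    "size": ["S", "M", "S", "M"],               # 2 unique values
}
valid = detect_valid_features(columns, ["color", "user_id", "size"], [2, 3])
print(valid)  # ['color', 'size']
```

Here user_id is rejected because its four unique values exceed the upper bound of 3, which is the typical reason for such a range: high-cardinality columns would otherwise produce one new column per value.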
- feature_engineering.feature_generator.encoding_features(encoding_type: str, dataframes_dict: dict, reference: str, categorical_features: list, ignore_columns: list, class_number_range: list = [0, 50], target_name: str = None) → dict
One-hot encoder
The function applies the selected encoding (for example, one-hot) to the categorical features using the scikit-learn framework implementation.
Parameters: - encoding_type (str) – The type of encoding method to apply, for example one-hot or target. For more information, see the following reference:
https://contrib.scikit-learn.org/categorical-encoding/index.html
- dataframes_dict (dict) – A dictionary that contains the dataframes before the encoding is applied, e.g. dataframes_dict={'train': train_dataframe, 'test': test_dataframe}
- reference (str) – The name of the dataframe that is used as the reference when validating the data types.
- categorical_features (list) – A list of strings naming the columns (features) that hold categorical data.
- class_number_range (list) – A list of two integers that give the minimum and the maximum number of labels/classes/categories. If the number of categories of a feature is outside this range, one-hot encoding is not applied to that feature. It can be ignored if the encoding type is not one-hot.
- ignore_columns (list) – A list of strings naming the columns to which the encoding will not be applied.
- target_name (str) – The name of the column that contains the labels that the model should predict. If the encoding method does not require a target, it can be ignored.
Returns: dataframes_dict_encoded: A dictionary that contains the dataframes after the feature encoding is applied, e.g. dataframes_dict_encoded={'train': train_dataframe, 'test': test_dataframe}
Return type: dict
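To make the roles of reference, ignore_columns and class_number_range more concrete, here is a minimal sketch of one-hot encoding several datasets against one reference dataset. It is not the library's code: the function name one_hot_encode is hypothetical, plain dicts of column lists stand in for pandas dataframes, and two behaviours are assumptions rather than documented facts: the categories are learned only from the reference dataset, and a value that the reference never contained is encoded as all zeros.

```python
def one_hot_encode(frames, reference, categorical_features, ignore_columns,
                   class_number_range=(0, 50)):
    """One-hot encode every frame using categories from the reference frame.

    `frames` is a hypothetical stand-in for dataframes_dict: it maps a dataset
    name to a dict of column name -> list of values.
    """
    min_classes, max_classes = class_number_range
    reference_frame = frames[reference]

    # Learn the categories of each eligible feature from the reference only.
    categories = {}
    for feature in categorical_features:
        if feature in ignore_columns:
            continue
        values = sorted(set(reference_frame[feature]))
        # Assumption: the range bounds are inclusive.
        if min_classes <= len(values) <= max_classes:
            categories[feature] = values

    encoded = {}
    for name, frame in frames.items():
        new_frame = {}
        for column, data in frame.items():
            if column not in categories:
                # Ignored, out-of-range and non-categorical columns pass through.
                new_frame[column] = list(data)
                continue
            # One indicator column per category seen in the reference.
            for category in categories[column]:
                new_frame[f"{column}_{category}"] = [int(v == category) for v in data]
        encoded[name] = new_frame
    return encoded


frames = {
    "train": {"color": ["red", "blue", "red"], "id": ["a", "b", "c"]},
    "test": {"color": ["blue", "green"], "id": ["d", "e"]},
}
encoded = one_hot_encode(frames, "train", ["color", "id"], ignore_columns=["id"])
print(encoded["train"])
print(encoded["test"])
```

Because the categories come from the reference dataset alone, every returned dataset ends up with the same columns, which a model trained on one and evaluated on another generally requires.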