Preprocessing

class preprocessing.data_type_detector.ColumnDataFormat(dataframe: pandas.core.frame.DataFrame)

Data type detector

This class contains methods that help detect the data type of each column in the given dataframe. The currently supported types are: date, categorical features with string values, numeric features with integer values (categorical), numeric features with continuous values, and nested JSON data.

Parameters:dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset, e.g. train_dataframe.

Methods:
  • find_date_columns - Date data type finder
  • number_or_string - Numeric-string finder
  • json_detector - Valid JSON data finder
  • categorical_or_numeric - Numeric continuous-discrete finder
__init__(dataframe: pandas.core.frame.DataFrame)
Parameters:dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
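Example:

A minimal sketch of constructing the detector (the toy dataframe and variable names are illustrative):

>>> import pandas as pd
>>> train_dataframe = pd.DataFrame({"price": [10.5, 20.1, 13.7], "label": ["yes", "no", "yes"]})
>>> detector = ColumnDataFormat(train_dataframe)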
categorical_or_numeric(numbers_column_list: list, threshold: float)

Numeric continuous-discrete finder

From the columns that contain only numeric data, this function extracts the names of the columns holding discrete values and the names of those holding continuous values. The decision is based on the number of unique values in a column: if it is less than the predefined threshold, the column is considered categorical. It returns two lists of strings.

Parameters:
  • numbers_column_list (list) – A list of strings, the names of the columns that contain numeric values.
  • threshold (int) – The minimum number of unique values below which a column is considered categorical.
Returns:

  • categorical_columns - A list of the names of the columns that contain numeric discrete data.
  • numeric_columns - A list of the names of the columns that contain numeric continuous data.
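Example:

A sketch of the unique-value rule (toy data; "rating" has 2 unique values and "price" has 4, so with threshold=3 "rating" is expected in categorical_columns and "price" in numeric_columns):

>>> df = pd.DataFrame({"rating": [1, 2, 1, 2], "price": [10.5, 20.1, 13.7, 9.9]})
>>> detector = ColumnDataFormat(df)
>>> categorical_columns, numeric_columns = detector.categorical_or_numeric(["rating", "price"], threshold=3)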

find_date_columns() → list

Date data type finder

This method finds date columns automatically.

Returns:date_columns: list of columns that contain date format data
json_detector(columns_with_strings: list)

Valid JSON data finder

This method detects whether the columns that hold string data contain valid nested JSON. It returns two lists of strings.

Parameters:columns_with_strings (list) – List of the columns that contain string data
Returns:
  • string_columns - A list of the names of the columns that do not contain valid nested JSON data.
  • json_columns - A list of the names of the columns that contain valid nested JSON data.
number_or_string(date_columns: list)

Numeric-string finder

The function determines which columns in the pandas dataframe contain numeric values and which contain string values. It returns three lists of strings.

Parameters:date_columns (list) – A list of the names of the columns that contain date-format data; these columns are excluded from the search.
Returns:
  • string_columns - A list of the names of the columns that contain string data.
  • numeric_columns - A list of the names of the columns that contain numeric data.
  • other_columns - A list of the names of the columns whose data type is unknown, if any exist.
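Example:

The three finders are typically chained: dates first, then numbers vs. strings, then JSON detection inside the string columns. A sketch of that chain, assuming the return values follow the documented order:

>>> detector = ColumnDataFormat(train_dataframe)
>>> date_columns = detector.find_date_columns()
>>> string_columns, numeric_columns, other_columns = detector.number_or_string(date_columns)
>>> string_columns, json_columns = detector.json_detector(string_columns)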
preprocessing.data_type_detector.detect_column_types(dataframe: pandas.core.frame.DataFrame, threshold: int = 50)

Features’ types detector

This function applies the methods defined in the ColumnDataFormat class to detect the data format of each column.

Parameters:
  • dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
  • threshold (int) – The minimum number of unique values below which a column is considered categorical. The default value is 50. This becomes important when applying one-hot encoding.
Returns:

  • number_of_columns - An integer, the total number of features. It is used for validation purposes.
  • columns_types_list - A list of lists:
  • string_columns - A list of strings, the columns that contain categorical data with string labels, e.g. Yes, No, Maybe.
  • categorical_integer - A list of strings, the columns that contain categorical data with numeric labels, e.g. 0, 1, 2.
  • numeric_columns - A list of strings, the columns that contain numeric continuous values, e.g. floats like 0.1, 0.2, or numeric categorical data with a large number of labels (larger than the threshold).
  • date_columns - A list of strings, the columns that contain date-format data, e.g. 2015-01-05.
  • other_columns - A list of strings, the columns that contain some other data type (not implemented yet).
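Example:

A sketch of applying the detector to a single dataframe, assuming the two documented return values and the documented order of the nested lists:

>>> number_of_columns, columns_types_list = detect_column_types(train_dataframe, threshold=50)
>>> string_columns, categorical_integer, numeric_columns, date_columns, other_columns = columns_types_list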

preprocessing.data_type_detector.detect_columns_types_summary(dataframes_dict: dict, threshold: int = 50) → dict

Data type summarizer

This function summarizes the findings of applying the detect_column_types function to each given dataset.

Parameters:
  • dataframes_dict (dict) – a dictionary of pandas dataframes e.g. {“train”: train_dataframe, “test”: test_dataframe}
  • threshold (int) – The maximum number of categories that a categorical feature may have before it is considered a continuous numeric feature.
Returns:

columns_types_dict: A dictionary that contains the lists of columns grouped by the type of data they contain.
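Example:

A sketch over a dictionary of dataframes (the keys are illustrative):

>>> dataframes_dict = {"train": train_dataframe, "test": test_dataframe}
>>> columns_types_dict = detect_columns_types_summary(dataframes_dict, threshold=50)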

preprocessing.data_transformer.standard_scale_numeric_features(dataframe_dict: dict, reference_dataframe_key: str, columns_to_normalize: list, handle_missing_values: bool = True) → dict

Feature standardizer

This function standardizes the datasets passed as pandas dataframes inside a dictionary. The reference dataframe is used to calculate the statistical properties (mean and standard deviation), which are then used to normalize the other dataframes. After standardization, the features in each dataset have zero mean and unit standard deviation.

Parameters:
  • dataframe_dict (dict) – A dictionary that contains multiple pandas dataframes
  • reference_dataframe_key (str) – The key of the dataframe that is used to fit the scaler, e.g. "train".
  • columns_to_normalize (list) – A list of the columns that should be standardized e.g. [col_1, col_2, …, col_n]
  • handle_missing_values (bool) – If True, missing values will be replaced with the value 0.
Returns:

scaled_dataframe_dict: A dictionary of pandas dataframes where the “columns_to_normalize” are normalized

Return type:

dict
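Example:

A sketch of fitting the scaler on the train set and normalizing every dataframe in the dictionary (the keys and column names are illustrative):

>>> dataframe_dict = {"train": train_dataframe, "test": test_dataframe}
>>> scaled_dataframe_dict = standard_scale_numeric_features(dataframe_dict, "train", ["price", "age"])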

preprocessing.data_transformer.encoding_categorical_feature(dataset_dict: dict, feature_name: str, print_results: Union[bool, int] = True, print_counter: int = 0) → dict

Single categorical feature string encoder

This function encodes categorical features. It is possible to use the train data alone, or all of the train, validation, and test data. If all datasets are provided (i.e. train, valid, and test), they will be concatenated first and then encoded.

Parameters:
  • dataset_dict (dict) – A dictionary of pandas series (i.e. one column) that must contain the train data and optionally the validation and test data.
  • feature_name (str) – The name of the feature/column whose values should be encoded.
  • print_results (Union[bool, int]) – If False, no data is printed to the console. If True, all data is printed to the console. If an integer n, only the data for n features is printed to the console.
  • print_counter (int) – If print_results is an integer, the print counter controls printing to the console based on the print_results value.
Returns:

dataset_dict_encoded: A dictionary of pandas series (i.e. one column) after encoding.
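Example:

A sketch of encoding a single feature passed as a dictionary of pandas series (the keys and the column name "color" are illustrative):

>>> dataset_dict = {"train": train_dataframe["color"], "valid": valid_dataframe["color"]}
>>> dataset_dict_encoded = encoding_categorical_feature(dataset_dict, "color", print_results=True)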

preprocessing.data_transformer.encode_categorical_features(dataframe_dict: dict, columns_list: list, print_results: Union[bool, int] = True) → dict

Categorical features string encoder

This function applies the encoding_categorical_feature function to each feature in the columns_list.

Parameters:
  • dataframe_dict (dict) – A dictionary of pandas dataframes.
  • columns_list (list) – A list of the names of the columns/features whose values should be encoded.
  • print_results (Union[bool, int]) – If False, no data is printed to the console. If True, all data is printed to the console. If an integer n, only the data for n features is printed to the console.
Returns:

dataframe_dict: a dictionary of Pandas dataframes after encoding.
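Example:

A sketch of encoding several features at once (the column names are illustrative; print_results=2 prints the data for two features only):

>>> dataframe_dict = encode_categorical_features(dataframe_dict, ["color", "city"], print_results=2)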

preprocessing.data_clean.drop_corr_columns(dataframe: pandas.core.frame.DataFrame, drop_columns: bool = True, print_columns: bool = True, threshold: float = 0.98) → pandas.core.frame.DataFrame

Correlated columns eliminator

The function drops correlated columns, keeping only one column of each correlated group. Removing highly correlated columns usually improves the model's quality. The function first prints the list of the most correlated columns and then removes those above the threshold. For more information, please refer to the pandas.DataFrame.corr description: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

Parameters:
  • dataframe (pd.DataFrame) – Pandas dataframe which contains the dataset e.g. train_dataframe.
  • drop_columns (bool) – If true, all correlated columns will be dropped but one.
  • print_columns (bool) – If True, information about the correlated columns will be printed to the console.
  • threshold (float) – A value between 0 and 1. If the correlation between two columns is larger than this value, they are considered highly correlated. If drop_columns is True, one of those columns will be dropped. The recommended threshold value lies in [0.7 … 1].
Returns:

dataframe: A pandas dataframe which contains the dataset after dropping the correlated columns if drop_columns = True. Otherwise, the same input dataframe will be returned.

Example:

For checking correlated columns:

>>> dataframe = drop_corr_columns(dataframe, drop_columns=False, print_columns=True, threshold=0.85)
preprocessing.data_clean.drop_const_columns(dataframe: pandas.core.frame.DataFrame, drop_columns: bool = True, print_columns: bool = True) → pandas.core.frame.DataFrame

Constant value columns eliminator

This function drops columns that contain constant values. Removing constant columns usually improves the model's quality. The function first prints the list of constant columns and then drops them.

Parameters:
  • dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset, e.g. train_dataframe.
  • drop_columns (bool) – If True, the columns that contain constant values along all the rows will be dropped.
  • print_columns (bool) – If True, information about the columns that contain constant values will be printed to the console.
Returns:

dataframe: A pandas dataframe that contains the dataset after dropping the columns that contain constant values if drop_columns = True.

Example:

For checking the columns which have constant value:

>>> dataframe = drop_const_columns(dataframe, drop_columns=False, print_columns=True)
preprocessing.data_explorer.print_repeated_values(series_data: pandas.core.series.Series)

Repeated values displayer

This function prints the results of value_counts to the console, showing the head or the tail of the value counts.
Parameters:series_data (pd.Series) – The values of one of the features in the given dataset.
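Example:

A minimal sketch (the column name is illustrative):

>>> print_repeated_values(train_dataframe["color"])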
preprocessing.data_explorer.explore_data(dataframe)

Interactive data explorer

This function should be run in a Jupyter notebook. The user can step through the features interactively using a slider.

Parameters:dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe.
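Example:

Inside a Jupyter notebook cell:

>>> explore_data(train_dataframe)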
class preprocessing.data_explorer.ExploreData(dataframe: pandas.core.frame.DataFrame)

Data explorer

This class has the data_explore method, which can be used to explore the data in each column of the dataset.

Parameters:dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
__init__(dataframe: pandas.core.frame.DataFrame)
Parameters:dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
data_explore(column_i: str)

Feature explorer

This method displays a summary of the given feature, including missing values and the most and least repeated values. For numeric data, it also shows a histogram.

Parameters:column_i (str) – The name of the feature that the user is interested in exploring.
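Example:

A sketch of exploring a single feature (the column name is illustrative):

>>> explorer = ExploreData(train_dataframe)
>>> explorer.data_explore("price")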
preprocessing.json_preprocessor.extract_json_from_list_dict(row_i: Union[list, dict], object_nr: int) → dict

Valid JSON object extractor from list or dict

The function extracts the valid JSON data directly from the list or the dict.

Parameters:
  • row_i (Union[list, dict]) – The content of the row at index i in a pandas series.
  • object_nr (int) – If multiple valid JSON objects are detected, the object at index object_nr will be returned.
Returns:

valid_json_object: the valid JSON object at index object_nr

Return type:

dict
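Example:

A sketch with a toy row that holds a list of two JSON objects (object_nr is assumed to be a 0-based index):

>>> valid_json_object = extract_json_from_list_dict([{"a": 1}, {"b": 2}], object_nr=1)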

preprocessing.json_preprocessor.extract_json_objects(raw_string_data: str, start_json_object: list, end_json_object: list, object_nr: int)

Valid JSON object extractor

The function extracts valid JSON objects from text. These are objects that could not be extracted using json.load.

Parameters:
  • raw_string_data (str) – The string object that could contain valid JSON objects
  • start_json_object (list) – A list of integers that point to the char "{", each of which can mark the start of a JSON object.
  • end_json_object (list) – A list of integers that point to the char "}", each of which can mark the end of a JSON object.
  • object_nr (int) – The index of the object that should be extracted if there are multiple valid JSON objects.
Returns:

valid_json_object: the valid JSON object at index object_nr

Return type:

dict
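Example:

A sketch, assuming the index lists are built by scanning the string for "{" and "}" (as normalize_feature does) and that object_nr is a 0-based index:

>>> raw_string_data = 'id: 7 {"a": 1}'
>>> start_json_object = [i for i, c in enumerate(raw_string_data) if c == "{"]
>>> end_json_object = [i for i, c in enumerate(raw_string_data) if c == "}"]
>>> valid_json_object = extract_json_objects(raw_string_data, start_json_object, end_json_object, object_nr=0)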

preprocessing.json_preprocessor.normalize_feature(string_data: str, object_nr: int)

JSON data searcher

This function searches for possible valid JSON data inside the given text. It identifies the possible JSON objects by defining their edges using “{” and “}”. It passes each defined object to the “extract_json_objects” function to extract the valid JSON objects. The returned valid objects will be normalized and returned as a Pandas dataframe.

Parameters:
  • string_data (str) – The string that could contain valid JSON objects.
  • object_nr (int) – If multiple valid JSON objects are detected, the object at index object_nr will be returned.
Returns:

A pandas dataframe with n columns and one row.

Return type:

pandas.DataFrame
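Example:

A sketch with a toy string containing one nested JSON object (one row; one column per flattened key is expected):

>>> normalized_row = normalize_feature('some text {"a": 1, "b": {"c": 2}}', object_nr=0)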

preprocessing.json_preprocessor.apply_normalize_feature(dataseries: pandas.core.series.Series, keys_amount: int)

JSON-dataframe converter

This function applies the normalize_feature function to each row of the given data series.

Parameters:
  • dataseries (pd.Series) – the feature that contains possible JSON objects
  • keys_amount (int) – The possible number of keys or parent JSON objects that a row may contain.
Returns:

A list of dataframes. Each element of this list represents the normalized JSON object of the corresponding row of the dataseries.

Return type:

list
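Example:

A sketch over a toy series of JSON strings:

>>> import pandas as pd
>>> dataseries = pd.Series(['{"a": 1}', '{"a": 2, "b": 3}'])
>>> dataframe_list = apply_normalize_feature(dataseries, keys_amount=2)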

preprocessing.json_preprocessor.column_validation(dataframe: pandas.core.frame.DataFrame, parent_columns: list, feature: str)

Column name validator

The function ensures that the dataframe does not have two features with the same name. It renames the columns created by normalizing the JSON object based on the name of the parent feature.

Parameters:
  • dataframe (pd.DataFrame) – the normalized JSON objects that were found in the given feature.
  • parent_columns (list) – A list of the names of the features/columns of the main dataset.
  • feature (str) – The name of the feature that contains the JSON objects
Returns:

Pandas dataframe with valid names for the columns

Return type:

pd.DataFrame
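Example:

A call-only sketch (the variable names and the feature name "payload" are illustrative):

>>> validated = column_validation(normalized_json_dataframe, parent_columns=list(train_dataframe.columns), feature="payload")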

preprocessing.json_preprocessor.combine_new_data_to_original(dataframe: pandas.core.frame.DataFrame, dataframe_list: list, feature: str)

Dataframes binder

The function concatenates the original dataframe and the newly created dataframes together.

Parameters:
  • dataframe (pd.DataFrame) – the original dataframe / dataset
  • dataframe_list (list) – list of the dataframes that are created from normalizing the JSON objects in each row of the given feature
  • feature (str) – The name of the feature that contains JSON objects
Returns:

A pandas dataframe that contains both the original and the newly created datasets. The original feature will be deleted.

Return type:

pd.DataFrame
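Example:

A call-only sketch (the variable names and the feature name "payload" are illustrative; dataframe_list is the output of apply_normalize_feature):

>>> combined = combine_new_data_to_original(train_dataframe, dataframe_list, feature="payload")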

preprocessing.json_preprocessor.feature_with_json_detector(dataseries: pandas.core.series.Series)

JSON detector

This function detects whether a feature in the dataset could contain valid JSON objects.

Parameters:dataseries (pd.Series) – The feature’s values that should be tested for possible valid JSON objects
Returns:True if there are JSON object candidates, False otherwise.
Return type:bool
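Example:

A sketch with toy series (the expected results are hedged in the comments):

>>> import pandas as pd
>>> feature_with_json_detector(pd.Series(['{"a": 1}', '{"b": 2}']))  # expected: True
>>> feature_with_json_detector(pd.Series(["red", "blue"]))  # expected: False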
preprocessing.json_preprocessor.combine_columns(dataframes_dict, feature)

Column combiner

To avoid generating different numbers of columns for different datasets, this function combines the datasets into one large dataframe.

Parameters:
  • dataframes_dict (dict) – A dictionary of pandas dataframes.
  • feature (str) – The name of the feature that contains JSON objects.

preprocessing.json_preprocessor.flat_json(dataframes_dict, json_columns, keys_amount=10)

JSON flattener

This function flattens the valid nested JSON data found in the given JSON columns of each dataframe.

Parameters:
  • dataframes_dict (dict) – A dictionary of pandas dataframes.
  • json_columns (list) – A list of the names of the columns that contain valid nested JSON data.
  • keys_amount (int) – The possible number of keys or parent JSON objects that a row may contain. The default value is 10.

preprocessing.utils.read_data(path: str, files_list: list, rows_amount: int = 0) → dict

CSV file reader

This function reads the CSV files whose names are listed in files_list.

Parameters:
  • path (str) – It points to the directory where the data is stored.
  • files_list (list) – A list of strings which are the names of the files.
  • rows_amount (int) – The number of rows that should be read from the CSV file. If 0, all rows will be read.
Returns:

dataframes_dictionary: A dictionary that contains pandas dataframes. The keys are the names of the files without the csv extension and the values are the associated dataframes.

Raises:
  • ValueError - If rows_amount has an invalid value.
Example:

Reading two CSV files from the data directory (the resulting dictionary has the form {"train": train_dataframe, "test": test_dataframe}):

>>> path = "./data"
>>> files_list = ["train.csv", "test.csv"]
>>> dataframes_dictionary = read_data(path, files_list)