Preprocessing¶
- class preprocessing.data_type_detector.ColumnDataFormat(dataframe: pandas.core.frame.DataFrame)¶ Data type detector
This class contains methods that help to detect the type of the data in each column of the given dataframe. The supported data types for now are: date, categorical features with string values, numeric features with integer values (categorical), numeric features with continuous values, and nested JSON data.
Parameters: dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe.
Methods: - find_date_columns - Date data type finder
- number_or_string - Numeric-string finder
- json_detector - Valid JSON data finder
- categorical_or_numeric - Numeric continuous-discrete finder
- __init__(dataframe: pandas.core.frame.DataFrame)¶ Parameters: dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
- categorical_or_numeric(numbers_column_list: list, threshold: float)¶ Numeric continuous-discrete finder
From the columns that contain only numeric data, this function extracts the names of the columns that hold discrete values and the names of the columns that hold continuous values. The decision is based on the number of unique values in each column: if it is below the pre-defined threshold, the column is considered categorical. It returns two lists of strings.
Parameters: - numbers_column_list (list) – A list of strings which are the names of the columns that contain numeric values.
- threshold (int) – The minimum number of unique values below which the column type will be considered categorical.
Returns: - categorical_columns - A list of the names of the columns that contain numeric discrete data.
- numeric_columns - A list of the names of the columns that contain numeric continuous data.
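The unique-value heuristic can be sketched in plain pandas as follows (`split_by_cardinality` and the sample data are illustrative, not the library's actual implementation):

```python
import pandas as pd

def split_by_cardinality(dataframe: pd.DataFrame, numbers_column_list: list, threshold: int):
    """Split numeric columns into discrete (categorical) and continuous,
    mirroring the unique-value heuristic described above."""
    categorical_columns = []
    numeric_columns = []
    for column in numbers_column_list:
        # Few distinct values -> treat as categorical; otherwise continuous.
        if dataframe[column].nunique() < threshold:
            categorical_columns.append(column)
        else:
            numeric_columns.append(column)
    return categorical_columns, numeric_columns

df = pd.DataFrame({
    "label": [0, 1, 0, 1, 0, 1],               # 2 unique values -> categorical
    "price": [1.2, 3.4, 5.6, 7.8, 9.1, 2.3],   # 6 unique values -> continuous
})
cat_cols, num_cols = split_by_cardinality(df, ["label", "price"], threshold=5)
```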
- find_date_columns() → list¶ Date data type finder
This method finds date columns automatically.
Returns: date_columns: A list of the columns that contain date format data
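One plausible way to detect date columns automatically is to try parsing each string column with pd.to_datetime; this is a sketch of that idea, not the library's actual logic (`guess_date_columns` and the sample data are illustrative):

```python
import pandas as pd

def guess_date_columns(dataframe: pd.DataFrame) -> list:
    """Heuristically collect columns whose string values all parse as dates."""
    date_columns = []
    for column in dataframe.columns:
        if dataframe[column].dtype == object:
            # errors="coerce" turns unparseable values into NaT instead of raising.
            parsed = pd.to_datetime(dataframe[column], errors="coerce")
            if parsed.notna().all():
                date_columns.append(column)
    return date_columns

df = pd.DataFrame({
    "signup": ["2015-01-05", "2016-02-10"],
    "city": ["Berlin", "Paris"],
})
found = guess_date_columns(df)
```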
- json_detector(columns_with_strings: list)¶ Valid JSON data finder
This method detects whether there is valid nested JSON data inside the columns that contain string data. It returns two lists of strings.
Parameters: columns_with_strings (list) – A list of the columns that contain string data
Returns: - string_columns - A list of the names of the columns that don't have valid nested JSON data.
- json_columns - A list of the names of the columns that have valid nested JSON data.
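One way to implement this check is to attempt json.loads on every value of each string column; the names below (`split_json_columns`, the sample data) are illustrative, not the library's own:

```python
import json
import pandas as pd

def split_json_columns(dataframe: pd.DataFrame, columns_with_strings: list):
    """Separate string columns into those holding valid JSON objects and the rest."""
    json_columns, string_columns = [], []
    for column in columns_with_strings:
        try:
            # Raises if any value in the column is not valid JSON.
            dataframe[column].apply(json.loads)
            json_columns.append(column)
        except (ValueError, TypeError):
            string_columns.append(column)
    return string_columns, json_columns

df = pd.DataFrame({
    "payload": ['{"a": 1}', '{"b": 2}'],
    "color": ["red", "blue"],
})
str_cols, js_cols = split_json_columns(df, ["payload", "color"])
```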
- number_or_string(date_columns: list)¶ Numeric-string finder
This function determines which columns in the pandas dataframe contain numeric values and which contain string values. It returns three lists of strings.
Parameters: date_columns (list) – The names of the columns that contain date format data, so that they can be excluded from the search.
Returns: - string_columns - A list of the names of the columns that contain string data.
- numeric_columns - A list of the names of the columns that contain numeric data.
- other_columns - A list of the names of the columns that contain an unknown data type, if any exist
- preprocessing.data_type_detector.detect_column_types(dataframe: pandas.core.frame.DataFrame, threshold: int = 50)¶ Features’ types detector
This function applies the methods defined in the ColumnDataFormat class to detect the data format in each column.
Parameters: - dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
- threshold (int) – The minimum number of unique values below which the column type will be considered categorical. The default value is 50. This becomes very important when applying one-hot encoding.
Returns: - number_of_columns - An integer which is the total number of features. It is used for validation purposes.
- columns_types_list - A list of lists:
- string_columns - A list of strings which are the columns that contain categorical data with string labels e.g. Yes, No, Maybe.
- categorical_integer - A list of strings which are the columns that contain categorical data with numeric labels e.g. 0, 1, 2
- numeric_columns - A list of strings which are the columns that contain numeric continuous values, e.g. floats like 0.1, 0.2, or numeric categorical data with a large number of labels (larger than the threshold).
- date_columns - A list of strings which are the columns that contain date format data e.g. 2015-01-05
- other_columns - A list of strings which are the columns that contain some other data type (not implemented yet)
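A rough sketch of the dispatch described above in plain pandas (`detect_types_sketch` is illustrative and omits the date, JSON, and other-type branches):

```python
import pandas as pd

def detect_types_sketch(dataframe: pd.DataFrame, threshold: int = 50):
    """Rough re-implementation of the column-type dispatch described above
    (dates and JSON columns are omitted for brevity)."""
    string_columns, categorical_integer, numeric_columns = [], [], []
    for column in dataframe.columns:
        series = dataframe[column]
        if pd.api.types.is_numeric_dtype(series):
            if series.nunique() < threshold:
                categorical_integer.append(column)   # few unique numbers -> categorical
            else:
                numeric_columns.append(column)       # many unique numbers -> continuous
        else:
            string_columns.append(column)            # non-numeric -> string labels
    return string_columns, categorical_integer, numeric_columns

df = pd.DataFrame({
    "answer": ["Yes", "No", "Maybe", "Yes"],
    "label": [0, 1, 0, 1],
    "score": [0.1, 0.2, 0.3, 0.4],
})
types = detect_types_sketch(df, threshold=3)
```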
- preprocessing.data_type_detector.detect_columns_types_summary(dataframes_dict: dict, threshold: int = 50) → dict¶ Data type summarizer
This function summarizes the findings after applying the detect_column_types function to each given dataset.
Parameters: - dataframes_dict (dict) – a dictionary of pandas dataframes e.g. {“train”: train_dataframe, “test”: test_dataframe}
- threshold (int) – The maximum number of categories that a categorical feature should have before it is considered a continuous numeric feature.
Returns: columns_types_dict: A dictionary that contains the lists of the columns filtered based on the type of the data that they contain.
- preprocessing.data_transformer.standard_scale_numeric_features(dataframe_dict: dict, reference_dataframe_key: str, columns_to_normalize: list, handle_missing_values: bool = True) → dict¶ Feature standardizer
This function standardizes the datasets passed as pandas dataframes in a dictionary. The reference dataframe is used to calculate the statistical properties (mean and standard deviation) that are then used to normalize the other dataframes. After standardization, the features in each dataset have 0 mean and 1 standard deviation.
Parameters: - dataframe_dict (dict) – A dictionary that contains multiple pandas dataframes
- reference_dataframe_key (str) – A string that is used to fit the scaler e.g. “train”
- columns_to_normalize (list) – A list of the columns that should be standardized e.g. [col_1, col_2, …, col_n]
- handle_missing_values (bool) – If True, missing values will be replaced by the value 0
Returns: scaled_dataframe_dict: A dictionary of pandas dataframes where the “columns_to_normalize” are normalized
Return type: dict
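The reference-based scaling can be sketched with a plain mean/std implementation (`standard_scale_sketch` and the sample frames are illustrative, not the library's code):

```python
import pandas as pd

def standard_scale_sketch(dataframe_dict: dict, reference_key: str, columns: list) -> dict:
    """Fit mean/std on the reference dataframe, then apply the same
    transformation to every dataframe in the dictionary."""
    reference = dataframe_dict[reference_key]
    means = reference[columns].mean()
    stds = reference[columns].std()
    scaled = {}
    for key, df in dataframe_dict.items():
        df = df.copy()
        # Same statistics for every split, so train/test stay comparable.
        df[columns] = (df[columns] - means) / stds
        scaled[key] = df
    return scaled

frames = {
    "train": pd.DataFrame({"x": [1.0, 2.0, 3.0]}),
    "test": pd.DataFrame({"x": [2.0]}),
}
scaled = standard_scale_sketch(frames, "train", ["x"])
```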
- preprocessing.data_transformer.encoding_categorical_feature(dataset_dict: dict, feature_name: str, print_results: Union[bool, int] = True, print_counter: int = 0) → dict¶ Single categorical feature string encoder
This function encodes categorical features. It is possible to use the train data alone or the train data together with the validation and test data. If all datasets are provided (i.e. train, valid and test), they are concatenated first and then encoded.
Parameters: - print_counter (int) – If print_results is an integer, print_counter controls printing data to the console based on the print_results value.
- print_results (Union[bool, int]) – If False, no data is printed to the console. If True, all data is printed to the console. If an integer n, only the data for n features is printed to the console.
- feature_name (str) – The name of the feature/column whose values should be encoded.
- dataset_dict (dict) – A dictionary of pandas series (i.e. one column) that must contain the train data and optionally the validation and test data
Returns: dataset_dict_encoded: A dictionary of pandas series (i.e. one column) after encoding.
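The concatenate-then-encode idea can be sketched as follows (`encode_feature_sketch` is illustrative; the library's actual encoder may assign codes differently):

```python
import pandas as pd

def encode_feature_sketch(dataset_dict: dict) -> dict:
    """Concatenate all splits, learn one shared integer code per category,
    then map each split through that shared encoding."""
    combined = pd.concat(dataset_dict.values())
    # Sorting makes the code assignment deterministic.
    categories = pd.Index(sorted(combined.unique()))
    return {key: series.map(lambda v: categories.get_loc(v))
            for key, series in dataset_dict.items()}

splits = {
    "train": pd.Series(["cat", "dog", "cat"]),
    "test": pd.Series(["dog"]),
}
encoded = encode_feature_sketch(splits)
```

Fitting the encoding on the concatenation guarantees that a category appearing only in the test split still receives a code.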
- preprocessing.data_transformer.encode_categorical_features(dataframe_dict: dict, columns_list: list, print_results: Union[bool, int] = True) → dict¶ Categorical features string encoder
This function applies the encoding_categorical_feature function to each feature in the columns_list.
Parameters: - print_results (Union[bool, int]) – If False, no data is printed to the console. If True, all data is printed to the console. If an integer n, only the data for n features is printed to the console.
- dataframe_dict (dict) – A dictionary of pandas dataframes.
- columns_list (list) – The list of the names of the columns/features whose values should be encoded.
Returns: dataframe_dict: A dictionary of pandas dataframes after encoding.
- preprocessing.data_clean.drop_corr_columns(dataframe: pandas.core.frame.DataFrame, drop_columns: bool = True, print_columns: bool = True, threshold: float = 0.98) → pandas.core.frame.DataFrame¶ Correlated columns eliminator
This function drops correlated columns and keeps only one of them. Removing highly correlated columns usually improves the model’s quality. The function first prints the list of the most correlated columns and then removes them based on the threshold. For more information, please refer to the pandas.DataFrame.corr description: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
Parameters: - dataframe (pd.DataFrame) – Pandas dataframe which contains the dataset e.g. train_dataframe.
- drop_columns (bool) – If true, all correlated columns will be dropped but one.
- print_columns (bool) – If True, information about the correlated columns will be printed to the console.
- threshold (float) – A value between 0 and 1. If the correlation between two columns is larger than this value, they are considered highly correlated. If drop_columns is True, one of those columns will be dropped. The recommended value of the threshold is in [0.7 … 1].
Returns: dataframe: A pandas dataframe which contains the dataset after dropping the correlated columns if drop_columns = True. Otherwise, the same input dataframe will be returned.
Example: For checking correlated columns:
>>> dataframe = drop_corr_columns(dataframe, drop_columns=False, print_columns=True, threshold=0.85)
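The threshold-based dropping can be sketched in plain pandas as follows (`drop_correlated_sketch` and the sample data are illustrative, not the library's implementation):

```python
import pandas as pd

def drop_correlated_sketch(dataframe: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation
    exceeds the threshold, keeping the first column of each pair."""
    corr = dataframe.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return dataframe.drop(columns=list(to_drop))

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # perfectly correlated with "a" -> dropped
    "c": [4, 1, 3, 2],
})
reduced = drop_correlated_sketch(df, threshold=0.98)
```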
- preprocessing.data_clean.drop_const_columns(dataframe: pandas.core.frame.DataFrame, drop_columns: bool = True, print_columns: bool = True) → pandas.core.frame.DataFrame¶ Constant value columns eliminator
This function drops columns that contain constant values. Removing constant columns usually improves the model’s quality. The function first prints the list of constant columns and then drops them.
Parameters: - dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
- drop_columns (bool) – If True, columns that contain a constant value along all rows will be dropped.
- print_columns (bool) – If True, information about the columns that contain constant values will be printed to the console
Returns: dataframe: A pandas dataframe that contains the dataset after dropping the columns that contain constant values if drop_columns = True
Example: For checking the columns which have constant value:
>>> dataframe = drop_const_columns(dataframe, drop_columns=False, print_columns=True)
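A minimal sketch of this behavior (`drop_constant_sketch` is illustrative, not the library's code):

```python
import pandas as pd

def drop_constant_sketch(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Drop every column whose values are all identical."""
    constant = [col for col in dataframe.columns if dataframe[col].nunique() <= 1]
    return dataframe.drop(columns=constant)

df = pd.DataFrame({"id": [1, 2, 3], "flag": [0, 0, 0]})  # "flag" is constant
cleaned = drop_constant_sketch(df)
```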
- preprocessing.data_explorer.print_repeated_values(series_data: pandas.core.series.Series)¶ Repeated values displayer
This function prints the results of value_counts to the console. It shows value_counts().head() or tail().
Parameters: series_data (pd.Series) – The values of one of the features in the given dataset.
- preprocessing.data_explorer.explore_data(dataframe)¶ Interactive data explorer
This function should be run in a Jupyter notebook. The user can go through the features interactively using a slider.
Parameters: dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe.
- class preprocessing.data_explorer.ExploreData(dataframe: pandas.core.frame.DataFrame)¶ Data explorer
This class has the data_explore method, which can be used to explore the data in each column of the dataset.
Parameters: dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
-
__init__(dataframe: pandas.core.frame.DataFrame)¶ Parameters: dataframe (pd.DataFrame) – A pandas dataframe that contains the dataset e.g. train_dataframe
- data_explore(column_i: str)¶ Feature explorer
This method displays a summary of the given feature, including missing values and the most and least repeated values. Besides that, it shows a histogram of the numeric data.
Parameters: column_i (str) – The name of the feature that the user is interested in exploring.
-
- preprocessing.json_preprocessor.extract_json_from_list_dict(row_i: Union[list, dict], object_nr: int) → dict¶ Valid JSON object extractor from list or dict
The function extracts the valid JSON data directly from the list or the dict.
Parameters: - row_i (Union[list, dict]) – The content of the row which has the index i in a pandas series
- object_nr (int) – If there are multiple valid JSON objects detected, the object which has index object_nr will be returned
Returns: The valid_json_object that has the index object_nr
Return type: dict
- preprocessing.json_preprocessor.extract_json_objects(raw_string_data: str, start_json_object: list, end_json_object: list, object_nr: int)¶ Valid JSON object extractor
The function extracts valid JSON objects from texts. These are objects that cannot be extracted directly using json.loads.
Parameters: - raw_string_data (str) – The string object that could contain valid JSON objects
- start_json_object (list) – List of integers that point to the char “{” which can be the start of the JSON objects.
- end_json_object (list) – List of integers that point to the char “}” which can be the end of the JSON objects.
- object_nr (int) – points to the index of the object that should be extracted if there are multiple valid JSON objects.
Returns: valid_json_object that has the index object_nr
Return type: dict
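The brace-scanning idea can be sketched as follows (`extract_embedded_json` is illustrative; the real function receives the precomputed start/end index lists as parameters):

```python
import json

def extract_embedded_json(raw_string_data: str, object_nr: int = 0) -> dict:
    """Scan a text for '{' ... '}' spans and return the object_nr-th span
    that parses as valid JSON."""
    found = []
    starts = [i for i, ch in enumerate(raw_string_data) if ch == "{"]
    ends = [i for i, ch in enumerate(raw_string_data) if ch == "}"]
    for start in starts:
        for end in ends:
            if end > start:
                try:
                    # Keep the first closing brace that yields valid JSON.
                    found.append(json.loads(raw_string_data[start:end + 1]))
                    break
                except ValueError:
                    continue
    return found[object_nr]

text = 'log entry {"user": "anna", "id": 7} trailing text'
obj = extract_embedded_json(text, 0)
```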
- preprocessing.json_preprocessor.normalize_feature(string_data: str, object_nr: int)¶ JSON data searcher
This function searches for possible valid JSON data inside the given text. It identifies possible JSON objects by defining their edges using “{” and “}”. It passes each defined object to the extract_json_objects function to extract the valid JSON objects. The returned valid objects are normalized and returned as a pandas dataframe.
Parameters: - string_data (str) – The string that could contain valid JSON objects.
- object_nr (int) – If there are multiple valid JSON objects detected, the object which has object_nr will be returned
Returns: Pandas dataframe which has “n” number of columns and one row.
Return type: pandas.DataFrame
- preprocessing.json_preprocessor.apply_normalize_feature(dataseries: pandas.core.series.Series, keys_amount: int)¶ JSON-dataframe converter
Parameters: - dataseries (pd.Series) – the feature that contains possible JSON objects
- keys_amount (int) – The possible number of keys or parent JSON objects that a row may contain.
Returns: A list of dataframes. Each element of this list represents the normalized JSON object in each row of the dataseries
Return type: list
- preprocessing.json_preprocessor.column_validation(dataframe: pandas.core.frame.DataFrame, parent_columns: list, feature: str)¶ Column name validator
The function ensures that the dataframe doesn’t have two features with the same name. It changes the names of the columns produced by normalizing the JSON object, based on the name of the parent feature.
Parameters: - dataframe (pd.DataFrame) – the normalized JSON objects that were found in the given feature.
- parent_columns (list) – A list of the names of the features or columns of the main dataset
- feature (str) – The name of the feature that contains the JSON objects
Returns: Pandas dataframe with valid names for the columns
Return type: pd.DataFrame
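One plausible renaming scheme is to prefix clashing columns with the parent feature's name; this sketch assumes that convention (`prefix_clashing_columns` is illustrative, not the library's code):

```python
import pandas as pd

def prefix_clashing_columns(dataframe: pd.DataFrame, parent_columns: list, feature: str) -> pd.DataFrame:
    """Rename normalized-JSON columns that clash with the main dataset's
    columns by prefixing them with the parent feature's name."""
    renamed = {col: f"{feature}_{col}" for col in dataframe.columns if col in parent_columns}
    return dataframe.rename(columns=renamed)

# "id" clashes with a column of the main dataset, "value" does not.
normalized = pd.DataFrame({"id": [1], "value": [2]})
result = prefix_clashing_columns(normalized, parent_columns=["id", "city"], feature="payload")
```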
- preprocessing.json_preprocessor.combine_new_data_to_original(dataframe: pandas.core.frame.DataFrame, dataframe_list: list, feature: str)¶ Dataframes binder
The function concatenates the original dataframe and the newly created dataframes together.
Parameters: - dataframe (pd.DataFrame) – the original dataframe / dataset
- dataframe_list (list) – A list of the dataframes that were created by normalizing the JSON objects in each row of the given feature
- feature (str) – The name of the feature that contains JSON objects
Returns: A pandas dataframe that contains both the original and the newly created data. The original feature will be deleted
Return type: pd.DataFrame
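A minimal sketch of the binding step (`combine_sketch` and the sample data are illustrative, not the library's implementation):

```python
import pandas as pd

def combine_sketch(dataframe: pd.DataFrame, dataframe_list: list, feature: str) -> pd.DataFrame:
    """Stack the per-row normalized frames, align them with the original
    index, drop the raw JSON feature, and concatenate side by side."""
    expanded = pd.concat(dataframe_list, ignore_index=True)
    expanded.index = dataframe.index  # align row-for-row with the original
    return pd.concat([dataframe.drop(columns=[feature]), expanded], axis=1)

original = pd.DataFrame({"id": [1, 2], "payload": ['{"a": 1}', '{"a": 2}']})
rows = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]  # one frame per row
combined = combine_sketch(original, rows, "payload")
```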
- preprocessing.json_preprocessor.feature_with_json_detector(dataseries: pandas.core.series.Series)¶ JSON detector
This function detects whether there are features in the dataset that could contain valid JSON objects.
Parameters: dataseries (pd.Series) – The feature’s values that should be tested for possible valid JSON objects
Returns: True if there are JSON object candidates and False if not
Return type: bool
- preprocessing.json_preprocessor.combine_columns(dataframes_dict, feature)¶ Dataframes combiner
To avoid generating different numbers of columns for different datasets, this function combines them into one large dataframe.
Parameters: - dataframes_dict (dict) – A dictionary of pandas dataframes e.g. {“train”: train_dataframe, “test”: test_dataframe}
- feature (str) – The name of the feature that contains the JSON objects
- preprocessing.json_preprocessor.flat_json(dataframes_dict, json_columns, keys_amount=10)¶ JSON flattener
Parameters: - dataframes_dict (dict) – A dictionary of pandas dataframes e.g. {“train”: train_dataframe, “test”: test_dataframe}
- json_columns (list) – A list of the columns that contain valid nested JSON data
- keys_amount (int) – The possible number of keys or parent JSON objects that a row may contain. The default value is 10.
- preprocessing.utils.read_data(path: str, files_list: list, rows_amount: int = 0) → dict¶ CSV file reader
This function reads the CSV files whose names are listed in files_list.
Parameters: - path (str) – It points to the directory where the data is stored.
- files_list (list) – A list of strings which are the names of the files.
- rows_amount (int) – The number of rows that should be read from the CSV file. If 0, all rows will be read.
Returns: dataframes_dictionary: A dictionary that contains pandas dataframes. The keys are the names of the files without the .csv extension and the values are the associated dataframes.
Raises: - ValueError - If rows_amount has an invalid value
Example: >>> path = "./data"
>>> files_list = ["train.csv", "test.csv"]
>>> dataframes_dictionary = read_data(path, files_list)
The result is a dictionary like {"train": train_dataframe, "test": test_dataframe}.