bluecast.eda.data_leakage_checks¶
Module Contents¶
Functions¶
|
Detect data leakage by checking for high correlations between the target column |
|
Detect data leakage by calculating Theil's U for categorical variables with respect to the target. |
- bluecast.eda.data_leakage_checks.detect_leakage_via_correlation(data: pandas.DataFrame, target_column: str | float | int, threshold: float = 0.9) List[str | float | int | None]¶
Detect data leakage by checking for high correlations between the target column and other columns in the DataFrame. The target column must be part of the provided DataFrame.
- Parameters:
data – The DataFrame containing the data (numerical columns only for features)
target_column – The name of the target column to check for correlations.
threshold – The correlation threshold. If the absolute correlation value is greater than or equal to this threshold, it will be considered as a potential data leakage.
- Returns:
True if data leakage is detected, False if not.
- bluecast.eda.data_leakage_checks.detect_categorical_leakage(data: pandas.DataFrame, target_column: str | float | int, threshold: float = 0.9) List[str | float | int | None]¶
Detect data leakage by calculating Theil’s U for categorical variables with respect to the target. The target column must be part of the provided DataFrame.
- Parameters:
data – The DataFrame containing the data.
target_column – The name of the target column.
threshold – The threshold for Theil’s U. Columns with U greater than or equal to this threshold will be considered potential data leakage.
- Returns:
A list of column names with Theil’s U greater than or equal to the threshold.