bluecast.eda.data_leakage_checks

Module Contents

Functions

detect_leakage_via_correlation(→ List[Union[str, ...)

Detect data leakage by checking for high correlations between the target column and other columns in the DataFrame.

detect_categorical_leakage(→ List[Union[str, float, ...)

Detect data leakage by calculating Theil's U for categorical variables with respect to the target.

bluecast.eda.data_leakage_checks.detect_leakage_via_correlation(data: pandas.DataFrame, target_column: str | float | int, threshold: float = 0.9) → List[str | float | int | None]

Detect data leakage by checking for high correlations between the target column and other columns in the DataFrame. The target column must be part of the provided DataFrame.

Parameters:
  • data – The DataFrame containing the data. Feature columns must be numerical.

  • target_column – The name of the target column to check for correlations.

  • threshold – The correlation threshold. A column whose absolute correlation with the target is greater than or equal to this threshold is flagged as potential data leakage.

Returns:

A list of column names whose absolute correlation with the target is greater than or equal to the threshold.
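
As a rough illustration of the thresholding described above, the following sketch flags columns whose absolute correlation with the target meets the threshold. It is a standalone pandas approximation and not BlueCast's implementation: the helper name correlation_leakage_sketch, the example column names, and the choice of Pearson correlation are all assumptions made for this example.

```python
import pandas as pd


def correlation_leakage_sketch(
    data: pd.DataFrame, target_column: str, threshold: float = 0.9
) -> list:
    """Hypothetical helper: flag columns with |Pearson r| >= threshold."""
    if target_column not in data.columns:
        raise ValueError(f"{target_column!r} is not a column of the DataFrame")
    # Absolute Pearson correlation of every column with the target.
    # Assumes all columns are numerical, as the parameter description requires.
    correlations = data.corr()[target_column].abs()
    # The target correlates perfectly with itself, so exclude it.
    correlations = correlations.drop(labels=[target_column])
    return correlations[correlations >= threshold].index.tolist()


df = pd.DataFrame(
    {
        "feature_a": [2.0, 1.0, 4.0, 3.0, 5.0],
        "feature_b": [5.0, 1.0, 4.0, 2.0, 3.0],
        "target_scaled": [0.0, 2.0, 4.0, 6.0, 8.0],  # exactly 2 * target
        "target_negated": [0.0, -1.0, -2.0, -3.0, -4.0],  # exactly -target
        "target": [0.0, 1.0, 2.0, 3.0, 4.0],
    }
)
flagged = correlation_leakage_sketch(df, "target", threshold=0.9)
print(flagged)
```

Because the comparison uses the absolute value, a perfectly negatively correlated column (target_negated here) is flagged just like a positively correlated one.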

bluecast.eda.data_leakage_checks.detect_categorical_leakage(data: pandas.DataFrame, target_column: str | float | int, threshold: float = 0.9) → List[str | float | int | None]

Detect data leakage by calculating Theil’s U for categorical variables with respect to the target. The target column must be part of the provided DataFrame.

Parameters:
  • data – The DataFrame containing the data.

  • target_column – The name of the target column.

  • threshold – The threshold for Theil’s U. A column whose U with respect to the target is greater than or equal to this threshold is flagged as potential data leakage.

Returns:

A list of column names with Theil’s U greater than or equal to the threshold.
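
For readers unfamiliar with Theil’s U (the uncertainty coefficient), the sketch below shows one common way to compute it from empirical frequencies: U = (H(target) − H(target | feature)) / H(target). It is not BlueCast's implementation. The function names, the example data, and the direction of the coefficient (how much of the target's uncertainty a feature explains) are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of the empirical distribution."""
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in Counter(values).values())


def conditional_entropy(target: Sequence[Hashable], feature: Sequence[Hashable]) -> float:
    """H(target | feature), estimated from joint and marginal counts."""
    total = len(target)
    joint_counts = Counter(zip(target, feature))
    feature_counts = Counter(feature)
    return -sum(
        (n / total) * math.log(n / feature_counts[f])
        for (_, f), n in joint_counts.items()
    )


def theils_u(target: Sequence[Hashable], feature: Sequence[Hashable]) -> float:
    """Fraction of the target's entropy explained by the feature (0 to 1)."""
    h_target = entropy(target)
    if h_target == 0.0:
        return 1.0  # a constant target has no uncertainty left to explain
    return (h_target - conditional_entropy(target, feature)) / h_target


target = ["yes", "no", "yes", "no", "yes", "no"]
columns = {
    "leaky": ["Y", "N", "Y", "N", "Y", "N"],  # one-to-one with the target
    "noise": ["a", "a", "b", "b", "c", "c"],  # carries no target information
}
threshold = 0.9
flagged = [name for name, values in columns.items() if theils_u(target, values) >= threshold]
print(flagged)
```

Unlike correlation, Theil’s U is asymmetric: the value for the target given a feature generally differs from the value for the feature given the target, so the argument order matters.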