bluecast.monitoring.data_monitoring

Module containing classes and functions to monitor data drift.

This is meant for pipelines in production.

Module Contents

Classes

DataDrift

Monitor data drift.

class bluecast.monitoring.data_monitoring.DataDrift(random_seed=20)

Monitor data drift.

Class holding various functions to measure and visualize data drift. This is suitable for batch models and not recommended for online models.

kolmogorov_smirnov_test(data: pandas.DataFrame, new_data: pandas.DataFrame, threshold: float = 0.05)

Checks for data drift in new data based on the Kolmogorov-Smirnov (K-S) test.

The K-S test is a nonparametric test that compares the cumulative distributions of two numerical data sets. Only columns falling under pd.api.types.is_numeric_dtype will be considered.

Parameters:
  • data – Pandas DataFrame with the original data

  • new_data – Pandas DataFrame containing new data to compare against

  • threshold – Threshold for the Kolmogorov-Smirnov test (default is 0.05)

Return drift_flags:

Dictionary containing flags indicating data drift for each column
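
The per-column check described above can be sketched as follows. This is a minimal illustration, not BlueCast's implementation: the helper name ks_drift_flags is hypothetical, it relies on scipy.stats.ks_2samp, and it assumes that threshold is compared against the test's p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_drift_flags(data, new_data, threshold=0.05):
    """Hypothetical helper: flag numeric columns whose K-S p-value is below threshold."""
    flags = {}
    for col in data.columns:
        # Skip columns missing from the new data and non-numeric columns.
        if col not in new_data.columns:
            continue
        if not pd.api.types.is_numeric_dtype(data[col]):
            continue
        result = ks_2samp(data[col].dropna(), new_data[col].dropna())
        # Assumption: a p-value below the threshold is treated as drift.
        flags[col] = bool(result.pvalue < threshold)
    return flags


rng = np.random.default_rng(20)
baseline = pd.DataFrame({
    "stable": rng.normal(0, 1, 1000),
    "shifted": rng.normal(0, 1, 1000),
    "label": ["a"] * 1000,  # non-numeric, so it is ignored
})
incoming = pd.DataFrame({
    "stable": rng.normal(0, 1, 1000),
    "shifted": rng.normal(2, 1, 1000),  # mean moved by two standard deviations
    "label": ["b"] * 1000,
})
flags = ks_drift_flags(baseline, incoming)
print(flags)
```

The result has one boolean per numeric column, matching the shape of the drift_flags dictionary described above.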

_calculate_psi(expected, actual, buckets=10) → float

population_stability_index(data: pandas.DataFrame, new_data: pandas.DataFrame) → Dict[str, bool]

Checks for data drift in new, categorical data based on the population stability index (PSI).

Interpretation of PSI scores:
  • PSI <= 0.1: no change or shift in the distributions of both datasets.

  • 0.1 < PSI < 0.2: a slight change or shift has occurred.

  • PSI > 0.2: a large shift in the distribution has occurred between both datasets.

Parameters:
  • data – Pandas DataFrame with the original data

  • new_data – Pandas DataFrame containing new data to compare against

Return drift_flags:

Dictionary containing flags indicating data drift for each column
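
For orientation, the PSI of a single feature is commonly computed as the sum over buckets of (actual share - expected share) * ln(actual share / expected share). The sketch below follows that formula for a numeric array; it is not taken from BlueCast. The function name psi, the quantile-based bucket edges, and the eps floor for empty buckets are all assumptions made for illustration.

```python
import numpy as np


def psi(expected, actual, buckets=10, eps=1e-6):
    """Hypothetical PSI sketch using quantile buckets derived from `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Inner bucket edges from the expected sample; the outer buckets are open-ended.
    inner_edges = np.quantile(expected, np.linspace(0, 1, buckets + 1))[1:-1]
    expected_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=buckets
    )
    actual_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=buckets
    )
    # Floor the shares so that empty buckets do not produce log(0).
    expected_share = np.clip(expected_counts / expected_counts.sum(), eps, None)
    actual_share = np.clip(actual_counts / actual_counts.sum(), eps, None)
    return float(np.sum((actual_share - expected_share) * np.log(actual_share / expected_share)))


rng = np.random.default_rng(20)
reference = rng.normal(0, 1, 5000)
psi_same = psi(reference, rng.normal(0, 1, 5000))     # same distribution
psi_shifted = psi(reference, rng.normal(1, 1, 5000))  # mean moved by one standard deviation
print(psi_same, psi_shifted)
```

Read against the interpretation above, the first value should fall in the "no change" band and the second in the "large shift" band.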

qqplot_two_samples(x, y, x_label: str = 'X', y_label: str = 'Y', quantiles=None, interpolation='nearest', ax=None, rug=True, rug_length=0.05, rug_kwargs=None, **kwargs)

Draw a quantile-quantile plot for x versus y.

Parameters:
  • x – array-like one-dimensional numeric array or Pandas series

  • y – array-like one-dimensional numeric array or Pandas series

  • x_label – String defining the x-axis label

  • y_label – String defining the y-axis label

  • ax – matplotlib.axes.Axes, optional Axes on which to plot. If not provided, the current axes will be used.

  • quantiles – int or array-like, optional Quantiles to include in the plot. This can be an array of quantiles, in which case only the specified quantiles of x and y will be plotted. If this is an int n, then the quantiles will be n evenly spaced points between 0 and 1. If this is None, then min(len(x), len(y)) evenly spaced quantiles between 0 and 1 will be computed.

  • interpolation – {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’} Specify the interpolation method used to find quantiles when quantiles is an int or None. See the documentation for numpy.quantile().

  • rug – bool, optional If True, draw a rug plot representing both samples on the horizontal and vertical axes. If False, no rug plot is drawn.

  • rug_length – float in [0, 1], optional Specifies the length of the rug plot lines as a fraction of the total vertical or horizontal length.

  • rug_kwargs – dict of keyword arguments Keyword arguments to pass to matplotlib.axes.Axes.axvline() and matplotlib.axes.Axes.axhline() when drawing rug plots.

  • kwargs – dict of keyword arguments Keyword arguments to pass to matplotlib.axes.Axes.scatter() when drawing the q-q plot.
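
How the quantiles argument translates into plotted points can be sketched with NumPy alone. The helper qq_points below is hypothetical and omits all plotting; it assumes a NumPy version (1.22 or later) where numpy.quantile takes the interpolation choice through its method keyword.

```python
import numpy as np


def qq_points(x, y, quantiles=None, method="nearest"):
    """Hypothetical helper: return the (x, y) quantile pairs a q-q plot would scatter."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if quantiles is None:
        # Default described above: as many quantiles as the smaller sample has points.
        quantiles = min(len(x), len(y))
    if isinstance(quantiles, (int, np.integer)):
        # An int n means n evenly spaced probabilities between 0 and 1.
        probs = np.linspace(0, 1, int(quantiles))
    else:
        probs = np.asarray(quantiles, dtype=float)
    return np.quantile(x, probs, method=method), np.quantile(y, probs, method=method)


x = np.arange(0, 101)
y = 2 * x
qx, qy = qq_points(x, y, quantiles=5)
print(qx, qy)
```

If both samples come from the same distribution, the pairs lie close to the diagonal; here y is a scaled copy of x, so the points fall on a straight line with slope 2 instead.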

adversarial_validation(df: pandas.DataFrame, df_new: pandas.DataFrame, cat_columns: List | None, train_on_device: str = 'cpu') → float

Perform adversarial validation to check if the new data is similar to the training data. If the AUC score is close to 0.5, then the new data is similar to the training data. The further the AUC score is from 0.5, the more different the new data is from the training data and multivariate data drift can be assumed.

Additionally, computes feature importance to understand which features contributed the most to identifying train and test rows.

Parameters:
  • df – Baseline DataFrame that is the point of comparison.

  • df_new – New DataFrame to compare against the baseline.

  • cat_columns – (Optional) List with names of categorical columns.

  • train_on_device – Device to train the model on. Options are ‘cpu’ and ‘gpu’. (Default is ‘cpu’)

Returns:

AUC score indicating how similar the two datasets are. Feature importance is displayed as a side effect.
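
The idea behind adversarial validation can be sketched as follows: label baseline rows 0 and new rows 1, train a classifier to tell them apart, and read the cross-validated AUC. This sketch is not BlueCast's implementation. The helper adversarial_auc is hypothetical, it swaps in a scikit-learn logistic regression as the classifier, it handles numeric columns only, and it does not compute feature importance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(df, df_new, random_seed=20):
    """Hypothetical helper: AUC of a classifier separating baseline rows from new rows."""
    features = pd.concat([df, df_new], ignore_index=True)
    # Target marks the origin of each row: 0 = baseline, 1 = new data.
    origin = np.concatenate([np.zeros(len(df)), np.ones(len(df_new))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_seed)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), features, origin, cv=cv, method="predict_proba"
    )[:, 1]
    return float(roc_auc_score(origin, proba))


rng = np.random.default_rng(20)
columns = ["a", "b", "c"]
baseline = pd.DataFrame(rng.normal(0, 1, (500, 3)), columns=columns)
similar = pd.DataFrame(rng.normal(0, 1, (500, 3)), columns=columns)
drifted = similar.copy()
drifted["a"] += 3.0  # shift one feature to simulate drift

auc_similar = adversarial_auc(baseline, similar)
auc_drifted = adversarial_auc(baseline, drifted)
print(auc_similar, auc_drifted)
```

An AUC near 0.5 for the similar sample and a much higher one for the drifted sample matches the interpretation given above.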