:py:mod:`bluecast.monitoring.data_monitoring` ============================================= .. py:module:: bluecast.monitoring.data_monitoring .. autoapi-nested-parse:: Module containing classes and function to monitor data drifts. This is meant for pipelines on production. Module Contents --------------- Classes ~~~~~~~ .. autoapisummary:: bluecast.monitoring.data_monitoring.DataDrift Attributes ~~~~~~~~~~ .. autoapisummary:: bluecast.monitoring.data_monitoring.HAS_SCIPY_STATS .. py:data:: HAS_SCIPY_STATS :value: True .. py:class:: DataDrift(random_seed=20) Monitor data drift. Class holding various functions to measure and visualize data drift. This is suitable for batch models and not recommended for online models. .. py:method:: kolmogorov_smirnov_test(data: pandas.DataFrame, new_data: pandas.DataFrame, threshold: float = 0.05) Checks for data drift in new data based on K-S test. The K-S test is a nonparametric test that compares the cumulative distributions of two numerical data sets. Only columns falling under pd.api.types.is_numeric_dtype will be considered. Requires scipy.stats to be available. If scipy is not installed, this method will raise an ImportError. :param data: Pandas DataFrame with the original data :param new_data: Pandas DataFrame containing new data to compare against :param threshold: Threshold for the Kolmogorov-Smirnov test (default is 0.05) :return drift_flags: Dictionary containing flags indicating data drift for each column .. py:method:: _calculate_psi(expected, actual, buckets=10) -> float .. py:method:: population_stability_index(data: pandas.DataFrame, new_data: pandas.DataFrame) -> Dict[str, bool] Checks for data drift in new, categorical data based on population stability index. Interpretation of PSI scores: - psi <= 0.1: no change or shift in the distributions of both datasets. - psi 0.1 < PSI <0.2: indicates a slight change or shift has occurred. - psi > 0.2: indicates a large shift in the distribution has occurred between both datasets :param data: Pandas DataFrame with the original data :param new_data: Pandas DataFrame containing new data to compare against :return drift_flags: Dictionary containing flags indicating data drift for each column .. py:method:: qqplot_two_samples(x, y, x_label: str = 'X', y_label: str = 'Y', quantiles=None, interpolation='nearest', ax=None, rug=True, rug_length=0.05, rug_kwargs=None, **kwargs) Draw a quantile-quantile plot for `x` versus `y`. :param x: array-like one-dimensional numeric array or Pandas series :param y: array-like one-dimensional numeric array or Pandas series :param x_label: String defining the x-axis label :param y_label: String defining the y-axis label :param ax : matplotlib.axes.Axes, optional Axes on which to plot. If not provided, the current axes will be used. :param quantiles : int or array-like, optional Quantiles to include in the plot. This can be an array of quantiles, in which case only the specified quantiles of `x` and `y` will be plotted. If this is an int `n`, then the quantiles will be `n` evenly spaced points between 0 and 1. If this is None, then `min(len(x), len(y))` evenly spaced quantiles between 0 and 1 will be computed. :param interpolation: {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’} Specify the interpolation method used to find quantiles when `quantiles` is an int or None. See the documentation for numpy.quantile(). :param rug: bool, optional If True, draw a rug plot representing both samples on the horizontal and vertical axes. If False, no rug plot is drawn. :param rug_length: float in [0, 1], optional Specifies the length of the rug plot lines as a fraction of the total vertical or horizontal length. :param rug_kwargs: dict of keyword arguments Keyword arguments to pass to matplotlib.axes.Axes.axvline() and matplotlib.axes.Axes.axhline() when drawing rug plots. :param kwargs: dict of keyword arguments Keyword arguments to pass to matplotlib.axes.Axes.scatter() when drawing the q-q plot. .. py:method:: adversarial_validation(df: pandas.DataFrame, df_new: pandas.DataFrame, cat_columns: Optional[List], train_on_device: str = 'cpu') -> float Perform adversarial validation to check if the new data is similar to the training data. If the AUC score is close to 0.5, then the new data is similar to the training data. The further the AUC score is from 0.5, the more different the new data is from the training data and multivariate data drift can be assumed. Additionally, computes feature importance to understand which features contributed the most to identifying train and test rows. :param df: Baseline DataFrame that is the point of comparison. :param df_new: New DataFrame to compare against the baseline. :param cat_columns: (Optional) List with names of categorical columns. :param train_on_device: Device to train the model on. Options are 'cpu' and 'gpu'. (Default is 'cpu') :return: Auc score that indicates similarity and displays feature importance.