bluecast.blueprints.cast

Run fully configured classification blueprint.

Customization via class attributes is possible. Configs can be instantiated and provided to change XGBoost training. The default hyperparameter search space is relatively lightweight to speed up prototyping. Both binary and multiclass classification problems are supported. Hyperparameter tuning can be switched off, or strengthened via cross-validation; this behaviour is controlled through the attributes of the config classes in the config.training_config module.

Module Contents

Classes

BlueCast

Run fully configured classification blueprint.

class bluecast.blueprints.cast.BlueCast(class_problem: Literal[binary, multiclass], cat_columns: List[str | float | int] | None = None, date_columns: List[str | float | int] | None = None, time_split_column: str | None = None, ml_model: bluecast.ml_modelling.catboost.CatboostModel | Any | None = None, custom_in_fold_preprocessor: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_last_mile_computation: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_preprocessor: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_feature_selector: bluecast.preprocessing.feature_selection.BoostaRootaWrapper | bluecast.preprocessing.custom.CustomPreprocessing | None = None, conf_training: bluecast.config.training_config.TrainingConfig | None = None, conf_xgboost: bluecast.config.training_config.XgboostTuneParamsConfig | bluecast.config.training_config.CatboostTuneParamsConfig | None = None, conf_params_xgboost: bluecast.config.training_config.XgboostFinalParamConfig | bluecast.config.training_config.CatboostFinalParamConfig | None = None, experiment_tracker: bluecast.experimentation.tracking.ExperimentTracker | None = None, single_fold_eval_metric_func: bluecast.evaluation.eval_metrics.ClassificationEvalWrapper | None = None)

Run fully configured classification blueprint.

Customization via class attributes is possible. Configs can be instantiated and provided to change XGBoost training. The default hyperparameter search space is relatively lightweight to speed up prototyping.

Parameters:
  • class_problem – Takes a string containing the class problem type. Either “binary” or “multiclass”.

  • target_column – Takes a string containing the name of the target column. (This name does not appear in the constructor signature above; fit() and fit_eval() take the target column as target_col.)

  • cat_columns – Takes a list of strings containing the names of the categorical columns. If not provided, BlueCast will infer these automatically.

  • date_columns – Takes a list of strings containing the names of the date columns. If not provided, BlueCast will infer these automatically.

  • time_split_column – Takes a string containing the name of the time split column. If not provided, BlueCast will not split the data by time or order, but will do a random split instead.

  • ml_model – Takes an instance of a CatboostModel class. If not provided, BlueCast will instantiate one. This is an API to pass any model class; custom models should inherit from the base class ml_modelling.base_model.BaseModel.

  • custom_in_fold_preprocessor – Takes an instance of a CustomPreprocessing class. Allows users to execute preprocessing after the train-test split within CV folds. This is executed only if precise_cv_tuning in conf_training is True. Custom ML models need to implement this themselves. This step is only useful when the preprocessing step would otherwise have a high chance of overfitting (e.g. oversampling techniques).

  • custom_preprocessor – Takes an instance of a CustomPreprocessing class. Allows users to inject custom preprocessing steps which take place right after the train-test split.

  • custom_last_mile_computation – Takes an instance of a CustomPreprocessing class. Allows users to inject custom preprocessing steps which take place right before the model training.

  • experiment_tracker – Takes an instance of an ExperimentTracker class. If not provided this will be initialized automatically.

  • single_fold_eval_metric_func – Takes a function which calculates the evaluation metric for a single fold. Default is matthews_corrcoef. This function is used to calculate the evaluation metric for each fold during hyperparameter tuning when hyperparameter_tuning_rounds = 1 (default). Lower must be better.
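
The "lower must be better" convention means that a metric where higher is better, such as the Matthews correlation coefficient, has to be sign-flipped before it can act as a tuning objective. The following is a minimal sketch of that convention in plain Python. `binary_mcc` and `lower_is_better` are hypothetical helpers written for this illustration; they are not part of BlueCast, and how ClassificationEvalWrapper handles the sign internally is not covered here.

```python
import math
from typing import Callable, Sequence


def binary_mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for 0/1 labels (higher is better)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: the score is 0.0 when a marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def lower_is_better(
    metric: Callable[[Sequence[int], Sequence[int]], float]
) -> Callable[[Sequence[int], Sequence[int]], float]:
    """Wrap a higher-is-better metric so that smaller values mean better models."""

    def wrapped(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
        return -metric(y_true, y_pred)

    return wrapped


# Hypothetical objective: negated MCC, suitable for a minimizing tuner.
neg_mcc = lower_is_better(binary_mcc)
```

With this wrapper a perfect prediction scores -1.0 and a fully inverted prediction scores 1.0, so a minimizing search prefers the former.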

initial_checks(df: pandas.DataFrame) → None

fit(df: pandas.DataFrame, target_col: str) → None

Train a full ML pipeline.
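
A typical training call might look like the sketch below. It uses only names that appear on this page (BlueCast, TrainingConfig, conf_training, hyperparameter_tuning_rounds, fit, predict). It assumes that TrainingConfig can be instantiated with defaults and that its attributes can be set after creation; the value 10 is purely illustrative. `train_and_predict` is a hypothetical helper, not part of the library.

```python
from typing import Any, Tuple


def train_and_predict(df_train: Any, df_new: Any, target_col: str) -> Tuple[Any, Any]:
    """Fit a binary BlueCast pipeline on df_train and predict on df_new.

    df_train is a pandas DataFrame that includes the target column;
    df_new is a pandas DataFrame with the same features but no target.
    """
    # Imported lazily so the sketch can be loaded without BlueCast installed.
    from bluecast.blueprints.cast import BlueCast
    from bluecast.config.training_config import TrainingConfig

    # Assumption: default construction followed by attribute assignment.
    conf = TrainingConfig()
    conf.hyperparameter_tuning_rounds = 10  # illustrative value

    automl = BlueCast(class_problem="binary", conf_training=conf)
    automl.fit(df_train, target_col=target_col)

    # predict returns probabilities and predicted classes, in that order.
    y_probs, y_classes = automl.predict(df_new)
    return y_probs, y_classes
```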

fit_eval(df: pandas.DataFrame, df_eval: pandas.DataFrame, target_eval: pandas.Series, target_col: str) → Dict[str, Any]

Train a full ML pipeline and evaluate on a holdout set.

This is a convenience function to train and evaluate on a holdout set. It is recommended for model exploration; in production, the simple fit() function should be used.

Parameters:
  • df – Takes a pandas DataFrame containing the training data and the targets.

  • df_eval – Takes a pandas DataFrame containing the evaluation data, but not the targets.

  • target_eval – Takes a pandas Series containing the evaluation targets.

  • target_col – Takes a string containing the name of the target column inside the training data df.
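
The three data inputs differ in whether they carry the target: df keeps it, df_eval must not contain it, and target_eval holds it separately. One way to derive all three from a single labelled frame is sketched below. `make_fit_eval_inputs`, the column names, and the 25% holdout fraction are assumptions made for this example and are not part of BlueCast.

```python
from typing import Tuple

import pandas as pd


def make_fit_eval_inputs(
    data: pd.DataFrame, target_col: str, eval_fraction: float = 0.25
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    """Split one labelled frame into (df, df_eval, target_eval)."""
    n_eval = max(1, int(len(data) * eval_fraction))
    # Hold out the last rows; shuffle `data` first if row order carries no meaning.
    df = data.iloc[:-n_eval].copy()        # training rows, target included
    df_eval = data.iloc[-n_eval:].copy()   # holdout rows
    target_eval = df_eval.pop(target_col)  # drops the target from df_eval
    return df, df_eval, target_eval


data = pd.DataFrame(
    {
        "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "target": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)
df, df_eval, target_eval = make_fit_eval_inputs(data, "target")
```

The resulting objects can then be passed as fit_eval(df, df_eval, target_eval, target_col="target") on a BlueCast instance.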

transform_new_data(df: pandas.DataFrame) → pandas.DataFrame

Transform new data according to preprocessing pipeline.

predict(df: pandas.DataFrame, save_shap_values: bool = False, return_original_labels: bool = False) → Tuple[numpy.ndarray, numpy.ndarray]

Predict on unseen data.

Return the predicted probabilities and the predicted classes: y_probs, y_classes = predict(df)

Parameters:
  • df – Pandas DataFrame with unseen data.

  • save_shap_values – If True, calculates and saves SHAP values so they can be used to plot waterfall plots for selected rows on demand.

  • return_original_labels – If True, returns the original labels instead of the encoded ones.
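
This page does not state how the predicted classes are derived from the probabilities. As a rough mental model only, the sketch below assumes a 0.5 threshold for a one-dimensional array of positive-class probabilities and an argmax over columns for a two-dimensional multiclass array. `classes_from_probs` is a hypothetical helper and may not match BlueCast's internal logic.

```python
import numpy as np


def classes_from_probs(y_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn class probabilities into encoded class labels.

    Assumption: 1-D input holds positive-class probabilities (binary case),
    2-D input holds one column per class (multiclass case).
    """
    y_probs = np.asarray(y_probs, dtype=float)
    if y_probs.ndim == 1:
        return (y_probs >= threshold).astype(int)
    return np.argmax(y_probs, axis=1)


binary_classes = classes_from_probs(np.array([0.2, 0.7]))
multi_classes = classes_from_probs(np.array([[0.1, 0.6, 0.3], [0.8, 0.1, 0.1]]))
```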

predict_proba(df: pandas.DataFrame, save_shap_values: bool = False) → numpy.ndarray

Predict class scores on unseen data.

Return the predicted probabilities: y_probs = predict_proba(df)

Parameters:
  • df – Pandas DataFrame with unseen data.

  • save_shap_values – If True, calculates and saves SHAP values so they can be used to plot waterfall plots for selected rows on demand.

calibrate(x_calibration: pandas.DataFrame, y_calibration: pandas.Series, **kwargs) → None

Calibrate the model.

This function computes the nonconformity measures, which are then used by the predict_sets function to produce calibrated prediction sets, or by the predict_p_values function to return the p-values of a row belonging to each class.

Parameters:
  • x_calibration – Pandas DataFrame without the target column that has not been seen by the model during training.

  • y_calibration – Pandas Series holding the target values that have not been seen by the model during training.
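
To illustrate what a nonconformity measure can look like, the sketch below computes a common choice, the hinge score: one minus the predicted probability of the true class, evaluated on the calibration rows. This is a generic conformal prediction example; the exact measure BlueCast uses is not stated on this page, and `hinge_nonconformity` and the sample arrays are hypothetical.

```python
import numpy as np


def hinge_nonconformity(probs: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Return 1 - P(true class) for each calibration row.

    probs has shape (n_rows, n_classes); y_true holds encoded class indices.
    """
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    return 1.0 - probs[np.arange(len(y_true)), y_true]


# Hypothetical calibration data: predicted probabilities and true labels.
calib_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
calib_labels = np.array([0, 1, 1])
calib_scores = hinge_nonconformity(calib_probs, calib_labels)
```

A large score means the model assigned little probability to the true class, so that row "conforms" poorly to what the model expects.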

predict_p_values(df: pandas.DataFrame) → numpy.ndarray

Create p-values for each class.

The p-values indicate the probability of being the correct class.

Parameters:
  • df – Pandas DataFrame holding unseen data.

Returns:
  Numpy array where each column holds the p-values of a row being that class. If string labels were passed, each column maps to the index of target_label_encoder.target_label_mapping stored in this class.
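
In standard inductive conformal prediction, the p-value of a candidate class is the share of calibration scores that are at least as large as the candidate's own nonconformity score, with a +1 correction in numerator and denominator. The sketch below shows that computation under the assumption of a hinge score (one minus the class probability). `conformal_p_values` and the sample numbers are hypothetical, and BlueCast's implementation may differ in detail.

```python
import numpy as np


def conformal_p_values(calib_scores: np.ndarray, test_probs: np.ndarray) -> np.ndarray:
    """Per-class conformal p-values, shape (n_rows, n_classes)."""
    calib_scores = np.asarray(calib_scores, dtype=float)
    # Candidate score for every (row, class) pair: 1 - predicted probability.
    test_scores = 1.0 - np.asarray(test_probs, dtype=float)
    # Count calibration scores that are >= each candidate score.
    counts = (calib_scores[None, None, :] >= test_scores[:, :, None]).sum(axis=2)
    return (counts + 1) / (len(calib_scores) + 1)


# Hypothetical inputs: three calibration scores and one unseen row.
p_values = conformal_p_values(
    calib_scores=np.array([0.1, 0.3, 0.6]),
    test_probs=np.array([[0.8, 0.2]]),
)
```

Here class 0 receives a p-value of 0.75 and class 1 a p-value of 0.25, one column per class as described above.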

predict_sets(df: pandas.DataFrame, alpha: float = 0.05) → pandas.DataFrame

Create prediction sets based on a certain confidence level.

Conformal prediction guarantees that the correct label is present in the prediction sets with a probability of 1 - alpha.

Parameters:
  • df – Pandas DataFrame holding unseen data.

  • alpha – Float indicating the desired significance level; the resulting confidence level is 1 - alpha.

Returns:
  A Pandas DataFrame with a column called ‘prediction_set’ holding a nested set with the predicted classes.
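
The link between p-values and prediction sets can be sketched as follows: a class enters a row's set when its p-value exceeds alpha, so a smaller alpha yields larger sets. `prediction_sets_from_p_values` is a hypothetical helper, and the strict inequality is an assumption; the output layout mirrors the ‘prediction_set’ column described above.

```python
from typing import Sequence

import pandas as pd


def prediction_sets_from_p_values(
    p_values: Sequence[Sequence[float]], alpha: float = 0.05
) -> pd.DataFrame:
    """Build one set of class indices per row from per-class p-values."""
    sets = [
        {cls for cls, p in enumerate(row) if p > alpha}  # keep plausible classes
        for row in p_values
    ]
    return pd.DataFrame({"prediction_set": sets})


# Hypothetical p-values for two rows and two classes.
example_p_values = [[0.75, 0.25], [0.02, 0.90]]
sets_95 = prediction_sets_from_p_values(example_p_values, alpha=0.05)
sets_70 = prediction_sets_from_p_values(example_p_values, alpha=0.30)
```

At alpha = 0.05 the first row keeps both classes while the second keeps only class 1; raising alpha to 0.30 shrinks the first row's set to class 0.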