bluecast.blueprints.cast_cv¶
Module Contents¶
Classes¶
BlueCastCV | Wrapper to train and predict multiple BlueCast instances.
- class bluecast.blueprints.cast_cv.BlueCastCV(class_problem: Literal[binary, multiclass] = 'binary', cat_columns: List[str | float | int] | None = None, stratifier: Any | None = None, conf_training: bluecast.config.training_config.TrainingConfig | None = None, conf_xgboost: bluecast.config.training_config.XgboostTuneParamsConfig | bluecast.config.training_config.CatboostTuneParamsConfig | None = None, conf_params_xgboost: bluecast.config.training_config.XgboostFinalParamConfig | bluecast.config.training_config.CatboostFinalParamConfig | None = None, experiment_tracker: bluecast.experimentation.tracking.ExperimentTracker | None = None, custom_in_fold_preprocessor: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_last_mile_computation: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_preprocessor: bluecast.preprocessing.custom.CustomPreprocessing | None = None, custom_feature_selector: bluecast.preprocessing.feature_selection.BoostaRootaWrapper | bluecast.preprocessing.custom.CustomPreprocessing | None = None, ml_model: bluecast.ml_modelling.catboost.CatboostModel | Any | None = None, single_fold_eval_metric_func: bluecast.evaluation.eval_metrics.ClassificationEvalWrapper | None = None)¶
Wrapper to train and predict multiple BlueCast instances.
A custom splitter can be provided.
- Parameters:
class_problem – Takes a string containing the class problem type. Either “binary” or “multiclass”.
target_column – Takes a string containing the name of the target column.
cat_columns – Takes a list of strings containing the names of the categorical columns. If not provided, BlueCast will infer these automatically.
date_columns – Takes a list of strings containing the names of the date columns. If not provided, BlueCast will infer these automatically.
time_split_column – Takes a string containing the name of the time split column. If not provided, BlueCast will not split the data by time or order, but do a random split instead.
ml_model – Takes an instance of an XgboostModel class. If not provided, BlueCast will instantiate one. This is an API to pass any model class; custom models should inherit from the base class ml_modelling.base_model.BaseModel.
custom_in_fold_preprocessor – Takes an instance of a CustomPreprocessing class. Allows users to execute preprocessing after the train-test split within CV folds. This is executed only if precise_cv_tuning in conf_training is True. Custom ML models need to implement this themselves. This step is only useful when the preprocessing step would otherwise have a high chance of overfitting (e.g. oversampling techniques).
custom_preprocessor – Takes an instance of a CustomPreprocessing class. Allows users to inject custom preprocessing steps which take place right after the train-test split.
custom_last_mile_computation – Takes an instance of a CustomPreprocessing class. Allows users to inject custom preprocessing steps which take place right before the model training.
experiment_tracker – Takes an instance of an ExperimentTracker class. If not provided this will be initialized automatically.
single_fold_eval_metric_func – Takes a function which calculates the evaluation metric for a single fold. Default is matthews_corrcoef. This function is used to calculate the evaluation metric for each fold during hyperparameter tuning when hyperparameter_tuning_rounds = 1 (default). Lower must be better.
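The “lower must be better” requirement means a score such as the Matthews correlation coefficient, where higher is better, has to be turned into a loss before it is minimized. The following is a minimal sketch of that idea using plain functions. The names matthews and as_loss are hypothetical, and the interface of ClassificationEvalWrapper is not shown in this reference, so this is not how BlueCast itself wires the metric.

```python
from math import sqrt


def matthews(y_true, y_pred):
    """Binary Matthews correlation coefficient (higher is better)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def as_loss(metric):
    """Wrap a higher-is-better metric so that lower is better (hypothetical helper)."""
    def wrapped(y_true, y_pred):
        return -metric(y_true, y_pred)
    return wrapped


loss = as_loss(matthews)
y_true = [1, 1, 0, 0]
perfect_loss = loss(y_true, [1, 1, 0, 0])    # all predictions correct
imperfect_loss = loss(y_true, [1, 0, 0, 0])  # one positive missed
```

With this wrapping, the better set of predictions receives the smaller value, which is the ordering a minimizing tuner expects.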
- prepare_data(df: pandas.DataFrame, target: str) Tuple[pandas.DataFrame, pandas.Series]¶
- show_oof_scores(metric: str = 'matthews') Tuple[float, float]¶
Show out of fold scores.
When calling BlueCastCV’s fit_eval function, multiple BlueCast instances are fitted and each of them predicts on unseen/oof data.
This function collects these scores and returns their mean and standard deviation.
- Parameters:
metric – String indicating which metric shall be returned.
- Returns:
Tuple with (mean, std) of oof scores
- fit(df: pandas.DataFrame, target_col: str) None¶
Fit multiple BlueCast instances on different data splits.
The input df is expected to include the target column.
- fit_eval(df: pandas.DataFrame, target_col: str) Tuple[float, float]¶
Fit multiple BlueCast instances on different data splits.
The input df is expected to include the target column. Evaluation is executed on the out-of-fold dataset in each split.
- Parameters:
df – Pandas DataFrame that includes the target column
target_col – String indicating the name of the target column
- Returns:
Tuple of (oof_mean, oof_std) with scores on unseen data during eval
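To illustrate what out-of-fold evaluation across several splits means, here is a minimal, standard-library sketch. It is not BlueCast’s implementation: the interleaved splitter, the majority-class “model”, the accuracy score, and the use of the population standard deviation are all assumptions made for the example, and every function name is hypothetical.

```python
from statistics import fmean, pstdev


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx); fold i holds out every n_folds-th row."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n_rows) if i % n_folds == fold]
        train_idx = [i for i in range(n_rows) if i % n_folds != fold]
        yield train_idx, test_idx


def majority_class(labels):
    """Stand-in for model fitting: remember the most frequent label."""
    return max(sorted(set(labels)), key=labels.count)


def fit_eval_sketch(labels, n_folds=5):
    """Fit one stand-in model per split and score it on the held-out rows."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), n_folds):
        model = majority_class([labels[i] for i in train_idx])
        hits = sum(1 for i in test_idx if labels[i] == model)
        scores.append(hits / len(test_idx))  # accuracy on unseen rows
    return fmean(scores), pstdev(scores)


labels = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical target column
oof_mean, oof_std = fit_eval_sketch(labels, n_folds=5)
```

Each row is scored exactly once by a model that never saw it during fitting, which is why the resulting mean and standard deviation describe performance on unseen data.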
- predict(df: pandas.DataFrame, return_sub_models_preds: bool = False, save_shap_values: bool = False) Tuple[pandas.DataFrame | pandas.Series, pandas.DataFrame | pandas.Series]¶
Predict on unseen data using multiple trained BlueCast instances.
- Parameters:
df – Pandas DataFrame with unseen data
return_sub_models_preds – If True, returns a DataFrame with the predictions of each model for each class stored in separate columns.
save_shap_values – If True, calculates and saves SHAP values, so they can be used to plot waterfall plots for selected rows on demand.
- predict_proba(df: pandas.DataFrame, return_sub_models_preds: bool = False, save_shap_values: bool = False) pandas.DataFrame | pandas.Series¶
Predict on unseen data using multiple trained BlueCast instances.
- Parameters:
df – Pandas DataFrame with unseen data
return_sub_models_preds – If True, returns a DataFrame with the predictions of each model for each class stored in separate columns.
save_shap_values – If True, calculates and saves SHAP values, so they can be used to plot waterfall plots for selected rows on demand.
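The relationship between per-model predictions and a single combined prediction can be sketched as follows. This reference does not state how the sub-model predictions are combined, so the unweighted mean used here is an assumption, and the function names are hypothetical.

```python
def blend_probabilities(sub_model_probs):
    """Average class probabilities across models.

    sub_model_probs[m][r][c] is model m's probability of class c for row r.
    An unweighted mean is assumed here for illustration.
    """
    n_models = len(sub_model_probs)
    blended = []
    for rows in zip(*sub_model_probs):  # the same row as seen by each model
        blended.append([sum(col) / n_models for col in zip(*rows)])
    return blended


def predicted_classes(blended):
    """Pick the class index with the highest blended probability per row."""
    return [max(range(len(row)), key=row.__getitem__) for row in blended]


model_a = [[0.75, 0.25], [0.25, 0.75]]  # hypothetical sub-model outputs
model_b = [[0.5, 0.5], [0.25, 0.75]]
blended = blend_probabilities([model_a, model_b])
classes = predicted_classes(blended)
```

Keeping the per-model outputs alongside the blend, as return_sub_models_preds allows, makes it possible to see how much the sub models disagree on a given row.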
- calibrate(x_calibration: pandas.DataFrame, y_calibration: pandas.Series, **kwargs) None¶
Calibrate the model.
Via this function the nonconformity measures are taken and used to predict calibrated sets via the predict_sets function, or to return p-values of a row for being the class via the predict_p_values function. This calibrates the blended prediction of all sub models.
- Parameters:
x_calibration – Pandas DataFrame without the target column that has not been seen by the model during training.
y_calibration – Pandas Series holding the target values that have not been seen by the model during training.
- predict_p_values(df: pandas.DataFrame) numpy.ndarray¶
Create p-values for each class.
The p-values indicate the probability of being the correct class.
- Parameters:
df – Pandas DataFrame holding unseen data
- Returns:
Numpy array where each column holds the p-values of a row being that class. If string labels were passed, each column maps to the index of target_label_encoder.target_label_mapping stored in this class.
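For readers unfamiliar with conformal p-values, the sketch below shows one common way they are derived from a calibration set in inductive conformal prediction. It assumes the nonconformity score is 1 minus the predicted probability of the class in question; this reference does not confirm which nonconformity measure BlueCast uses, and the function names are hypothetical.

```python
def calibration_scores(probs, labels):
    """Nonconformity of each calibration row: 1 - probability of its true class."""
    return [1.0 - row[y] for row, y in zip(probs, labels)]


def conformal_p_values(cal_scores, new_probs):
    """For each new row and class, the share of calibration rows that conform no better."""
    n = len(cal_scores)
    result = []
    for row in new_probs:
        row_p_values = []
        for prob in row:
            score = 1.0 - prob
            at_least_as_strange = sum(1 for s in cal_scores if s >= score)
            row_p_values.append((at_least_as_strange + 1) / (n + 1))
        result.append(row_p_values)
    return result


# Hypothetical calibration data: predicted class probabilities and true labels.
cal_probs = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75], [0.875, 0.125]]
cal_labels = [0, 0, 1, 0]
scores = calibration_scores(cal_probs, cal_labels)
p_vals = conformal_p_values(scores, [[0.75, 0.25]])
```

A class whose score looks typical compared with the calibration rows receives a high p-value, while a class that would be unusually poorly supported receives a low one.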
- predict_sets(df: pandas.DataFrame, alpha: float = 0.05) pandas.DataFrame¶
Create prediction sets based on a certain confidence level.
Conformal prediction guarantees that the correct label is present in the prediction sets with a probability of 1 - alpha.
- Parameters:
df – Pandas DataFrame holding unseen data
alpha – Float indicating the significance level; the resulting confidence level is 1 - alpha.
- Returns:
A Pandas DataFrame with a column called ‘prediction_set’ holding a nested set with predicted classes.
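The step from p-values to prediction sets can be sketched in a few lines. This assumes the usual conformal rule of keeping every class whose p-value exceeds alpha; whether BlueCast uses a strict or non-strict comparison is not stated here, and prediction_sets_sketch is a hypothetical name.

```python
def prediction_sets_sketch(p_values, alpha=0.05):
    """Keep every class index whose p-value is above alpha (assumed rule)."""
    return [{c for c, p in enumerate(row) if p > alpha} for row in p_values]


# Hypothetical p-values for two rows and two classes.
p_values = [[0.8, 0.2], [0.04, 0.6]]
sets_at_5_percent = prediction_sets_sketch(p_values, alpha=0.05)
sets_at_25_percent = prediction_sets_sketch(p_values, alpha=0.25)
```

Raising alpha lowers the confidence level 1 - alpha, so the sets shrink: the first row keeps both classes at alpha = 0.05 but only one at alpha = 0.25.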