bluecast.config.training_config
Define training and common configuration parameters.
Pydantic dataclasses define the configuration parameters, which gives users a Pythonic interface with type checking and validation. The parameters are used in the training pipeline and in the evaluation pipeline. Default configurations can be loaded, adjusted and passed into the blueprints.
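The load, adjust and pass workflow can be sketched with a standard-library dataclass. `TrainingConfigSketch` and `run_blueprint` are hypothetical stand-ins, not BlueCast APIs; only the three field names and their defaults are taken from the `TrainingConfig` signature on this page.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in carrying three fields of TrainingConfig."""

    global_random_state: int = 33
    hyperparameter_tuning_rounds: int = 200
    hypertuning_cv_folds: int = 5


def run_blueprint(conf: TrainingConfigSketch) -> str:
    """Hypothetical consumer that only reports the settings it received."""
    return (
        f"{conf.hyperparameter_tuning_rounds} tuning rounds, "
        f"{conf.hypertuning_cv_folds} CV folds, "
        f"seed {conf.global_random_state}"
    )


# 1. Load the defaults, 2. adjust single attributes, 3. pass the object on.
train_config = TrainingConfigSketch()
train_config.hyperparameter_tuning_rounds = 10
train_config.hypertuning_cv_folds = 1
summary = run_blueprint(train_config)
```

Adjusting attributes on a default instance keeps every untouched parameter at its documented default, which is usually less error-prone than spelling out the full constructor call.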
Module Contents
Classes
TrainingConfig | Define general training parameters.
XgboostTuneParamsConfig | Define hyperparameter tuning search space.
XgboostTuneParamsRegressionConfig | Define hyperparameter tuning search space.
XgboostFinalParamConfig | Define final hyperparameters.
XgboostRegressionFinalParamConfig | Define final hyperparameters.
CatboostTuneParamsConfig | Define hyperparameter tuning search space for CatBoost (classification or multiclass).
CatboostTuneParamsRegressionConfig | Define hyperparameter tuning search space for CatBoost (regression).
Define final hyperparameters for CatBoost (classification or multiclass) using CatBoost defaults.
Define final hyperparameters for CatBoost (regression) using CatBoost defaults.
- class bluecast.config.training_config.TrainingConfig(global_random_state: int = 33, increase_random_state_in_bluecast_cv_by: int = 200, shuffle_during_training: bool = True, hyperparameter_tuning_rounds: int = 200, hyperparameter_tuning_max_runtime_secs: int = 3600, hypertuning_cv_folds: int = 5, hypertuning_cv_repeats: int = 1, sample_data_during_tuning: bool = False, sample_data_during_tuning_alpha: float = 2.0, precise_cv_tuning: bool = False, early_stopping_rounds: int | None = 20, autotune_model: bool = True, autotune_on_device: str = 'cpu', autotune_n_random_seeds: int = 1, plot_hyperparameter_tuning_overview: bool = True, enable_feature_selection: bool = False, calculate_shap_values: bool = True, shap_waterfall_indices: List[int | None] | None = None, show_dependence_plots_of_top_n_features: int = 0, store_shap_values_in_instance: bool = False, train_size: float = 0.8, train_split_stratify: bool = True, use_full_data_for_final_model: bool = True, cardinality_threshold_for_onehot_encoding: int = 5, infrequent_categories_threshold: int = 5, cat_encoding_via_ml_algorithm: bool = True, show_detailed_tuning_logs: bool = False, optuna_sampler_n_startup_trials: int = 10, enable_grid_search_fine_tuning: bool = False, gridsearch_tuning_max_runtime_secs: int = 3600, gridsearch_nb_parameters_per_grid: int = 5, bluecast_cv_train_n_model: Tuple[int, int] = (5, 1), logging_file_path: str | None = None, experiment_name: str = 'new experiment', out_of_fold_dataset_store_path: str | None = None, optuna_db_backend_path: str | None = None)
Define general training parameters.
- Parameters:
global_random_state – Global random state to use for reproducibility.
increase_random_state_in_bluecast_cv_by – In BlueCastCV multiple models are trained. Define by how much the random state changes with each additional model.
shuffle_during_training – Whether to shuffle the data during training when hypertuning_cv_folds > 1.
hyperparameter_tuning_rounds – Number of hyperparameter tuning rounds. Not used when custom ML model is passed.
hyperparameter_tuning_max_runtime_secs – Maximum runtime in seconds for hyperparameter tuning. Not used when custom ML model is passed.
hypertuning_cv_folds – Number of cross-validation folds to use for hyperparameter tuning. Not used when custom ML model is passed.
hypertuning_cv_repeats – Number of repetitions for each cross-validation fold during hyperparameter tuning. Not used when custom ML model is passed.
sample_data_during_tuning – Whether to sample the data during tuning. Not used when custom ML model is passed.
sample_data_during_tuning_alpha – Alpha value for sampling the data during tuning. The higher alpha is, the fewer samples remain. Not used when custom ML model is passed.
class_weight_during_dmatrix_creation – Whether to use class weights during DMatrix creation. Not used when custom ML model is passed.
early_stopping_rounds – Number of early stopping rounds during final training or when hyperparameter tuning follows a single train-test split. Not used when custom ML model is passed.
autotune_model – Whether to autotune the model. Not used when custom ML model is passed.
autotune_on_device – Whether to autotune on CPU or GPU. Choose any of [“gpu”, “cpu”]. Not used when custom ML model is passed.
autotune_n_random_seeds – Number of random seeds to use for autotuning. This changes only Optuna’s random seed, which is updated again after every nth trial. Not used when custom ML model is passed.
plot_hyperparameter_tuning_overview – Whether to plot the hyperparameter tuning overview. Not used when custom ML model is passed.
enable_feature_selection – Whether to enable recursive feature selection.
calculate_shap_values – Whether to calculate SHAP values. Also used when custom ML model is passed. Not compatible with all ML models. See the SHAP documentation for more details.
shap_waterfall_indices – List of sample indices to plot. Each index represents a sample (e.g. [0, 1, 499]).
show_dependence_plots_of_top_n_features – Maximum number of dependence plots to show. Not used when custom ML model is passed.
store_shap_values_in_instance – Whether to store the SHAP values in the BlueCast instance. Not applicable when custom ML model is used.
train_size – Train size to use for train-test split.
train_split_stratify – Whether to stratify the train-test split. Not used when custom ML model is passed.
use_full_data_for_final_model – Whether to use the full data for the final model. This might cause overfitting. Not used when custom ML model is passed.
cardinality_threshold_for_onehot_encoding – Categorical features with a cardinality less than or equal to this threshold will be one-hot encoded. The rest will be target encoded. Ignored if cat_encoding_via_ml_algorithm is set to True.
infrequent_categories_threshold – Categories with a frequency below this threshold will be grouped into a common group to reduce the risk of overfitting. Ignored if cat_encoding_via_ml_algorithm is set to True.
cat_encoding_via_ml_algorithm – Whether to use an ML algorithm for categorical encoding. If False, categorical encoding is done via target encoding in the preprocessing steps. See the ReadMe for more details.
show_detailed_tuning_logs – Whether to show detailed tuning logs. Not used when custom ML model is passed.
enable_grid_search_fine_tuning – Whether to run a grid search on a fine-grained grid after hyperparameter tuning, based on the previous tuning results. Only possible when autotune_model is True.
gridsearch_nb_parameters_per_grid – Number of steps the grid has per parameter.
gridsearch_tuning_max_runtime_secs – Maximum time in seconds the grid search tuning may run. The trial in progress is always finished, so the actual runtime can exceed this limit.
experiment_name – Name of the experiment. Will be logged inside the ExperimentTracker.
logging_file_path – Path to the logging file. If None, the logging will be printed to the Jupyter notebook instead.
out_of_fold_dataset_store_path – Path to store the out of fold dataset. If None, the out of fold dataset will not be stored. Shall end with a slash. Only used when BlueCast instances are called with fit_eval method.
optuna_db_backend_path – Path to the Optuna database backend file. If provided as a string, Optuna will use a persistent SQLite database to store hyperparameter tuning progress, allowing resumption if tuning fails. If None (default), Optuna will use in-memory storage. Example: “/path/to/optuna_study.db”
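How `cardinality_threshold_for_onehot_encoding` and `infrequent_categories_threshold` could interact is sketched below in plain Python. The helper names, the "rare" label and the order of the two steps are assumptions based on the parameter descriptions above, not BlueCast's actual preprocessing code. Both thresholds are ignored when `cat_encoding_via_ml_algorithm` is True.

```python
from collections import Counter
from typing import List


def group_infrequent(values: List[str], threshold: int = 5, label: str = "rare") -> List[str]:
    """Replace categories seen fewer than `threshold` times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else label for v in values]


def choose_encoding(values: List[str], cardinality_threshold: int = 5) -> str:
    """Pick one-hot encoding for low-cardinality columns, target encoding otherwise."""
    return "onehot" if len(set(values)) <= cardinality_threshold else "target"


colors = ["red"] * 6 + ["blue"] * 5 + ["teal", "plum"]
grouped = group_infrequent(colors)         # "teal" and "plum" occur fewer than 5 times
encoding = choose_encoding(grouped)        # few distinct values remain after grouping
zip_codes = [str(10000 + i) for i in range(20)]
zip_encoding = choose_encoding(zip_codes)  # 20 distinct values exceed the threshold
```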
- dict()
Return dictionary with all class attributes.
The implementation keeps backwards compatibility because this class was previously a Pydantic BaseModel.
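One way such a `dict()` method could look on a plain dataclass is sketched below. It assumes the method simply exposes all attributes, as the description says; `ConfigSketch` is a hypothetical class and the real implementation may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ConfigSketch:
    """Hypothetical config class with two fields from TrainingConfig."""

    train_size: float = 0.8
    logging_file_path: Optional[str] = None

    def dict(self) -> Dict[str, Any]:
        """Mirror the old Pydantic BaseModel.dict() call for existing callers."""
        return asdict(self)


conf = ConfigSketch(train_size=0.7)
as_dict = conf.dict()
```

Keeping the method name means code written against the earlier Pydantic-based classes can continue to call `.dict()` unchanged.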
- class bluecast.config.training_config.XgboostTuneParamsConfig(max_depth_min: int = 1, max_depth_max: int = 10, alpha_min: float = 1e-08, alpha_max: float = 100.0, lambda_min: float = 1.0, lambda_max: float = 100.0, gamma_min: float = 1e-08, gamma_max: float = 10.0, min_child_weight_min: float = 1.0, min_child_weight_max: float = 100.0, sub_sample_min: float = 0.1, sub_sample_max: float = 1.0, col_sample_by_tree_min: float = 0.1, col_sample_by_tree_max: float = 1.0, col_sample_by_level_min: float = 1.0, col_sample_by_level_max: float = 1.0, max_bin_min: int = 128, max_bin_max: int = 1024, eta_min: float = 0.001, eta_max: float = 0.3, steps_min: int = 1000, steps_max: int = 1000, verbosity_during_hyperparameter_tuning: int = 0, verbosity_during_final_model_training: int = 0, booster: List[str] | None = None, grow_policy: List[str] | None = None, tree_method: List[str] | None = None, xgboost_objective: str = 'multi:softprob', xgboost_eval_metric: str = 'mlogloss', xgboost_eval_metric_tune_direction: str = 'minimize')
Define hyperparameter tuning search space.
- Parameters:
max_depth_min – Minimum value for the maximum depth of the trees. Defaults to 1.
max_depth_max – Maximum value for the maximum depth of the trees. Defaults to 10.
alpha_min – Minimum value for L1 regularization term (alpha). Defaults to 1e-8.
alpha_max – Maximum value for L1 regularization term (alpha). Defaults to 100.
lambda_min – Minimum value for L2 regularization term (lambda). Defaults to 1.
lambda_max – Maximum value for L2 regularization term (lambda). Defaults to 100.
gamma_min – Minimum value for minimum loss reduction required to make a further partition on a leaf node of the tree (gamma). Defaults to 1e-8.
gamma_max – Maximum value for minimum loss reduction required to make a further partition on a leaf node of the tree (gamma). Defaults to 10.
min_child_weight_min – Minimum value for minimum sum of instance weight (hessian) needed in a child. Defaults to 1.
min_child_weight_max – Maximum value for minimum sum of instance weight (hessian) needed in a child. Defaults to 100.
sub_sample_min – Minimum value of subsample ratio of the training instances. Defaults to 0.1.
sub_sample_max – Maximum value of subsample ratio of the training instances. Defaults to 1.0.
col_sample_by_tree_min – Minimum value of subsample ratio of columns when constructing each tree. Defaults to 0.1.
col_sample_by_tree_max – Maximum value of subsample ratio of columns when constructing each tree. Defaults to 1.0.
col_sample_by_level_min – Minimum value of subsample columns for each split in each level. Defaults to 1.0.
col_sample_by_level_max – Maximum value of subsample columns for each split in each level. Defaults to 1.0.
max_bin_min – Minimum value for maximum number of bins. Defaults to 128.
max_bin_max – Maximum value for maximum number of bins. Defaults to 1024.
eta_min – Minimum value for learning rate (eta). Defaults to 1e-3.
eta_max – Maximum value for learning rate (eta). Defaults to 0.3.
steps_min – Minimum number of boosting rounds. Defaults to 1000.
steps_max – Maximum number of boosting rounds. Defaults to 1000.
verbosity_during_hyperparameter_tuning – Verbosity level during hyperparameter tuning. Defaults to 0.
verbosity_during_final_model_training – Verbosity level during final model training. Defaults to 0.
booster – List of booster types. Defaults to [“gbtree”].
grow_policy – List of grow policies. Defaults to [“depthwise”, “lossguide”].
tree_method – List of tree building methods. Defaults to [“exact”, “approx”, “hist”].
xgboost_objective – XGBoost objective. Defaults to “multi:softprob”.
xgboost_eval_metric – XGBoost evaluation metric. Defaults to “mlogloss”.
xgboost_eval_metric_tune_direction – Direction to tune the evaluation metric. Defaults to “minimize”. Must be any of [‘minimize’, ‘maximize’]
- dict()
Return dictionary with all class attributes.
The implementation keeps backwards compatibility because this class was previously a Pydantic BaseModel.
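To make the `*_min` / `*_max` convention concrete, the sketch below draws one candidate from a few of the documented default bounds using only the standard library. Random sampling is a simplified stand-in for the actual tuner, and the log-uniform choice for `alpha` and `eta` is an assumption. Equal bounds, as in the default `steps_min = steps_max = 1000`, pin a parameter to a constant.

```python
import math
import random
from typing import Dict, Tuple

# Bounds copied from the XgboostTuneParamsConfig defaults above.
INT_BOUNDS: Dict[str, Tuple[int, int]] = {"max_depth": (1, 10), "steps": (1000, 1000)}
FLOAT_BOUNDS: Dict[str, Tuple[float, float]] = {"sub_sample": (0.1, 1.0)}
LOG_BOUNDS: Dict[str, Tuple[float, float]] = {"alpha": (1e-8, 100.0), "eta": (1e-3, 0.3)}


def sample_candidate(rng: random.Random) -> Dict[str, float]:
    """Draw one parameter set; every value stays inside its [min, max] range."""
    candidate: Dict[str, float] = {}
    for name, (low, high) in INT_BOUNDS.items():
        candidate[name] = rng.randint(low, high)  # inclusive on both ends
    for name, (low, high) in FLOAT_BOUNDS.items():
        candidate[name] = rng.uniform(low, high)
    for name, (low, high) in LOG_BOUNDS.items():
        # Sample the exponent uniformly so small magnitudes are covered too.
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
        # Clamp to guard against floating-point rounding at the edges.
        candidate[name] = min(max(value, low), high)
    return candidate


candidate = sample_candidate(random.Random(33))
```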
- class bluecast.config.training_config.XgboostTuneParamsRegressionConfig(max_depth_min: int = 1, max_depth_max: int = 10, alpha_min: float = 1e-08, alpha_max: float = 100.0, lambda_min: float = 1.0, lambda_max: float = 100.0, gamma_min: float = 1e-08, gamma_max: float = 10.0, min_child_weight_min: float = 1.0, min_child_weight_max: float = 100.0, sub_sample_min: float = 0.1, sub_sample_max: float = 1.0, col_sample_by_tree_min: float = 0.1, col_sample_by_tree_max: float = 1.0, col_sample_by_level_min: float = 1.0, col_sample_by_level_max: float = 1.0, max_bin_min: int = 128, max_bin_max: int = 1024, eta_min: float = 0.001, eta_max: float = 0.3, steps_min: int = 1000, steps_max: int = 1000, verbosity_during_hyperparameter_tuning: int = 0, verbosity_during_final_model_training: int = 0, booster: List[str] | None = None, grow_policy: List[str] | None = None, tree_method: List[str] | None = None, xgboost_objective: str = 'reg:squarederror', xgboost_eval_metric: str = 'rmse', xgboost_eval_metric_tune_direction: str = 'minimize')
Define hyperparameter tuning search space.
- Parameters:
max_depth_min – Minimum value for the maximum depth of the trees. Defaults to 1.
max_depth_max – Maximum value for the maximum depth of the trees. Defaults to 10.
alpha_min – Minimum value for L1 regularization term (alpha). Defaults to 1e-8.
alpha_max – Maximum value for L1 regularization term (alpha). Defaults to 100.
lambda_min – Minimum value for L2 regularization term (lambda). Defaults to 1.
lambda_max – Maximum value for L2 regularization term (lambda). Defaults to 100.
gamma_min – Minimum value for minimum loss reduction required to make a further partition on a leaf node of the tree (gamma). Defaults to 1e-8.
gamma_max – Maximum value for minimum loss reduction required to make a further partition on a leaf node of the tree (gamma). Defaults to 10.
min_child_weight_min – Minimum value for minimum sum of instance weight (hessian) needed in a child. Defaults to 1.
min_child_weight_max – Maximum value for minimum sum of instance weight (hessian) needed in a child. Defaults to 100.
sub_sample_min – Minimum value of subsample ratio of the training instances. Defaults to 0.1.
sub_sample_max – Maximum value of subsample ratio of the training instances. Defaults to 1.0.
col_sample_by_tree_min – Minimum value of subsample ratio of columns when constructing each tree. Defaults to 0.1.
col_sample_by_tree_max – Maximum value of subsample ratio of columns when constructing each tree. Defaults to 1.0.
col_sample_by_level_min – Minimum value of subsample columns for each split in each level. Defaults to 1.0.
col_sample_by_level_max – Maximum value of subsample columns for each split in each level. Defaults to 1.0.
max_bin_min – Minimum value for maximum number of bins. Defaults to 128.
max_bin_max – Maximum value for maximum number of bins. Defaults to 1024.
eta_min – Minimum value for learning rate (eta). Defaults to 1e-3.
eta_max – Maximum value for learning rate (eta). Defaults to 0.3.
steps_min – Minimum number of boosting rounds. Defaults to 1000.
steps_max – Maximum number of boosting rounds. Defaults to 1000.
verbosity_during_hyperparameter_tuning – Verbosity level during hyperparameter tuning. Defaults to 0.
verbosity_during_final_model_training – Verbosity level during final model training. Defaults to 0.
booster – List of booster types. Defaults to [“gbtree”].
grow_policy – List of grow policies. Defaults to [“depthwise”, “lossguide”].
tree_method – List of tree building methods. Defaults to [“exact”, “approx”, “hist”].
xgboost_objective – XGBoost objective. Defaults to “reg:squarederror”.
xgboost_eval_metric – XGBoost evaluation metric. Defaults to “rmse”.
xgboost_eval_metric_tune_direction – Direction to tune the evaluation metric. Defaults to “minimize”. Must be any of [‘minimize’, ‘maximize’]
- dict()
Return dictionary with all class attributes.
The implementation keeps backwards compatibility because this class was previously a Pydantic BaseModel.
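The role of `xgboost_eval_metric_tune_direction` can be illustrated with a small comparison helper. `is_improvement` is a hypothetical function, not part of BlueCast; it only shows why an error metric such as “rmse” pairs with “minimize”, while a score where higher is better would need “maximize”.

```python
def is_improvement(new_score: float, best_score: float, direction: str) -> bool:
    """Return True if new_score beats best_score under the given tune direction."""
    if direction not in ("minimize", "maximize"):
        raise ValueError("direction must be 'minimize' or 'maximize'")
    if direction == "minimize":
        return new_score < best_score
    return new_score > best_score


# A lower RMSE is better, so the default direction "minimize" fits.
rmse_improved = is_improvement(new_score=0.42, best_score=0.50, direction="minimize")
# The same numbers count as a regression when the metric should be maximized.
score_improved = is_improvement(new_score=0.42, best_score=0.50, direction="maximize")
```

If the evaluation metric is swapped for one that should grow, the direction has to be changed along with it; otherwise the tuner would favor the worst trials.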
- class bluecast.config.training_config.XgboostFinalParamConfig
Define final hyperparameters.
- params
- sample_weight: Dict[str, float] | None
- classification_threshold: float = 0.5
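As an illustration of `classification_threshold`, the sketch below turns predicted probabilities of the positive class into binary labels. The helper name and the strict `>` comparison are assumptions; the source only gives the default of 0.5.

```python
from typing import List


def apply_threshold(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Label a sample 1 when its positive-class probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


labels_default = apply_threshold([0.10, 0.55, 0.80])
labels_strict = apply_threshold([0.10, 0.55, 0.80], threshold=0.7)
```

Raising the threshold trades recall for precision: fewer samples are labeled positive, but those that are carry a higher predicted probability.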
- class bluecast.config.training_config.XgboostRegressionFinalParamConfig
Define final hyperparameters.
- params
- sample_weight: Dict[str, float] | None
- classification_threshold: float = 999
- class bluecast.config.training_config.CatboostTuneParamsConfig(depth_min: int = 1, depth_max: int = 10, l2_leaf_reg_min: float = 1e-08, l2_leaf_reg_max: float = 100.0, bagging_temperature_min: float = 0.0, bagging_temperature_max: float = 10.0, random_strength_min: float = 0.0, random_strength_max: float = 10.0, subsample_min: float = 0.1, subsample_max: float = 1.0, border_count_min: int = 32, border_count_max: int = 255, learning_rate_min: float = 0.001, learning_rate_max: float = 0.3, iterations_min: int = 1000, iterations_max: int = 1000, verbosity_during_hyperparameter_tuning: int = 0, verbosity_during_final_model_training: int = 0, bootstrap_type: List[str] | None = None, grow_policy: List[str] | None = None, catboost_objective: str = 'MultiClass', catboost_eval_metric: str = 'MultiClass', catboost_eval_metric_tune_direction: str = 'minimize')
Define hyperparameter tuning search space for CatBoost (classification or multiclass).
- Parameters:
depth_min – Minimum value for the depth of the trees. Defaults to 1.
depth_max – Maximum value for the depth of the trees. Defaults to 10.
l2_leaf_reg_min – Minimum value for L2 regularization term (l2_leaf_reg). Defaults to 1e-8.
l2_leaf_reg_max – Maximum value for L2 regularization term (l2_leaf_reg). Defaults to 100.
bagging_temperature_min – Minimum value for bagging temperature when bootstrap_type=’Bayesian’. Defaults to 0.0.
bagging_temperature_max – Maximum value for bagging temperature when bootstrap_type=’Bayesian’. Defaults to 10.0.
random_strength_min – Minimum value for the random strength. Defaults to 0.0.
random_strength_max – Maximum value for the random strength. Defaults to 10.0.
subsample_min – Minimum value of subsample ratio of the training instances. Defaults to 0.1.
subsample_max – Maximum value of subsample ratio of the training instances. Defaults to 1.0.
border_count_min – Minimum value for the number of splits for numerical features. Defaults to 32.
border_count_max – Maximum value for the number of splits for numerical features. Defaults to 255.
learning_rate_min – Minimum value for learning rate. Defaults to 1e-3.
learning_rate_max – Maximum value for learning rate. Defaults to 0.3.
iterations_min – Minimum number of boosting rounds (iterations). Defaults to 1000.
iterations_max – Maximum number of boosting rounds (iterations). Defaults to 1000.
verbosity_during_hyperparameter_tuning – Verbosity level during hyperparameter tuning. Defaults to 0.
verbosity_during_final_model_training – Verbosity level during final model training. Defaults to 0.
bootstrap_type – List of bootstrap types to consider. Defaults to [“Bayesian”, “Poisson”, “MVS”, “No”].
grow_policy – List of grow policies. Defaults to [“SymmetricTree”].
catboost_objective – CatBoost objective. Defaults to “MultiClass”.
catboost_eval_metric – CatBoost evaluation metric. Defaults to “MultiClass”.
catboost_eval_metric_tune_direction – Direction to tune the evaluation metric. Defaults to “minimize”. Must be any of [‘minimize’, ‘maximize’]
- dict()
Return dictionary with all class attributes.
The implementation keeps backwards compatibility as this class mimics a Pydantic BaseModel.
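The note that `bagging_temperature_*` applies only when `bootstrap_type=’Bayesian’` implies a conditional search space. The sketch below assumes, following CatBoost's documented parameter rules, that `subsample` is valid for the Poisson, Bernoulli and MVS bootstrap types and is dropped otherwise. The function name and the midpoint values are placeholders, not BlueCast's tuning code.

```python
from typing import Dict, Union

ParamValue = Union[str, float]


def build_bootstrap_params(bootstrap_type: str) -> Dict[str, ParamValue]:
    """Keep only the sampling parameters that match the bootstrap type."""
    params: Dict[str, ParamValue] = {"bootstrap_type": bootstrap_type}
    if bootstrap_type == "Bayesian":
        # Placeholder: midpoint of the default range 0.0 to 10.0.
        params["bagging_temperature"] = 5.0
    elif bootstrap_type in ("Poisson", "Bernoulli", "MVS"):
        # Placeholder: midpoint of the default range 0.1 to 1.0.
        params["subsample"] = 0.55
    return params  # "No" receives neither parameter


bayesian = build_bootstrap_params("Bayesian")
mvs = build_bootstrap_params("MVS")
no_bootstrap = build_bootstrap_params("No")
```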
- class bluecast.config.training_config.CatboostTuneParamsRegressionConfig(depth_min: int = 1, depth_max: int = 10, l2_leaf_reg_min: float = 1e-08, l2_leaf_reg_max: float = 100.0, bagging_temperature_min: float = 0.0, bagging_temperature_max: float = 10.0, random_strength_min: float = 0.0, random_strength_max: float = 10.0, subsample_min: float = 0.1, subsample_max: float = 1.0, border_count_min: int = 32, border_count_max: int = 255, learning_rate_min: float = 0.001, learning_rate_max: float = 0.3, iterations_min: int = 1000, iterations_max: int = 1000, verbosity_during_hyperparameter_tuning: int = 0, verbosity_during_final_model_training: int = 0, bootstrap_type: List[str] | None = None, grow_policy: List[str] | None = None, catboost_objective: str = 'RMSE', catboost_eval_metric: str = 'RMSE', catboost_eval_metric_tune_direction: str = 'minimize')
Define hyperparameter tuning search space for CatBoost (regression).
- Parameters:
depth_min – Minimum value for the depth of the trees. Defaults to 1.
depth_max – Maximum value for the depth of the trees. Defaults to 10.
l2_leaf_reg_min – Minimum value for L2 regularization term (l2_leaf_reg). Defaults to 1e-8.
l2_leaf_reg_max – Maximum value for L2 regularization term (l2_leaf_reg). Defaults to 100.
bagging_temperature_min – Minimum value for bagging temperature when bootstrap_type=’Bayesian’. Defaults to 0.0.
bagging_temperature_max – Maximum value for bagging temperature when bootstrap_type=’Bayesian’. Defaults to 10.0.
random_strength_min – Minimum value for the random strength. Defaults to 0.0.
random_strength_max – Maximum value for the random strength. Defaults to 10.0.
subsample_min – Minimum value of subsample ratio of the training instances. Defaults to 0.1.
subsample_max – Maximum value of subsample ratio of the training instances. Defaults to 1.0.
border_count_min – Minimum value for the number of splits for numerical features. Defaults to 32.
border_count_max – Maximum value for the number of splits for numerical features. Defaults to 255.
learning_rate_min – Minimum value for learning rate. Defaults to 1e-3.
learning_rate_max – Maximum value for learning rate. Defaults to 0.3.
iterations_min – Minimum number of boosting rounds (iterations). Defaults to 1000.
iterations_max – Maximum number of boosting rounds (iterations). Defaults to 1000.
verbosity_during_hyperparameter_tuning – Verbosity level during hyperparameter tuning. Defaults to 0.
verbosity_during_final_model_training – Verbosity level during final model training. Defaults to 0.
bootstrap_type – List of bootstrap types to consider. Defaults to [“Bayesian”, “Poisson”, “MVS”, “No”].
grow_policy – List of grow policies. Defaults to [“SymmetricTree”].
catboost_objective – CatBoost objective. Defaults to “RMSE”.
catboost_eval_metric – CatBoost evaluation metric. Defaults to “RMSE”.
catboost_eval_metric_tune_direction – Direction to tune the evaluation metric. Defaults to “minimize”. Must be any of [‘minimize’, ‘maximize’]
- dict()
Return dictionary with all class attributes.
The implementation keeps backwards compatibility as this class mimics a Pydantic BaseModel.
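When overriding bounds, a quick check that no `*_min` exceeds its `*_max` can catch typos early. `find_inverted_bounds` is a hypothetical helper operating on a plain dictionary such as the one returned by `dict()`; it is an illustration, not a documented BlueCast feature.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def find_inverted_bounds(config: Dict[str, Number]) -> List[str]:
    """Return parameter stems whose *_min value is larger than the *_max value."""
    inverted = []
    for key, low in config.items():
        if key.endswith("_min"):
            stem = key[: -len("_min")]
            high = config.get(f"{stem}_max")
            if high is not None and low > high:
                inverted.append(stem)
    return inverted


# Four of the documented CatBoost regression defaults.
defaults = {"depth_min": 1, "depth_max": 10, "learning_rate_min": 0.001, "learning_rate_max": 0.3}
edited = {**defaults, "depth_min": 12}  # typo: minimum above maximum

ok = find_inverted_bounds(defaults)
bad = find_inverted_bounds(edited)
```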