bluecast.config.training_config
Define training and common configuration parameters.
Configuration parameters are defined as Pydantic models (the classes below derive from pydantic.BaseModel). This gives users a Pythonic way to define the parameters and enables type checking and validation. The configuration parameters are used in the training pipeline and in the evaluation pipeline. Default configurations can be loaded, adjusted and passed into the blueprints.
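As a rough illustration of the load, adjust and pass-on workflow described above, the sketch below uses a small standard-library stand-in so that it runs without BlueCast installed. `MiniTrainingConfig` is hypothetical: it mirrors only a handful of `TrainingConfig` fields with the defaults documented on this page, and its range check on `train_size` is an assumption added for illustration, not BlueCast's validation logic.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniTrainingConfig:
    """Hypothetical stand-in for TrainingConfig (not the BlueCast class)."""

    # Defaults copied from the attribute list documented on this page.
    global_random_state: int = 33
    hyperparameter_tuning_rounds: int = 200
    hypertuning_cv_folds: int = 5
    autotune_model: bool = True
    train_size: float = 0.8

    def __post_init__(self) -> None:
        # Assumed sanity check for this sketch only. Pydantic models validate
        # field types automatically instead of relying on manual checks.
        if not 0.0 < self.train_size < 1.0:
            raise ValueError("train_size must lie strictly between 0 and 1")


# 1. Load the defaults.
default_config = MiniTrainingConfig()

# 2. Adjust a copy for a quick experiment; the defaults stay untouched.
quick_config = replace(
    default_config,
    hyperparameter_tuning_rounds=10,
    hypertuning_cv_folds=1,
)

# 3. The adjusted object is what would be handed to a blueprint.
print(quick_config)
```

With BlueCast installed, the same three steps would presumably apply to the real `TrainingConfig`: instantiate it, change the attributes of interest, and pass the instance into a blueprint.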
Module Contents
Classes
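TrainingConfig | Define general training parameters. |
XgboostTuneParamsConfig | Define hyperparameter tuning search space. |
XgboostTuneParamsRegressionConfig | Define hyperparameter tuning search space. |
XgboostFinalParamsConfig | Define final hyper parameters. |
XgboostRegressionFinalParamsConfig | Define final hyper parameters. |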
- class bluecast.config.training_config.TrainingConfig(/, **data: Any)
Bases: pydantic.BaseModel
Define general training parameters.
- Parameters:
global_random_state – Global random state to use for reproducibility.
increase_random_state_in_bluecast_cv_by – In BlueCastCV multiple models are trained. Define by how much the random state changes with each additional model.
shuffle_during_training – Whether to shuffle the data during training when hypertuning_cv_folds > 1.
hyperparameter_tuning_rounds – Number of hyperparameter tuning rounds. Not used when custom ML model is passed.
hyperparameter_tuning_max_runtime_secs – Maximum runtime in seconds for hyperparameter tuning. Not used when custom ML model is passed.
hypertuning_cv_folds – Number of cross-validation folds to use for hyperparameter tuning. Not used when custom ML model is passed.
hypertuning_cv_repeats – Number of repetitions for each cross-validation fold during hyperparameter tuning. Not used when custom ML model is passed.
sample_data_during_tuning – Whether to sample the data during tuning. Not used when custom ML model is passed.
sample_data_during_tuning_alpha – Alpha value for sampling the data during tuning. The higher alpha the fewer samples will be left. Not used when custom ML model is passed.
class_weight_during_dmatrix_creation – Whether to use class weights during DMatrix creation. Not used when custom ML model is passed.
early_stopping_rounds – Number of early stopping rounds during final training or when hyperparameter tuning follows a single train-test split. Not used when custom ML model is passed.
autotune_model – Whether to autotune the model. Not used when custom ML model is passed.
autotune_on_device – Whether to autotune on CPU or GPU. Choose any of [“auto”, “gpu”, “cpu”]. Not used when custom ML model is passed.
autotune_n_random_seeds – Number of random seeds to use for autotuning. This changes Optuna’s random seed only. The seed will be updated again after every nth trial. Not used when custom ML model is passed.
update_hyperparameter_search_space_after_nth_trial – Update the hyperparameter search space after the nth trial. Not used when custom ML model is passed.
plot_hyperparameter_tuning_overview – Whether to plot the hyperparameter tuning overview. Not used when custom ML model is passed.
enable_feature_selection – Whether to enable recursive feature selection.
calculate_shap_values – Whether to calculate shap values. Also used when custom ML model is passed. Not compatible with all ML models. See the SHAP documentation for more details.
shap_waterfall_indices – List of sample indices to plot. Each index represents a sample (e.g. [0, 1, 499]).
show_dependence_plots_of_top_n_features – Maximum number of dependence plots to show. Not used when custom ML model is passed.
store_shap_values_in_instance – Whether to store the SHAP values in the BlueCast instance. Not applicable when custom ML model is used.
train_size – Train size to use for train-test split.
train_split_stratify – Whether to stratify the train-test split. Not used when custom ML model is passed.
use_full_data_for_final_model – Whether to use the full data for the final model. This might cause overfitting. Not used when custom ML model is passed.
cardinality_threshold_for_onehot_encoding – Categorical features with a cardinality less than or equal to this threshold will be onehot encoded. The rest will be target encoded. Will be ignored if cat_encoding_via_ml_algorithm is set to true.
infrequent_categories_threshold – Categories with a frequency less than this threshold will be grouped into a common group. This is done to reduce the risk of overfitting. Will be ignored if cat_encoding_via_ml_algorithm is set to true.
cat_encoding_via_ml_algorithm – Whether to use an ML algorithm for categorical encoding. If True, the categorical encoding is done via an ML algorithm. If False, the categorical encoding is done via target encoding in the preprocessing steps. See the ReadMe for more details.
show_detailed_tuning_logs – Whether to show detailed tuning logs. Not used when custom ML model is passed.
enable_grid_search_fine_tuning – After hyperparameter tuning, run a grid search on a fine-grained grid based on the results of the previous hyperparameter tuning. Only possible when autotune_model is True.
gridsearch_nb_parameters_per_grid – Decides how many steps the grid shall have per parameter.
gridsearch_tuning_max_runtime_secs – Sets the maximum time in seconds the tuning shall run. The latest trial will still be finished, so the total runtime may exceed this limit.
experiment_name – Name of the experiment. Will be logged inside the ExperimentTracker.
logging_file_path – Path to the logging file. If None, the logging will be printed to the Jupyter notebook instead.
out_of_fold_dataset_store_path – Path to store the out of fold dataset. If None, the out of fold dataset will not be stored. Shall end with a slash. Only used when BlueCast instances are called with fit_eval method.
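The interplay of infrequent_categories_threshold and cardinality_threshold_for_onehot_encoding described above can be sketched with plain Python. This is a minimal illustration, not BlueCast's preprocessing code: the helper names, the "infrequent" label, the reading of frequency as an absolute count, and the order of the two steps (grouping first, then the cardinality check) are all assumptions.

```python
from collections import Counter
from typing import List


def group_infrequent(values: List[str], threshold: int, label: str = "infrequent") -> List[str]:
    """Replace categories seen fewer than `threshold` times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else label for v in values]


def choose_encoding(values: List[str], cardinality_threshold: int) -> str:
    """Return 'onehot' when cardinality <= threshold, otherwise 'target'."""
    return "onehot" if len(set(values)) <= cardinality_threshold else "target"


# Toy column: "a" occurs 6 times, "b" 5 times, "c" twice, "d" once.
column = ["a"] * 6 + ["b"] * 5 + ["c"] * 2 + ["d"]

# With a threshold of 5, "c" and "d" fall below it and are merged.
grouped = group_infrequent(column, threshold=5)
print(sorted(set(grouped)))  # ['a', 'b', 'infrequent']

# Three distinct categories remain, which is <= 5, so one-hot encoding is chosen.
print(choose_encoding(grouped, cardinality_threshold=5))  # onehot

# A column with ten distinct values exceeds the threshold.
many_ids = [f"id_{i}" for i in range(10)]
print(choose_encoding(many_ids, cardinality_threshold=5))  # target
```

Both thresholds default to 5 in TrainingConfig, and both are ignored when cat_encoding_via_ml_algorithm is set to true.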
- global_random_state: int = 33
- increase_random_state_in_bluecast_cv_by: int = 200
- shuffle_during_training: bool = True
- hyperparameter_tuning_rounds: int = 200
- hyperparameter_tuning_max_runtime_secs: int = 3600
- hypertuning_cv_folds: int = 5
- hypertuning_cv_repeats: int = 1
- sample_data_during_tuning: bool = False
- sample_data_during_tuning_alpha: float = 2.0
- precise_cv_tuning: bool = False
- early_stopping_rounds: int | None = 20
- autotune_model: bool = True
- autotune_on_device: Literal['auto', 'gpu', 'cpu'] = 'auto'
- autotune_n_random_seeds: int = 1
- update_hyperparameter_search_space_after_nth_trial: int = 200
- plot_hyperparameter_tuning_overview: bool = True
- enable_feature_selection: bool = False
- calculate_shap_values: bool = True
- shap_waterfall_indices: List[int] = []
- show_dependence_plots_of_top_n_features: int = 0
- store_shap_values_in_instance: bool = False
- train_size: float = 0.8
- train_split_stratify: bool = True
- use_full_data_for_final_model: bool = True
- cardinality_threshold_for_onehot_encoding: int = 5
- infrequent_categories_threshold: int = 5
- cat_encoding_via_ml_algorithm: bool = False
- show_detailed_tuning_logs: bool = False
- optuna_sampler_n_startup_trials: int = 10
- enable_grid_search_fine_tuning: bool = False
- gridsearch_tuning_max_runtime_secs: int = 3600
- gridsearch_nb_parameters_per_grid: int = 5
- bluecast_cv_train_n_model: Tuple[int, int] = (5, 1)
- logging_file_path: str | None
- experiment_name: str = 'new experiment'
- out_of_fold_dataset_store_path: str | None
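To make the relationship between global_random_state and increase_random_state_in_bluecast_cv_by concrete, here is a minimal sketch of how a per-model random state could be derived. It assumes a simple linear shift per additional model, which matches the parameter description but is not taken from the BlueCastCV source. The helper name `bluecast_cv_seeds` and the choice of five models are hypothetical.

```python
from typing import List


def bluecast_cv_seeds(global_random_state: int, increase_by: int, n_models: int) -> List[int]:
    """Return one random state per model, each shifted by a fixed step."""
    return [global_random_state + i * increase_by for i in range(n_models)]


# Defaults from TrainingConfig (33 and 200); n_models=5 is an illustrative choice.
seeds = bluecast_cv_seeds(global_random_state=33, increase_by=200, n_models=5)
print(seeds)  # [33, 233, 433, 633, 833]
```

Under this assumption, every model sees a different but reproducible seed, so rerunning with the same configuration yields the same sequence.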
- class bluecast.config.training_config.XgboostTuneParamsConfig(/, **data: Any)
Bases: pydantic.BaseModel
Define hyperparameter tuning search space.
- max_depth_min: int = 1
- max_depth_max: int = 10
- alpha_min: float = 1e-08
- alpha_max: float = 100
- lambda_min: float = 1
- lambda_max: float = 100
- gamma_min: float = 1e-08
- gamma_max: float = 10
- min_child_weight_min: float = 1
- min_child_weight_max: float = 100
- sub_sample_min: float = 0.1
- sub_sample_max: float = 1.0
- col_sample_by_tree_min: float = 0.1
- col_sample_by_tree_max: float = 1.0
- col_sample_by_level_min: float = 1.0
- col_sample_by_level_max: float = 1.0
- max_bin_min: int = 128
- max_bin_max: int = 1024
- eta_min: float = 0.001
- eta_max: float = 0.3
- steps_min: int = 1000
- steps_max: int = 1000
- verbosity_during_hyperparameter_tuning: int = 0
- verbosity_during_final_model_training: int = 0
- booster: List[str] = ['gbtree']
- grow_policy: List[str] = ['depthwise', 'lossguide']
- tree_method: List[str] = ['exact', 'approx', 'hist']
- xgboost_objective: str = 'multi:softprob'
- xgboost_eval_metric: str = 'mlogloss'
- xgboost_eval_metric_tune_direction: Literal['minimize', 'maximize'] = 'minimize'
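The `*_min`/`*_max` pairs and the list-valued attributes above describe a search space from which candidate hyperparameter sets are drawn. The sketch below shows one way to read them, using the standard library's `random` module as a stand-in for the actual tuner (the TrainingConfig documentation mentions Optuna). Uniform sampling, the `SEARCH_SPACE` dictionary layout and the helper `sample_candidate` are assumptions for illustration; the real tuner may sample some parameters on a different scale.

```python
import random
from typing import Any, Dict

# A subset of the documented XgboostTuneParamsConfig defaults.
SEARCH_SPACE: Dict[str, Any] = {
    "max_depth": (1, 10),                        # integer range
    "eta": (0.001, 0.3),                         # float range
    "sub_sample": (0.1, 1.0),                    # float range
    "col_sample_by_level": (1.0, 1.0),           # min == max: effectively fixed
    "steps": (1000, 1000),                       # min == max: effectively fixed
    "tree_method": ["exact", "approx", "hist"],  # categorical choices
}


def sample_candidate(space: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Draw one candidate: a choice for lists, a value in [min, max] for ranges."""
    candidate: Dict[str, Any] = {}
    for name, bounds in space.items():
        if isinstance(bounds, list):
            candidate[name] = rng.choice(bounds)
        else:
            low, high = bounds
            if isinstance(low, int) and isinstance(high, int):
                candidate[name] = rng.randint(low, high)
            else:
                candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(33)  # fixed seed for reproducibility
candidate = sample_candidate(SEARCH_SPACE, rng)
print(candidate)
```

Note how the defaults for steps and col_sample_by_level use identical minimum and maximum values, which pins those parameters to a single value. Widening or narrowing a pair is the way to open up or restrict the search for that parameter.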
- class bluecast.config.training_config.XgboostTuneParamsRegressionConfig(/, **data: Any)
Bases: pydantic.BaseModel
Define hyperparameter tuning search space.
- max_depth_min: int = 1
- max_depth_max: int = 10
- alpha_min: float = 1e-08
- alpha_max: float = 100
- lambda_min: float = 1
- lambda_max: float = 100
- gamma_min: float = 1e-08
- gamma_max: float = 10
- min_child_weight_min: float = 1
- min_child_weight_max: float = 100
- sub_sample_min: float = 0.1
- sub_sample_max: float = 1.0
- col_sample_by_tree_min: float = 0.1
- col_sample_by_tree_max: float = 1.0
- col_sample_by_level_min: float = 1.0
- col_sample_by_level_max: float = 1.0
- max_bin_min: int = 128
- max_bin_max: int = 1025
- eta_min: float = 0.001
- eta_max: float = 0.3
- steps_min: int = 1000
- steps_max: int = 1000
- verbosity_during_hyperparameter_tuning: int = 0
- verbosity_during_final_model_training: int = 0
- booster: List[str] = ['gbtree']
- grow_policy: List[str] = ['depthwise', 'lossguide']
- tree_method: List[str] = ['exact', 'approx', 'hist']
- xgboost_objective: str = 'reg:squarederror'
- xgboost_eval_metric: str = 'rmse'
- xgboost_eval_metric_tune_direction: Literal['minimize', 'maximize'] = 'minimize'
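The xgboost_eval_metric_tune_direction setting has to agree with the chosen xgboost_eval_metric: error metrics such as rmse or mlogloss should be minimized, while a score where higher is better would need 'maximize'. The following sketch illustrates the effect of the direction when picking the best trial. The helper `pick_best_trial`, the trial names and the scores are invented for illustration and are not part of BlueCast.

```python
from typing import Dict, Tuple


def pick_best_trial(scores: Dict[str, float], direction: str) -> Tuple[str, float]:
    """Return the (trial name, score) pair that is best for the given direction."""
    if direction not in ("minimize", "maximize"):
        raise ValueError("direction must be 'minimize' or 'maximize'")
    chooser = min if direction == "minimize" else max
    best_name = chooser(scores, key=scores.get)
    return best_name, scores[best_name]


# Made-up RMSE values: lower is better, so the direction is 'minimize'.
rmse_by_trial = {"trial_0": 4.2, "trial_1": 3.7, "trial_2": 5.1}
print(pick_best_trial(rmse_by_trial, "minimize"))  # ('trial_1', 3.7)

# Made-up scores where higher is better would need 'maximize' instead.
score_by_trial = {"trial_0": 0.81, "trial_1": 0.86, "trial_2": 0.79}
print(pick_best_trial(score_by_trial, "maximize"))  # ('trial_1', 0.86)
```

Pairing a higher-is-better metric with 'minimize' would make the tuner favor the worst trial, so the direction should be changed whenever the evaluation metric is swapped.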