bluecast.preprocessing.feature_creation

Module Contents

Classes

AddRowLevelAggFeatures

GroupLevelAggFeatures

FeatureClusteringScorer

Functions

add_groupby_agg_feats(→ pandas.DataFrame)

Add groupby aggregation features to a DataFrame.

class bluecast.preprocessing.feature_creation.AddRowLevelAggFeatures
get_original_features(df: pandas.DataFrame, target_col: str | None) None
add_row_level_mean(df: pandas.DataFrame, feature_to_agg: List[str | int | float], agg_col_name: str = 'row_mean') pandas.DataFrame

Add row level mean of features to a dataframe.

Parameters:
  • df – Pandas DataFrame holding all features.

  • feature_to_agg – List of column names indicating which features to aggregate.

  • agg_col_name – Name of the new column.

Returns:

Original Pandas DataFrame with added row level means.
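As a rough illustration of what a row-level mean feature looks like, the following sketch uses plain pandas. The helper `row_mean_sketch` is hypothetical and is not BlueCast's implementation; working on a copy of the input is also an assumption.

```python
import pandas as pd


def row_mean_sketch(
    df: pd.DataFrame, feature_to_agg: list, agg_col_name: str = "row_mean"
) -> pd.DataFrame:
    """Hypothetical stand-in that mirrors the documented signature."""
    out = df.copy()  # assumption: the library may instead modify df in place
    # axis=1 reduces across the selected columns, giving one value per row
    out[agg_col_name] = out[feature_to_agg].mean(axis=1)
    return out


df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0], "c": [10.0, 20.0]})
result = row_mean_sketch(df, ["a", "b"])  # column "c" is not part of the mean
```

Only the columns listed in `feature_to_agg` contribute to the new column; all original columns are kept.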

add_row_level_std(df: pandas.DataFrame, feature_to_agg: List[str | int | float], agg_col_name: str = 'row_std') pandas.DataFrame

Add row level standard deviation of features to a dataframe.

Parameters:
  • df – Pandas DataFrame holding all features.

  • feature_to_agg – List of column names indicating which features to aggregate.

  • agg_col_name – Name of the new column.

Returns:

Original Pandas DataFrame with added row level standard deviations.
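A row-level standard deviation depends on the degrees-of-freedom convention. The sketch below shows both conventions in plain pandas; which one `add_row_level_std` uses is not stated in this reference, so it is worth verifying before relying on exact values.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 2.0], "c": [5.0, 2.0]})
cols = ["a", "b", "c"]

# pandas defaults to the sample standard deviation (ddof=1)
sample_std = df[cols].std(axis=1)

# population standard deviation (ddof=0), shown for comparison
population_std = df[cols].std(axis=1, ddof=0)
```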

add_row_level_min(df: pandas.DataFrame, feature_to_agg: List[str | int | float], agg_col_name: str = 'row_min') pandas.DataFrame

Add row level min of features to a dataframe.

Parameters:
  • df – Pandas DataFrame holding all features.

  • feature_to_agg – List of column names indicating which features to aggregate.

  • agg_col_name – Name of the new column.

Returns:

Original Pandas DataFrame with added row level minimums.

add_row_level_max(df: pandas.DataFrame, feature_to_agg: List[str | int | float], agg_col_name: str = 'row_max') pandas.DataFrame

Add row level max of features to a dataframe.

Parameters:
  • df – Pandas DataFrame holding all features.

  • feature_to_agg – List of column names indicating which features to aggregate.

  • agg_col_name – Name of the new column.

Returns:

Original Pandas DataFrame with added row level maximums.

add_row_level_sum(df: pandas.DataFrame, feature_to_agg: List[str | int | float], agg_col_name: str = 'row_sum') pandas.DataFrame

Add row level sum of features to a dataframe.

Parameters:
  • df – Pandas DataFrame holding all features.

  • feature_to_agg – List of column names indicating which features to aggregate.

  • agg_col_name – Name of the new column.

Returns:

Original Pandas DataFrame with added row level sums.
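The min, max, and sum variants follow the same row-wise pattern. A minimal pandas sketch, assuming the target column is simply left out of `feature_to_agg` (the example data and the `target` column are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [3, 2], "target": [0, 1]})
feature_to_agg = ["a", "b"]  # the target column is deliberately excluded

out = df.copy()
# Each reduction runs across the selected columns of a single row (axis=1)
out["row_min"] = out[feature_to_agg].min(axis=1)
out["row_max"] = out[feature_to_agg].max(axis=1)
out["row_sum"] = out[feature_to_agg].sum(axis=1)
```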

add_row_level_agg_features(df: pandas.DataFrame, target_col: str | None = None) pandas.DataFrame
class bluecast.preprocessing.feature_creation.GroupLevelAggFeatures
create_groupby_agg_features(df: pandas.DataFrame | polars.DataFrame, groupby_columns: List[str], columns_to_agg: List[str] | None, target_col: str | None, aggregations: List[str] | None = None) pandas.DataFrame

Create aggregations based on groups for a given DataFrame.

Parameters:
  • df – Either Pandas or Polars DataFrame.

  • groupby_columns – List of column names to use for the groupby.

  • columns_to_agg – List of columns to aggregate. If empty, all columns except the target column (target_col) are used.

  • target_col – Target column name. Will be ignored during aggregation.

  • aggregations – Aggregations to perform. If not provided, ["min", "max", "mean", "sum"] will be used.

Returns:

Aggregated Pandas DataFrame
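The group-level aggregation described above can be sketched with a plain pandas `groupby`. This is an illustration under assumptions, not the library's code: the example data, the exclusion of the group keys from the fallback column list, and the `column_aggregation` naming of the output columns are all assumed.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["A", "A", "B"],
        "sales": [10, 30, 5],
        "visits": [1, 3, 2],
        "target": [0, 1, 0],
    }
)

groupby_columns = ["store"]
target_col = "target"
aggregations = ["min", "max", "mean", "sum"]  # the documented default

# Mimic the documented fallback: aggregate everything except the target
# (and, as an assumption, the group keys themselves).
columns_to_agg = [c for c in df.columns if c not in groupby_columns + [target_col]]

agg_df = df.groupby(groupby_columns)[columns_to_agg].agg(aggregations)
# Flatten the (column, aggregation) MultiIndex into single-level names
agg_df.columns = [f"{col}_{agg}" for col, agg in agg_df.columns]
agg_df = agg_df.reset_index()
```

The result has one row per group rather than one row per input row.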

bluecast.preprocessing.feature_creation.add_groupby_agg_feats(df: pandas.DataFrame, groupby_cols: List[str], to_group_cols: List[str], num_col_prefix: str, target_col: str, aggregations: List[str]) pandas.DataFrame

Add groupby aggregation features to a DataFrame.

Parameters:
  • df – Pandas DataFrame containing all relevant columns.

  • groupby_cols – List of columns to use as groups.

  • to_group_cols – List of columns to aggregate.

  • num_col_prefix – Prefix to add to the new columns.

  • target_col – String indicating the target column.

  • aggregations – List of aggregations to perform. If not provided, ["min", "max", "mean", "sum"] will be used.

Returns:

Enriched DataFrame.
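Unlike a plain aggregation, enriching a DataFrame keeps every input row and attaches the group statistic to it. One way to sketch this in pandas is `GroupBy.transform`; whether `add_groupby_agg_feats` works this way internally, and how it names the new columns, are assumptions here.

```python
import pandas as pd

df = pd.DataFrame(
    {"store": ["A", "A", "B"], "sales": [10.0, 30.0, 5.0], "target": [0, 1, 0]}
)
groupby_cols = ["store"]
to_group_cols = ["sales"]
num_col_prefix = "grp"  # hypothetical prefix value
aggregations = ["min", "max"]

enriched = df.copy()
grouped = df.groupby(groupby_cols)
for col in to_group_cols:
    for agg in aggregations:
        # transform broadcasts the group statistic back to every row of the group
        enriched[f"{num_col_prefix}_{col}_{agg}"] = grouped[col].transform(agg)
```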

class bluecast.preprocessing.feature_creation.FeatureClusteringScorer(cluster_settings: Dict[str, Any], random_state: int = 25)
_fit_reindex_clusters_by_mean(temp_df: pandas.DataFrame, feature_name: str, higher_is_better: bool = True) numpy.ndarray

Fix cluster indices.

Cluster indices do not necessarily follow the order of the original feature (e.g. the highest values might end up in cluster 0). This function reindexes the clusters so that the total value makes sense.

Parameters:
  • temp_df – DataFrame containing two columns: the ‘cluster’ and the original feature

  • feature_name – String indicating the name of the original feature.

  • higher_is_better – Boolean indicating if the cluster index should rise with increasing values of the original feature.

Returns:

NumPy array with corrected cluster indices
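The reindexing idea can be sketched as follows: compute the mean of the original feature per cluster, rank the clusters by that mean, and map the old labels to the ranks. The function `reindex_clusters_by_mean` below is a hypothetical stand-in, not the private method itself; only the `cluster` column name comes from the parameter description above.

```python
import pandas as pd


def reindex_clusters_by_mean(
    temp_df: pd.DataFrame, feature_name: str, higher_is_better: bool = True
):
    # Mean of the original feature within each cluster
    means = temp_df.groupby("cluster")[feature_name].mean()
    # Ascending order gives index 0 to the lowest mean when higher is better
    ordered = means.sort_values(ascending=higher_is_better).index
    mapping = {old: new for new, old in enumerate(ordered)}
    return temp_df["cluster"].map(mapping).to_numpy()


temp_df = pd.DataFrame(
    {"cluster": [0, 0, 1, 2], "spend": [100.0, 120.0, 5.0, 50.0]}
)
new_idx = reindex_clusters_by_mean(temp_df, "spend")
flipped_idx = reindex_clusters_by_mean(temp_df, "spend", higher_is_better=False)
```

In this example the original cluster 0 holds the largest values, so it is relabelled as the highest index when `higher_is_better` is true and as index 0 otherwise.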

_predict_reindex_clusters_by_mean(temp_df: pandas.DataFrame, feature_name: str) numpy.ndarray

Fix cluster indices.

Cluster indices do not necessarily follow the order of the original feature (e.g. the highest values might end up in cluster 0). This function reindexes the clusters so that the total value makes sense.

Parameters:
  • temp_df – DataFrame containing two columns: the ‘cluster’ and the original feature

  • feature_name – String indicating the name of the original feature.

Returns:

NumPy array with corrected cluster indices

_fit_cluster_feature(df: pandas.DataFrame, feature_name: str, nb_clusters: int, higher_is_better: bool) numpy.ndarray

Cluster individual feature.

Parameters:
  • df – DataFrame with original features.

  • feature_name – String indicating the feature name.

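  • nb_clusters – Integer indicating how many clusters shall be found.

  • higher_is_better – Boolean indicating if the cluster index should rise with increasing values of the original feature.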

Returns:

Numpy array with cluster ids

_predict_cluster_feature(df: pandas.DataFrame, feature_name: str) numpy.ndarray

Cluster individual feature.

Parameters:
  • df – DataFrame with original features.

  • feature_name – String indicating the feature name.

Returns:

Numpy array with cluster ids

fit_predict_cluster(df: pandas.DataFrame, keep_original_features: bool = True)

Calculate cluster (e.g. RFM) scores based on input features.

Parameters:
  • df – Pandas DataFrame including the original features. Additional features will be ignored.

  • keep_original_features – If true, return clusters and original dataframe. Otherwise return RFM results only.

Returns:

Pandas DataFrame with RFM scores
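To make the scoring idea concrete, the sketch below builds one ordered score per feature and sums them into a total. It is not the library's algorithm: equal-width binning via `pandas.cut` is swapped in for the clustering step, and the `settings` layout, the example data, and the `*_score` / `total_score` column names are all hypothetical, since the schema of `cluster_settings` is not documented in this reference.

```python
import pandas as pd

# Hypothetical settings layout, chosen only for this illustration
settings = {
    "recency_days": {"nb_clusters": 2, "higher_is_better": False},
    "spend": {"nb_clusters": 2, "higher_is_better": True},
}

df = pd.DataFrame(
    {"recency_days": [1, 2, 30, 40], "spend": [500.0, 20.0, 450.0, 10.0]}
)

scores = pd.DataFrame(index=df.index)
for feature, cfg in settings.items():
    # Stand-in for clustering: equal-width bins labelled 0..k-1 in ascending order
    bins = pd.cut(df[feature], bins=cfg["nb_clusters"], labels=False)
    if not cfg["higher_is_better"]:
        # Flip the labels so that low raw values receive the high score
        bins = (cfg["nb_clusters"] - 1) - bins
    scores[f"{feature}_score"] = bins

# Summing is only meaningful because every score is oriented the same way
scores["total_score"] = scores.sum(axis=1)
```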

predict_cluster(df: pandas.DataFrame, keep_original_features: bool = True)

Calculate cluster (e.g. RFM) scores based on input features.

Parameters:
  • df – Pandas DataFrame including the original features. Additional features will be ignored.

  • keep_original_features – If true, return clusters and original dataframe. Otherwise return RFM results only.

Returns:

Pandas DataFrame with RFM scores