:py:mod:`bluecast.preprocessing.feature_creation`
=================================================

.. py:module:: bluecast.preprocessing.feature_creation


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   bluecast.preprocessing.feature_creation.AddRowLevelAggFeatures
   bluecast.preprocessing.feature_creation.GroupLevelAggFeatures
   bluecast.preprocessing.feature_creation.FeatureClusteringScorer



Functions
~~~~~~~~~

.. autoapisummary::

   bluecast.preprocessing.feature_creation.add_groupby_agg_feats



.. py:class:: AddRowLevelAggFeatures


   .. py:method:: get_original_features(df: pandas.DataFrame, target_col: Optional[str]) -> None


   .. py:method:: add_row_level_mean(df: pandas.DataFrame, feature_to_agg: List[Union[str, int, float]], agg_col_name: str = 'row_mean') -> pandas.DataFrame

      Add the row level mean of features to a DataFrame.

      :param df: Pandas DataFrame holding all features.
      :param feature_to_agg: List of column names indicating which features to aggregate.
      :param agg_col_name: Name of the new column.
      :return: Original Pandas DataFrame with added row level means.


   .. py:method:: add_row_level_std(df: pandas.DataFrame, feature_to_agg: List[Union[str, int, float]], agg_col_name: str = 'row_std') -> pandas.DataFrame

      Add the row level standard deviation of features to a DataFrame.

      :param df: Pandas DataFrame holding all features.
      :param feature_to_agg: List of column names indicating which features to aggregate.
      :param agg_col_name: Name of the new column.
      :return: Original Pandas DataFrame with added row level standard deviations.


   .. py:method:: add_row_level_min(df: pandas.DataFrame, feature_to_agg: List[Union[str, int, float]], agg_col_name: str = 'row_min') -> pandas.DataFrame

      Add the row level minimum of features to a DataFrame.

      :param df: Pandas DataFrame holding all features.
      :param feature_to_agg: List of column names indicating which features to aggregate.
      :param agg_col_name: Name of the new column.
      :return: Original Pandas DataFrame with added row level minima.


   .. py:method:: add_row_level_max(df: pandas.DataFrame, feature_to_agg: List[Union[str, int, float]], agg_col_name: str = 'row_max') -> pandas.DataFrame

      Add the row level maximum of features to a DataFrame.

      :param df: Pandas DataFrame holding all features.
      :param feature_to_agg: List of column names indicating which features to aggregate.
      :param agg_col_name: Name of the new column.
      :return: Original Pandas DataFrame with added row level maxima.


   .. py:method:: add_row_level_sum(df: pandas.DataFrame, feature_to_agg: List[Union[str, int, float]], agg_col_name: str = 'row_sum') -> pandas.DataFrame

      Add the row level sum of features to a DataFrame.

      :param df: Pandas DataFrame holding all features.
      :param feature_to_agg: List of column names indicating which features to aggregate.
      :param agg_col_name: Name of the new column.
      :return: Original Pandas DataFrame with added row level sums.


   .. py:method:: add_row_level_agg_features(df: pandas.DataFrame, target_col: Optional[str] = None) -> pandas.DataFrame



.. py:class:: GroupLevelAggFeatures


   .. py:method:: create_groupby_agg_features(df: Union[pandas.DataFrame, polars.DataFrame], groupby_columns: List[str], columns_to_agg: Optional[List[str]], target_col: Optional[str], aggregations: Optional[List[str]] = None) -> pandas.DataFrame

      Create aggregations based on groups for a given DataFrame.

      :param df: Either a Pandas or a Polars DataFrame.
      :param groupby_columns: List of column names to use for the groupby.
      :param columns_to_agg: List of columns to aggregate. If empty, all columns except the target column
          (target_col) will be chosen.
      :param target_col: Target column name. Will be ignored during aggregation.
      :param aggregations: Aggregations to perform. If not provided, ["min", "max", "mean", "sum"] will be used.
      :return: Aggregated Pandas DataFrame.



.. py:function:: add_groupby_agg_feats(df: pandas.DataFrame, groupby_cols: List[str], to_group_cols: List[str], num_col_prefix: str, target_col: str, aggregations: List[str]) -> pandas.DataFrame

   Add groupby aggregation features to a DataFrame.

   :param df: Pandas DataFrame containing all relevant columns.
   :param groupby_cols: List of columns to use as groups.
   :param to_group_cols: List of columns to aggregate.
   :param num_col_prefix: Prefix to add to the new columns.
   :param target_col: String indicating the target column.
   :param aggregations: List of aggregations to perform. If not provided, ["min", "max", "mean", "sum"] will be used.
   :return: The enriched DataFrame.


.. py:class:: FeatureClusteringScorer(cluster_settings: Dict[str, Any], random_state: int = 25)


   .. py:method:: _fit_reindex_clusters_by_mean(temp_df: pandas.DataFrame, feature_name: str, higher_is_better: bool = True) -> numpy.ndarray

      Fix cluster indices.

      Cluster indices do not follow the order of the original feature (e.g. the highest value might be cluster 0).
      This function reindexes the cluster indices so that the total value makes sense.

      :param temp_df: DataFrame containing two columns: the 'cluster' and the original feature.
      :param feature_name: String indicating the name of the original feature.
      :param higher_is_better: Boolean indicating if the cluster index should rise with increasing values of the
          original feature.
      :return: NumPy array with corrected cluster indices.


   .. py:method:: _predict_reindex_clusters_by_mean(temp_df: pandas.DataFrame, feature_name: str) -> numpy.ndarray

      Fix cluster indices.

      Cluster indices do not follow the order of the original feature (e.g. the highest value might be cluster 0).
      This function reindexes the cluster indices so that the total value makes sense.

      :param temp_df: DataFrame containing two columns: the 'cluster' and the original feature.
      :param feature_name: String indicating the name of the original feature.
      :return: NumPy array with corrected cluster indices.


   .. py:method:: _fit_cluster_feature(df: pandas.DataFrame, feature_name: str, nb_clusters: int, higher_is_better: bool) -> numpy.ndarray

      Cluster an individual feature.

      :param df: DataFrame with original features.
      :param feature_name: String indicating the feature name.
      :param nb_clusters: Integer indicating how many clusters shall be found.
      :param higher_is_better: Boolean indicating if the cluster index should rise with increasing values of the
          original feature.
      :return: NumPy array with cluster ids.


   .. py:method:: _predict_cluster_feature(df: pandas.DataFrame, feature_name: str) -> numpy.ndarray

      Cluster an individual feature.

      :param df: DataFrame with original features.
      :param feature_name: String indicating the feature name.
      :return: NumPy array with cluster ids.


   .. py:method:: fit_predict_cluster(df: pandas.DataFrame, keep_original_features: bool = True)

      Calculate cluster (i.e. RFM) scores based on input features.

      :param df: Pandas DataFrame including the original features. Additional features will be ignored.
      :param keep_original_features: If true, return the clusters and the original DataFrame. Otherwise return the
          RFM results only.
      :return: Pandas DataFrame with RFM scores.


   .. py:method:: predict_cluster(df: pandas.DataFrame, keep_original_features: bool = True)

      Calculate cluster (i.e. RFM) scores based on input features.

      :param df: Pandas DataFrame including the original features. Additional features will be ignored.
      :param keep_original_features: If true, return the clusters and the original DataFrame. Otherwise return the
          RFM results only.
      :return: Pandas DataFrame with RFM scores.
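
The following is a minimal sketch of two ideas described in this module: row level aggregation and reindexing cluster labels by the per-cluster mean of a feature. It is not BlueCast's implementation. The helper names ``row_level_aggregates`` and ``reindex_clusters_by_mean`` and the sample data are hypothetical and use plain pandas only. The column names ``row_mean``, ``row_std``, ``row_min``, ``row_max``, ``row_sum`` and ``cluster`` follow the defaults and descriptions documented above.

```python
import numpy as np
import pandas as pd


def row_level_aggregates(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: append row-wise aggregates of `cols` to a copy of `df`."""
    out = df.copy()
    out["row_mean"] = df[cols].mean(axis=1)
    out["row_std"] = df[cols].std(axis=1)  # pandas default: sample std (ddof=1)
    out["row_min"] = df[cols].min(axis=1)
    out["row_max"] = df[cols].max(axis=1)
    out["row_sum"] = df[cols].sum(axis=1)
    return out


def reindex_clusters_by_mean(
    temp_df: pd.DataFrame, feature_name: str, higher_is_better: bool = True
) -> np.ndarray:
    """Hypothetical helper: relabel clusters so the labels follow the feature's order.

    Assumes `temp_df` has a 'cluster' column and the original feature column.
    """
    # Mean of the original feature within each cluster.
    cluster_means = temp_df.groupby("cluster")[feature_name].mean()
    # Sort clusters so that label 0 goes to the lowest mean when higher_is_better
    # is True, and to the highest mean otherwise.
    ordered = cluster_means.sort_values(ascending=higher_is_better).index
    mapping = {old: new for new, old in enumerate(ordered)}
    return temp_df["cluster"].map(mapping).to_numpy()


# Usage with made-up data.
features = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]})
enriched = row_level_aggregates(features, ["a", "b"])

clustered = pd.DataFrame(
    {"spend": [100.0, 110.0, 5.0, 7.0, 50.0], "cluster": [0, 0, 1, 1, 2]}
)
new_ids = reindex_clusters_by_mean(clustered, "spend")
print(new_ids)  # [2 2 0 0 1]
```

In the sample data the clustering step happened to assign label 0 to the highest-spend rows. After reindexing, the label rises with the mean of ``spend``, so labels from several features can be summed into a meaningful total score.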