bluecast.eda.analyse
Module Contents
Functions
| find_bind_with_with_freedman_diaconis | |
| plot_pie_chart | Create a pie chart with labels, sizes, and optional explosion. |
| plot_count_pair | Compare the counts between two DataFrames of the chosen provided categorical column. |
| plot_count_pairs | Compare the counts between two DataFrames of each categorical column in the provided list. |
| univariate_plots | Plots univariate plots for all the columns in the dataframe. Only numerical columns are expected. |
| bi_variate_plots | Plots bivariate plots for all column combinations in the dataframe. |
| correlation_heatmap | Plots half of the heatmap showing correlations of all features. |
| correlation_to_target | Plots correlations for all the columns in the dataframe in relation to the target column. |
| plot_against_target_for_regression | Creates scatter plots for each column in num_columns against the target_col. |
| plot_pca | Plots PCA for the dataframe. The target column must be part of the provided DataFrame. |
| plot_pca_cumulative_variance | Plot the cumulative variance of principal components. |
| plot_pca_biplot | Plots PCA biplot for the dataframe. |
| plot_tsne | Plots t-SNE for the dataframe. The target column must be part of the provided DataFrame. |
| conditional_entropy | |
| theil_u | |
| plot_theil_u_heatmap | Plot a heatmap for categorical data using Theil's U. |
| plot_null_percentage | |
| check_unique_values | Check whether the number of unique values in each column is close to the total number of rows (above the defined threshold). |
| plot_classification_target_distribution_within_categories | Plot distribution of target across categorical features. |
| mutual_info_to_target | Plots mutual information scores for all the categorical columns in the DataFrame in relation to the target column. |
| plot_ecdf | Plot the empirical cumulative distribution function (ECDF) and histogram. |
| plot_distribution_by_time | Plot the distribution of a feature over time. |
| plot_error_distributions | Plots bivariate plots for each column in the dataframe with respect to the target. |
| plot_andrews_curve | Plot Andrews curve. |
| plot_distribution_pairs | Compare distributions of two datasets for a given feature. |
- bluecast.eda.analyse.find_bind_with_with_freedman_diaconis(data: numpy.ndarray)
- bluecast.eda.analyse.plot_pie_chart(df: pandas.DataFrame, column: str, explode: List[float] | None = None, colors: List[str] | None = None) → None
Create a pie chart with labels, sizes, and optional explosion.
- Parameters:
df – Pandas DataFrame holding the column of interest
column – The column to be plotted
explode – (Optional) List of numerical values representing the explosion distance for each segment
colors – (Optional) List with hexadecimal representations of colors in the RGB color model
- bluecast.eda.analyse.plot_count_pair(df_1: pandas.DataFrame, df_2: pandas.DataFrame, df_aliases: List[str] | None, feature: str, order: List[str] | None = None, palette: List[str] | None = None) → None
Compare the counts between two DataFrames of the chosen provided categorical column.
- Parameters:
df_1 – Pandas DataFrame, e.g. the train dataset
df_2 – Pandas DataFrame, e.g. the test dataset
df_aliases – List with names of DataFrames that shall be shown on the count plots to represent them. Format: [df_1 representation, df_2 representation]
feature – String indicating categorical column to plot
hue – Listed in the docstring but absent from the signature above; see the hue argument of sns.countplot
order – List with category names to define the order they appear in the plot
palette – List with hexadecimal representations of colors in the RGB color model
- bluecast.eda.analyse.plot_count_pairs(df_1: pandas.DataFrame, df_2: pandas.DataFrame, cat_cols: List[str], df_aliases: List[str] | None = None, palette: List[str] | None = None) → None
Compare the counts between two DataFrames of each categorical column in the provided list.
- Parameters:
df_1 – Pandas DataFrame, e.g. the train dataset
df_2 – Pandas DataFrame, e.g. the test dataset
df_aliases – List with names of DataFrames that shall be shown on the count plots to represent them. Format: [df_1 representation, df_2 representation]
cat_cols – List with strings indicating categorical column names to plot
palette – List with hexadecimal representations of colors in the RGB color model
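The comparison behind these count plots can be illustrated without any plotting. The following is a minimal sketch in plain Python, not the library's implementation; the helper name `count_pair` and the sample data are hypothetical.

```python
from collections import Counter


def count_pair(values_1, values_2, order=None):
    """Return {category: (count_in_1, count_in_2)} for two label sequences.

    Hypothetical helper for illustration only; it is not part of bluecast.
    """
    counts_1, counts_2 = Counter(values_1), Counter(values_2)
    # Use the caller's category order if given, else all categories sorted.
    if order is not None:
        categories = order
    else:
        categories = sorted(set(counts_1) | set(counts_2))
    # A Counter returns 0 for categories it has never seen.
    return {cat: (counts_1[cat], counts_2[cat]) for cat in categories}


train = ["a", "b", "a", "c"]
test = ["a", "c", "c"]
pair_counts = count_pair(train, test)
print(pair_counts)
```

A category that appears in only one of the two datasets shows up with a zero on the other side, which is usually the first thing worth checking in a train/test comparison.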
- bluecast.eda.analyse.univariate_plots(df: pandas.DataFrame, col_requires_at_least_n_values: int = 5) → None
Plots univariate plots for all the columns in the dataframe. The target column does not need to be part of the provided DataFrame.
Expects numeric columns only. The number of bins is determined using the Freedman-Diaconis rule.
- Parameters:
df – DataFrame holding the features.
col_requires_at_least_n_values – Minimum number of unique values required to plot the feature. If a column has fewer unique values, it is skipped.
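The Freedman-Diaconis rule sets the bin width to 2 · IQR / n^(1/3) and derives the number of bins from the data range. A minimal standard-library sketch follows; the function name `freedman_diaconis_bins`, the quantile method, and the fallback to a single bin are assumptions made for illustration and may differ from what the library does.

```python
import math
import statistics


def freedman_diaconis_bins(data):
    """Suggest a histogram bin count using the Freedman-Diaconis rule.

    Illustrative sketch; the fallback to one bin is an assumption.
    """
    n = len(data)
    if n < 2:
        return 1
    # Quartiles of the sample; the interquartile range is q3 - q1.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 1
    bin_width = 2 * iqr / n ** (1 / 3)
    return max(1, math.ceil((max(data) - min(data)) / bin_width))


n_bins = freedman_diaconis_bins(list(range(1, 101)))
print(n_bins)
```

Because the width depends on the interquartile range rather than the standard deviation, a few extreme outliers have little effect on the width itself, although they still widen the range and therefore raise the bin count.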
- bluecast.eda.analyse.bi_variate_plots(df: pandas.DataFrame, target: str, num_cols_grid: int = 4) → None
Plots bivariate plots for all column combinations in the dataframe. The target column must be part of the provided DataFrame. The num_cols_grid parameter specifies how many columns the plot grid has.
Expects numeric columns only.
- bluecast.eda.analyse.correlation_heatmap(df: pandas.DataFrame) → None
Plots half of the heatmap showing correlations of all features.
Expects numeric columns only.
- bluecast.eda.analyse.correlation_to_target(df: pandas.DataFrame, target: str) → None
Plots correlations for all the columns in the dataframe in relation to the target column. The target column must be part of the provided DataFrame.
Expects numeric columns only.
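As a rough illustration of what a correlation-to-target ranking computes, the sketch below uses the Pearson coefficient on plain lists. That the library uses Pearson correlation is an assumption, since the source does not state the method; the names `pearson` and `correlations_to_target` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlations_to_target(columns, target):
    """Correlate every non-target column with the target, highest first."""
    target_values = columns[target]
    scores = {
        name: pearson(values, target_values)
        for name, values in columns.items()
        if name != target
    }
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


data = {"x1": [1, 2, 3, 4], "x2": [4, 3, 2, 1], "y": [2, 4, 6, 8]}
scores = correlations_to_target(data, "y")
print(scores)
```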
- bluecast.eda.analyse.plot_against_target_for_regression(df: pandas.DataFrame, num_columns: List[int | float | str], target_col: str) → None
Creates scatter plots for each column in num_columns against the target_col. Draws a regression line and shows the p-value for the regression line.
- Parameters:
df – The input DataFrame containing the data.
num_columns – List of column names to plot against the target column.
target_col – The target column name for regression.
- Returns:
None. The function displays the plots.
- bluecast.eda.analyse.plot_pca(df: pandas.DataFrame, target: str, scale_data: bool = True) → None
Plots PCA for the dataframe. The target column must be part of the provided DataFrame.
Expects numeric columns only.
- Parameters:
df – Pandas DataFrame. Must include the target column.
target – String indicating the target column.
scale_data – If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
- bluecast.eda.analyse.plot_pca_cumulative_variance(df: pandas.DataFrame, scale_data: bool = True, n_components: int = 10) → None
Plot the cumulative variance of principal components.
- Parameters:
df – Pandas DataFrame. Should not include the target variable.
scale_data – If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
n_components – Number of total components to compute.
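The quantity shown in a cumulative variance plot is a running sum of each component's share of the variance. The sketch below starts from already computed per-component variances (the PCA itself is omitted); the function name and the sample values are hypothetical.

```python
from itertools import accumulate


def cumulative_explained_variance(component_variances):
    """Running share of variance explained by the first k components.

    Assumes the list holds the variances of ALL components; if only the
    first few are supplied, the shares are relative to their sum instead.
    """
    total = sum(component_variances)
    return [running / total for running in accumulate(component_variances)]


variances = [4.0, 3.0, 2.0, 1.0]  # hypothetical, sorted in descending order
cumulative = cumulative_explained_variance(variances)
print(cumulative)
```

A common reading of the curve is to pick the smallest number of components at which it crosses a chosen share, for example 90%.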
- bluecast.eda.analyse.plot_pca_biplot(df: pandas.DataFrame, target: str, scale_data: bool = True) → None
Plots PCA biplot for the dataframe.
Expects numeric columns only.
- Arrow direction: Indicates how the corresponding variable is aligned with the principal component. Arrows that point in the same direction are positively correlated. Arrows pointing in the opposite direction are negatively correlated.
- Arrow length: Shows how much the variable contributes to the principal components. Longer arrows mean stronger contribution (the variable accounts for more explained variance). Shorter arrows mean weaker contribution (the variable accounts for less explained variance).
- Angle between arrows: 0° indicates a perfect positive correlation. 180° indicates a perfect negative correlation. 90° indicates no correlation.
- Parameters:
df – Pandas DataFrame.
target – String indicating the target column. Will be dropped if part of the DataFrame.
scale_data – If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
- bluecast.eda.analyse.plot_tsne(df: pandas.DataFrame, target: str, perplexity=50, random_state=42, scale_data: bool = True) → None
Plots t-SNE for the dataframe. The target column must be part of the provided DataFrame.
Expects numeric columns only.
- Parameters:
df – Pandas DataFrame. Must include the target column.
target – String indicating which column is the target column. Must be part of the provided DataFrame.
perplexity – The perplexity parameter for t-SNE.
random_state – The random state for t-SNE.
scale_data – If true, standard scaling will be performed before applying t-SNE, otherwise the raw data is used.
- bluecast.eda.analyse.conditional_entropy(x, y)
- bluecast.eda.analyse.theil_u(x, y)
- bluecast.eda.analyse.plot_theil_u_heatmap(data: pandas.DataFrame, columns: List[str | int | float])
Plot a heatmap for categorical data using Theil’s U.
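Theil's U (the uncertainty coefficient) measures how much of the entropy of one categorical variable is removed by knowing another: U(X|Y) = (H(X) − H(X|Y)) / H(X). The sketch below is an independent illustration on plain lists, not the code behind conditional_entropy or theil_u above; the names `cond_entropy` and `uncertainty_coefficient` are hypothetical.

```python
import math
from collections import Counter


def cond_entropy(x, y):
    """H(X | Y) in nats for two equally long sequences of categories."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    return -sum(
        (count / n) * math.log(count / y_counts[y_val])
        for (_, y_val), count in xy_counts.items()
    )


def uncertainty_coefficient(x, y):
    """U(X | Y): share of the entropy of X explained by Y, between 0 and 1."""
    n = len(x)
    h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # X is constant, so nothing is left to explain
    return (h_x - cond_entropy(x, y)) / h_x


fine = ["a", "b", "c", "d"]
coarse = ["u", "u", "v", "v"]
u_fine_given_coarse = uncertainty_coefficient(fine, coarse)
u_coarse_given_fine = uncertainty_coefficient(coarse, fine)
print(u_fine_given_coarse, u_coarse_given_fine)
```

Unlike a correlation coefficient, the measure is asymmetric: here the fine-grained column fully determines the coarse one, but the coarse column explains only half of the fine one's entropy. A heatmap of pairwise values is therefore generally not symmetric around the diagonal.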
- bluecast.eda.analyse.plot_null_percentage(dataframe: pandas.DataFrame) → None
- bluecast.eda.analyse.check_unique_values(df: pandas.DataFrame, columns: List[str | int | float], threshold: float = 0.9) → List[str | int | float]
Check whether the number of unique values in each column is close to the total number of rows (above the defined threshold).
- Parameters:
df – The pandas DataFrame to check
columns – A list of column names to check
threshold – The threshold to check against
- Returns:
A list of column names that have a high amount of unique values
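The check can be pictured as comparing the ratio of unique values to rows against the threshold. The following sketch works on a plain dict of column lists and is only an illustration; the name `high_cardinality_columns` is hypothetical, and whether the library compares with `>` or `>=` is an assumption.

```python
def high_cardinality_columns(table, columns, threshold=0.9):
    """Return the columns whose unique-value ratio exceeds the threshold.

    `table` maps column names to equally long lists of values.
    """
    n_rows = len(next(iter(table.values())))
    return [
        col for col in columns
        if len(set(table[col])) / n_rows > threshold
    ]


table = {
    "id": list(range(10)),      # 10 unique values in 10 rows -> ratio 1.0
    "group": ["a", "b"] * 5,    # 2 unique values in 10 rows -> ratio 0.2
}
flagged = high_cardinality_columns(table, ["id", "group"])
print(flagged)
```

Columns flagged this way are often identifiers, which carry little generalizable signal for a model.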
- bluecast.eda.analyse.plot_classification_target_distribution_within_categories(df: pandas.DataFrame, cat_columns: List[str], target_col: str) → None
Plot distribution of target across categorical features.
This is suitable for classification tasks only.
- Parameters:
df – Pandas DataFrame. Must include the target column.
cat_columns – List of categorical column names.
target_col – String indicating the target column name.
- Returns:
None
- bluecast.eda.analyse.mutual_info_to_target(df: pandas.DataFrame, target: str, class_problem: Literal['binary', 'multiclass', 'regression'], **mut_params) → None
Plots mutual information scores for all the categorical columns in the DataFrame in relation to the target column. The target column must be part of the provided DataFrame.
- Parameters:
df – DataFrame containing all columns including target column. Features are expected to be numerical.
target – String indicating which column is the target column.
class_problem – Any of [“binary”, “multiclass”, “regression”]
mut_params – Dictionary passing additional arguments into sklearn’s mutual_info_classif function.
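For intuition about what a mutual information score expresses, the sketch below computes the plug-in estimate I(X; Y) = Σ p(x, y) · log(p(x, y) / (p(x) · p(y))) for two discrete sequences. It is not how scikit-learn estimates the score (its estimators rely on nearest-neighbor methods for continuous features), and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two discrete sequences."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
    return sum(
        (count / n) * math.log(count * n / (x_counts[a] * y_counts[b]))
        for (a, b), count in xy_counts.items()
    )


target = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # identical to the target
uninformative = [0, 1, 0, 1]  # independent of the target
mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

A score of zero indicates independence in the sample, and larger values indicate that knowing the feature reduces uncertainty about the target, including for non-linear relationships that a correlation coefficient would miss.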
- bluecast.eda.analyse.plot_ecdf(df: pandas.DataFrame, columns: List[str | int | float], plot_all_at_once: bool = False) → None
Plot the empirical cumulative distribution function (ECDF) and histogram.
Matplotlib contains a direct implementation at version 3.8 and higher, but relying on it might cause dependency issues in environments with older Matplotlib versions.
- Parameters:
df – DataFrame containing all columns including target column. Features are expected to be numerical.
columns – A list of column names to plot.
plot_all_at_once – If True, plot all eCDFs in one plot. If False, plot each eCDF separately.
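An ECDF needs no binning: after sorting, the i-th smallest of n values is assigned the cumulative share i / n. A minimal sketch on a plain list, with a hypothetical function name, could look like this:

```python
def ecdf(values):
    """Return sorted values and their cumulative shares (step heights)."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (1-based) has i / n of the data at or below it.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


xs, ys = ecdf([3, 1, 2, 4])
print(xs, ys)
```

Plotting `ys` against `xs` as a step function gives the curve; reading the x-value where the curve crosses 0.5 gives the sample median.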
- bluecast.eda.analyse.plot_distribution_by_time(df: pandas.DataFrame, col_to_plot: str, date_col: str, xlabel: str = 'Week', ylabel: str = 'Feature distribution', title: str = 'Weekly distribution of the feature', freq: str = 'W') → None
Plot the distribution of a feature over time.
- Parameters:
df – Pandas DataFrame
col_to_plot – String indicating which column to plot
date_col – String indicating which column to use as date
xlabel – String indicating the x-axis label
ylabel – String indicating the y-axis label
title – String indicating the title of the plot
freq – Label indicating the frequency of the time grouping. Must be one of Pandas’ Offset aliases.
- Returns:
None
- bluecast.eda.analyse.plot_error_distributions(df: pandas.DataFrame, target: str, prediction_error: str, num_cols_grid: int = 1, max_x_elements: int = 5) → None
Plots bivariate plots for each column in the dataframe with respect to the target. Each subplot represents unique values of the target column. The ‘prediction_error’ is plotted using unique values of the target column as the hue. The num_cols_grid parameter specifies how many columns the plot grid has, and max_x_elements determines the maximum number of unique values on the x-axis per plot.
- bluecast.eda.analyse.plot_andrews_curve(df: pandas.DataFrame, target: str, n_samples: int | None = 200, random_state=500) → None
Plot Andrews curve.
Andrews Curve helps visualize if there are inherent groupings of the numerical features based on a given grouping.
- Parameters:
df – Pandas DataFrame
target – String indicating the target column
n_samples – Int indicating how many samples shall be shown. If None, the full DataFrame is taken.
random_state – Random seed determining the DataFrame sampling.
- Returns:
None
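An Andrews curve maps each row (x1, x2, x3, ...) to the function f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ... evaluated for t between −π and π, so rows with similar feature values produce similar curves. The sketch below evaluates that function for a single row; the function name and the sample row are hypothetical.

```python
import math


def andrews_curve(row, t):
    """Evaluate the Andrews curve of one numeric row at position t."""
    value = row[0] / math.sqrt(2)
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        # Odd positions use sine, even positions use cosine.
        if i % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value


row = [1.0, 2.0, 3.0]
at_zero = andrews_curve(row, 0.0)
at_half_pi = andrews_curve(row, math.pi / 2)
print(at_zero, at_half_pi)
```

Since earlier features are tied to lower frequencies, the column order affects how the curves look, and features on very different scales are usually standardized first.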
- bluecast.eda.analyse.plot_distribution_pairs(df1: pandas.DataFrame, df2: pandas.DataFrame, feature: str, palette: List[str] | None = None) → None
Compare distributions of two datasets for a given feature.
Only the central 95% of the data is considered for the histogram.
- Parameters:
df1 – DataFrame containing the feature.
df2 – Second DataFrame containing the feature for comparison.
feature – String indicating the feature name
palette – List of colors to use for the plots.
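Restricting a histogram to the central 95% of the data keeps extreme values from stretching the x-axis. One way to do this, sketched below, is to cut at the 2.5th and 97.5th percentiles; that the library uses exactly these cut-offs is an assumption, and the function name is hypothetical.

```python
import statistics


def central_95(values):
    """Keep the values between the 2.5th and 97.5th percentiles."""
    # n=40 yields cut points every 2.5%; the first and last bound the core.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    lower, upper = cuts[0], cuts[-1]
    return [v for v in values if lower <= v <= upper]


sample = list(range(1, 102))  # the integers 1 to 101
trimmed = central_95(sample)
print(len(trimmed), trimmed[0], trimmed[-1])
```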