:py:mod:`bluecast.eda.analyse`
==============================

.. py:module:: bluecast.eda.analyse


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   bluecast.eda.analyse._create_data_hash
   bluecast.eda.analyse._cached_plot_computation
   bluecast.eda.analyse._entropy_fallback
   bluecast.eda.analyse.find_bind_with_with_freedman_diaconis
   bluecast.eda.analyse.plot_pie_chart
   bluecast.eda.analyse.plot_count_pair
   bluecast.eda.analyse.plot_count_pairs
   bluecast.eda.analyse.univariate_plots
   bluecast.eda.analyse.bi_variate_plots
   bluecast.eda.analyse.correlation_heatmap
   bluecast.eda.analyse.correlation_to_target
   bluecast.eda.analyse.plot_against_target_for_regression
   bluecast.eda.analyse.plot_pca
   bluecast.eda.analyse.plot_pca_cumulative_variance
   bluecast.eda.analyse.plot_pca_biplot
   bluecast.eda.analyse.plot_tsne
   bluecast.eda.analyse.conditional_entropy
   bluecast.eda.analyse.theil_u
   bluecast.eda.analyse.plot_theil_u_heatmap
   bluecast.eda.analyse.plot_null_percentage
   bluecast.eda.analyse.check_unique_values
   bluecast.eda.analyse.plot_classification_target_distribution_within_categories
   bluecast.eda.analyse.mutual_info_to_target
   bluecast.eda.analyse.plot_ecdf
   bluecast.eda.analyse.plot_distribution_by_time
   bluecast.eda.analyse.plot_error_distributions
   bluecast.eda.analyse.plot_andrews_curve
   bluecast.eda.analyse.plot_distribution_pairs
   bluecast.eda.analyse.plot_benfords_law
   bluecast.eda.analyse._create_gradient_bar_chart
   bluecast.eda.analyse.plot_category_frequency
   bluecast.eda.analyse.plot_missing_values_matrix
   bluecast.eda.analyse._dashboard_update_plot
   bluecast.eda.analyse._dashboard_update_summary
   bluecast.eda.analyse._apply_pandas_query_filter
   bluecast.eda.analyse._create_outlier_detection_plot
   bluecast.eda.analyse._create_benford_plot
   bluecast.eda.analyse._create_category_frequency_plot
   bluecast.eda.analyse._create_violin_plot
   bluecast.eda.analyse._create_theil_u_plot
   bluecast.eda.analyse._create_ecdf_plot
   bluecast.eda.analyse._dashboard_update_regression_plot
   bluecast.eda.analyse._create_benford_plot_classification
   bluecast.eda.analyse._create_category_frequency_plot_classification
   bluecast.eda.analyse._dashboard_update_classification_plot
   bluecast.eda.analyse.create_eda_dashboard_regression
   bluecast.eda.analyse.create_eda_dashboard_classification
   bluecast.eda.analyse.create_eda_dashboard


Attributes
~~~~~~~~~~

.. autoapisummary::

   bluecast.eda.analyse.HAS_ISOLATION_FOREST
   bluecast.eda.analyse.HAS_SHAP
   bluecast.eda.analyse.HAS_PANDAS_QUERY
   bluecast.eda.analyse.HAS_WORDCLOUD
   bluecast.eda.analyse.HAS_SCIPY
   bluecast.eda.analyse.HAS_STATSMODELS
   bluecast.eda.analyse._plot_cache


.. py:data:: HAS_ISOLATION_FOREST
   :value: True

   
.. py:data:: HAS_SHAP
   :value: True

   
.. py:data:: HAS_PANDAS_QUERY
   :value: True

   
.. py:data:: HAS_WORDCLOUD
   :value: True

   
.. py:data:: HAS_SCIPY
   :value: True

   
.. py:data:: HAS_STATSMODELS
   :value: True

   
.. py:data:: _plot_cache
   :type: Dict[str, Any]

   
.. py:function:: _create_data_hash(df: pandas.DataFrame, *args) -> str

   Create a hash from DataFrame and additional arguments for caching.


.. py:function:: _cached_plot_computation(func)

   Decorator to cache expensive plot computations.
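
   The caching pattern described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: ``_cache_sketch``, ``make_key`` and ``cached`` are hypothetical names, and the key is built from the ``repr`` of the arguments for simplicity. Because the ``repr`` of a large DataFrame is truncated, a real implementation would need to hash the DataFrame contents instead.

   .. code-block:: python

      import functools
      import hashlib
      from typing import Any, Dict

      # Hypothetical module-level cache, analogous in spirit to a Dict[str, Any].
      _cache_sketch: Dict[str, Any] = {}


      def make_key(*args, **kwargs) -> str:
          """Hash the repr of all arguments into a hex digest (assumption)."""
          payload = repr((args, sorted(kwargs.items()))).encode("utf-8")
          return hashlib.sha256(payload).hexdigest()


      def cached(func):
          """Store each result under a key derived from the call arguments."""

          @functools.wraps(func)
          def wrapper(*args, **kwargs):
              key = f"{func.__name__}:{make_key(*args, **kwargs)}"
              if key not in _cache_sketch:
                  # Only compute when this exact call has not been seen before.
                  _cache_sketch[key] = func(*args, **kwargs)
              return _cache_sketch[key]

          return wrapper

   Repeated calls with identical arguments then return the stored result instead of recomputing it.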


.. py:function:: _entropy_fallback(p_x)

   Fallback implementation for entropy calculation when scipy is not available.
   Uses natural logarithm to match scipy.stats.entropy behavior.

   :param p_x: List of probabilities
   :return: Shannon entropy (using natural logarithm)
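
   A minimal sketch of a Shannon entropy computation with the natural logarithm is shown below. The name ``entropy_sketch`` is hypothetical, and the normalisation step is an assumption made here to mirror ``scipy.stats.entropy``, which normalises its input when it does not sum to 1.

   .. code-block:: python

      import math


      def entropy_sketch(p_x):
          """Shannon entropy in nats; zero probabilities contribute nothing."""
          total = sum(p_x)
          # Normalise so that the probabilities sum to 1 (assumption).
          probs = [p / total for p in p_x]
          return -sum(p * math.log(p) for p in probs if p > 0)

   For a fair coin, ``entropy_sketch([0.5, 0.5])`` equals ``math.log(2)``.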


.. py:function:: find_bind_with_with_freedman_diaconis(data: numpy.ndarray)


.. py:function:: plot_pie_chart(df: pandas.DataFrame, column: str, explode: Optional[List[float]] = None, colors: Optional[List[str]] = None, show: bool = True) -> plotly.graph_objects.Figure

   Create a pie chart with labels, sizes, and optional explosion.

   :param df: Pandas DataFrame holding the column of interest
   :param column: The column to be plotted
   :param explode: (Optional) List of numerical values (not used in the Plotly version)
   :param colors: (Optional) List with hexadecimal representations of colors in the RGB color model
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The pie chart figure


.. py:function:: plot_count_pair(df_1: pandas.DataFrame, df_2: pandas.DataFrame, df_aliases: Optional[List[str]], feature: str, order: Optional[List[str]] = None, palette: Optional[List[str]] = None, show: bool = True) -> plotly.graph_objects.Figure

   Compare the category counts of the chosen categorical column between two DataFrames.

   :param df_1: Pandas DataFrame, e.g. the train dataset
   :param df_2: Pandas DataFrame, e.g. the test dataset
   :param df_aliases: List with names of DataFrames that shall be shown on the count plots to represent them.
       Format: [df_1 representation, df_2 representation]
   :param feature: String indicating categorical column to plot
   :param order: List with category names to define the order they appear in the plot
   :param palette: List with hexadecimal representations of colors in the RGB color model
   :param show: Whether to display the plot

   :return: plotly.graph_objects.Figure: The count plot figure


.. py:function:: plot_count_pairs(df_1: pandas.DataFrame, df_2: pandas.DataFrame, cat_cols: List[str], df_aliases: Optional[List[str]] = None, palette: Optional[List[str]] = None) -> None

   Compare the category counts between two DataFrames for each categorical column in the provided list.

   :param df_1: Pandas DataFrame, e.g. the train dataset
   :param df_2: Pandas DataFrame, e.g. the test dataset
   :param df_aliases: List with names of DataFrames that shall be shown on the count plots to represent them.
       Format: [df_1 representation, df_2 representation]
   :param cat_cols: List with strings indicating categorical column names to plot
   :param palette: List with hexadecimal representations of colors in the RGB color model


.. py:function:: univariate_plots(df: pandas.DataFrame, col_requires_at_least_n_values: int = 5) -> None

   Plots univariate plots for all columns in the DataFrame. Expects numeric columns only.
   The target column does not need to be part of the provided DataFrame.

   The number of bins is determined using the Freedman-Diaconis rule.

   :param df: DataFrame holding the features.
   :param col_requires_at_least_n_values: Minimum number of unique values required to plot the feature.
       Columns with fewer unique values are skipped.
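
   The Freedman-Diaconis rule mentioned above sets the bin width to ``2 * IQR / n ** (1 / 3)``. The sketch below is a textbook version using only the standard library. The name ``freedman_diaconis_bins`` and the single-bin fallback for a zero interquartile range are assumptions, and the module's own helper may compute quartiles differently.

   .. code-block:: python

      import math
      import statistics


      def freedman_diaconis_bins(values):
          """Number of histogram bins suggested by the Freedman-Diaconis rule."""
          n = len(values)
          q1, _, q3 = statistics.quantiles(values, n=4)
          iqr = q3 - q1
          if iqr == 0:
              return 1  # degenerate spread: fall back to a single bin (assumption)
          width = 2 * iqr / n ** (1 / 3)
          return max(1, math.ceil((max(values) - min(values)) / width))

   Because the width depends on the interquartile range rather than the standard deviation, a few extreme values have little influence on the bin width.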


.. py:function:: bi_variate_plots(df: pandas.DataFrame, target: str, num_cols_grid: int = 4) -> None

   Plots bivariate plots for all column combinations in the DataFrame.
   The target column must be part of the provided DataFrame.

   Expects numeric columns only.

   :param num_cols_grid: Number of columns the plot grid shall have.


.. py:function:: correlation_heatmap(df: pandas.DataFrame, show: bool = True) -> plotly.graph_objects.Figure

   Plots half of the heatmap showing correlations of all features.

   Expects numeric columns only.

   :return: plotly.graph_objects.Figure: The correlation heatmap figure


.. py:function:: correlation_to_target(df: pandas.DataFrame, target: str, show: bool = True) -> plotly.graph_objects.Figure

   Plots correlations for all the columns in the dataframe in relation to the target column.
   The target column must be part of the provided DataFrame.

   Expects numeric columns only.

   :return: plotly.graph_objects.Figure: The correlation to target figure


.. py:function:: plot_against_target_for_regression(df: pandas.DataFrame, num_columns: List[Union[int, float, str]], target_col: str, show: bool = True) -> plotly.graph_objects.Figure

   Creates scatter plots for each column in num_columns against the target_col.
   Draws a regression line and shows statistical information.

   - If statsmodels is available, uses OLS regression and shows p-values.
   - If statsmodels is unavailable, uses numpy linear regression and shows correlation coefficients.

   :param df: The input DataFrame containing the data.
   :param num_columns: List of column names to plot against the target column.
   :param target_col: The target column name for regression.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The regression plots figure


.. py:function:: plot_pca(df: pandas.DataFrame, target: str, scale_data: bool = True, show: bool = True) -> plotly.graph_objects.Figure

   Plots PCA for the dataframe. The target column must be part of the provided DataFrame.

   Handles missing values by dropping rows with any NaN values before PCA.

   Expects numeric columns only.

   :param df: Pandas DataFrame. Should include the target variable.
   :param target: String indicating the target column.
   :param scale_data: If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The PCA plot figure


.. py:function:: plot_pca_cumulative_variance(df: pandas.DataFrame, scale_data: bool = True, n_components: int = 10, show: bool = True) -> plotly.graph_objects.Figure

   Plot the cumulative variance of principal components.

   Handles missing values by dropping rows with any NaN values before PCA.

   :param df: Pandas DataFrame. Should not include the target variable.
   :param scale_data: If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
   :param n_components: Number of total components to compute.
   :param show: Whether to display the plot

   :return: plotly.graph_objects.Figure: The PCA cumulative variance figure


.. py:function:: plot_pca_biplot(df: pandas.DataFrame, target: str, scale_data: bool = True, show: bool = True) -> plotly.graph_objects.Figure

   Plots PCA biplot for the dataframe.

   Handles missing values by dropping rows with any NaN values before PCA.

   Expects numeric columns only.

   :param df: Pandas DataFrame.
   :param target: String indicating the target column. Will be dropped if part of the DataFrame.
   :param scale_data: If true, standard scaling will be performed before applying PCA, otherwise the raw data is used.
   :param show: Whether to display the plot

   :return: plotly.graph_objects.Figure: The PCA biplot figure


.. py:function:: plot_tsne(df: pandas.DataFrame, target: str, perplexity=50, random_state=42, scale_data: bool = True, show: bool = True) -> plotly.graph_objects.Figure

   Plots t-SNE for the dataframe. The target column must be part of the provided DataFrame.

   Expects numeric columns only.

   :param df: Pandas DataFrame. Should include the target variable.
   :param target: String indicating which column is the target column. Must be part of the provided DataFrame.
   :param perplexity: The perplexity parameter for t-SNE
   :param random_state: The random state for t-SNE
   :param scale_data: If true, standard scaling will be performed before applying t-SNE, otherwise the raw data is used.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The t-SNE plot figure


.. py:function:: conditional_entropy(x, y)


.. py:function:: theil_u(x, y)


.. py:function:: plot_theil_u_heatmap(data: pandas.DataFrame, columns: List[Union[str, int, float]], show: bool = True) -> plotly.graph_objects.Figure

   Plot a heatmap for categorical data using Theil's U.

   :return: plotly.graph_objects.Figure: The Theil's U heatmap figure
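
   Theil's U, also called the uncertainty coefficient, measures how much knowing ``y`` reduces the uncertainty about ``x``: ``U(x|y) = (H(x) - H(x|y)) / H(x)``. The sketch below shows the textbook definition with the standard library only. The function names are hypothetical, and returning ``1.0`` for a constant ``x`` is an assumption rather than documented behaviour of this module.

   .. code-block:: python

      import math
      from collections import Counter


      def conditional_entropy_sketch(x, y):
          """H(x | y) in nats for two equally long sequences of labels."""
          n = len(x)
          y_counts = Counter(y)
          xy_counts = Counter(zip(x, y))
          # H(x|y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
          return -sum(
              (count / n) * math.log(count / y_counts[y_val])
              for (_, y_val), count in xy_counts.items()
          )


      def theil_u_sketch(x, y):
          """Share of the uncertainty in x that is explained by y (0 to 1)."""
          n = len(x)
          h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
          if h_x == 0:
              return 1.0  # x is constant, nothing is left to explain (assumption)
          return (h_x - conditional_entropy_sketch(x, y)) / h_x

   Unlike a correlation coefficient, the measure is asymmetric: ``theil_u_sketch(x, y)`` and ``theil_u_sketch(y, x)`` generally differ, so the resulting heatmap is not symmetric.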


.. py:function:: plot_null_percentage(dataframe: pandas.DataFrame, show: bool = True) -> plotly.graph_objects.Figure

   Plot the percentage of null values in each column.

   :return: plotly.graph_objects.Figure: The null percentage plot figure


.. py:function:: check_unique_values(df: pandas.DataFrame, columns: List[Union[str, int, float]], threshold: float = 0.9) -> List[Union[str, int, float]]

   Check whether the number of unique values in each column is close to the total number of rows,
   i.e. above the defined threshold.

   :param df: The pandas DataFrame to check
   :param columns: A list of column names to check
   :param threshold: The threshold to check against
   :returns: A list of column names that have a high number of unique values


.. py:function:: plot_classification_target_distribution_within_categories(df: pandas.DataFrame, cat_columns: List[str], target_col: str) -> None

   Plot distribution of target across categorical features.

   This is suitable for classification tasks only.

   :param df: Pandas DataFrame. Must include the target column.
   :param cat_columns: List of categorical column names.
   :param target_col: String indicating the target column name.


.. py:function:: mutual_info_to_target(df: pandas.DataFrame, target: str, class_problem: Literal['binary', 'multiclass', 'regression'], show: bool = True, **mut_params) -> plotly.graph_objects.Figure

   Plots mutual information scores for all the categorical columns in the DataFrame in relation to the target column.
   The target column must be part of the provided DataFrame.

   :param df: DataFrame containing all columns including target column. Features are expected to be numerical.
   :param target: String indicating which column is the target column.
   :param class_problem: Any of ["binary", "multiclass", "regression"]
   :param show: Whether to display the plot
   :param mut_params: Dictionary passing additional arguments into sklearn's mutual_info_classif function.
   :return: plotly.graph_objects.Figure: The mutual information plot figure


.. py:function:: plot_ecdf(df: pandas.DataFrame, columns: List[Union[str, int, float]], plot_all_at_once: bool = False, show: bool = True) -> Union[plotly.graph_objects.Figure, List[plotly.graph_objects.Figure]]

   Plot the empirical cumulative distribution function (ECDF) and histogram.

   :param df: DataFrame containing all columns including target column. Features are expected to be numerical.
   :param columns: A list of column names to check.
   :param plot_all_at_once: If True, plot all ECDFs in one plot. If False, plot each ECDF separately.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure or List[plotly.graph_objects.Figure]: The ECDF figure(s)
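
   The ECDF at a value ``x`` is the fraction of observations less than or equal to ``x``. A minimal sketch of the underlying computation, with the hypothetical name ``ecdf_sketch``, could look like this. It illustrates the concept only and says nothing about how this module builds its figures.

   .. code-block:: python

      def ecdf_sketch(values):
          """Return sorted values and the cumulative fraction at each of them."""
          xs = sorted(values)
          n = len(xs)
          # The i-th smallest value (1-based) has i / n of the data at or below it.
          return xs, [(i + 1) / n for i in range(n)]

   Plotting the second list against the first as a step line gives the ECDF curve. Ties appear as repeated x values.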


.. py:function:: plot_distribution_by_time(df: pandas.DataFrame, col_to_plot: str, date_col: str, xlabel: str = 'Week', ylabel: str = 'Feature distribution', title: str = 'Weekly distribution of the feature', freq: str = 'W', show: bool = True) -> plotly.graph_objects.Figure

   Plot the distribution of a feature over time.

   :param df: Pandas DataFrame
   :param col_to_plot: String indicating which column to plot
   :param date_col: String indicating which column to use as date
   :param xlabel: String indicating the x-axis label
   :param ylabel: String indicating the y-axis label
   :param title: String indicating the title of the plot
   :param freq: Label indicating the frequency of the time grouping. Must be one of Pandas' Offset aliases.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The time distribution figure


.. py:function:: plot_error_distributions(df: pandas.DataFrame, target: str, prediction_error: str, num_cols_grid: int = 1, max_x_elements: int = 5) -> None

   Plots bivariate plots for each column in the DataFrame with respect to the target.
   Each subplot represents unique values of the target column.
   The 'prediction_error' is plotted using unique values of the target column as the hue.

   :param num_cols_grid: Number of columns the plot grid shall have.
   :param max_x_elements: Maximum number of unique values on the x-axis per plot.


.. py:function:: plot_andrews_curve(df: pandas.DataFrame, target: str, n_samples: Optional[int] = 200, random_state=500, show: bool = True) -> plotly.graph_objects.Figure

   Plot Andrews curve.

   Andrews curves help visualize whether the numerical features form inherent groupings with respect to a given grouping column.

   :param df: Pandas DataFrame
   :param target: String indicating the target column
   :param n_samples: Int indicating how many samples shall be shown. If None, the full DataFrame is taken.
   :param random_state: Random seed determining the DataFrame sampling.
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The Andrews curve figure
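
   An Andrews curve maps one observation ``(x1, x2, x3, ...)`` to the function ``f(t) = x1 / sqrt(2) + x2 * sin(t) + x3 * cos(t) + x4 * sin(2t) + x5 * cos(2t) + ...`` for ``t`` between ``-pi`` and ``pi``. The sketch below evaluates that standard definition for a single row. ``andrews_curve_sketch`` is a hypothetical name, and any scaling or sampling this module applies before plotting is not covered.

   .. code-block:: python

      import math


      def andrews_curve_sketch(row, t):
          """Evaluate the Andrews function of one observation at angle t."""
          total = row[0] / math.sqrt(2)
          for i, x in enumerate(row[1:], start=1):
              harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
              trig = math.sin if i % 2 == 1 else math.cos
              total += x * trig(harmonic * t)
          return total

   Observations with similar feature values produce curves that stay close together, which is why curves coloured by the target can reveal groupings.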


.. py:function:: plot_distribution_pairs(df1: pandas.DataFrame, df2: pandas.DataFrame, feature: str, palette: Optional[List[str]] = None, show: bool = True) -> plotly.graph_objects.Figure

   Compare distributions of two datasets for a given feature.

   Only the central 95% of the data is considered for the histogram.

   :param df1: DataFrame containing the feature.
   :param df2: Second DataFrame containing the feature for comparison.
   :param feature: String indicating the feature name
   :param palette: List of colors to use for the plots.
   :param show: Whether to display the plot

   :return: plotly.graph_objects.Figure: The distribution comparison figure
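
   One way to restrict data to its central 95% is to keep only values between the 2.5th and 97.5th percentiles. The sketch below does this with the standard library. The name ``central_95``, the inclusive bounds, and the percentile method are assumptions, and the module may determine the range differently.

   .. code-block:: python

      import statistics


      def central_95(values):
          """Keep values between the 2.5th and 97.5th percentiles (inclusive)."""
          # Splitting into 40 equal parts yields cut points every 2.5 percent.
          cuts = statistics.quantiles(values, n=40, method="inclusive")
          lower, upper = cuts[0], cuts[-1]
          return [v for v in values if lower <= v <= upper]

   Trimming both tails keeps a few extreme values from stretching the histogram axis and hiding the bulk of the distribution.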


.. py:function:: plot_benfords_law(df: pandas.DataFrame, column: str, show: bool = True) -> plotly.graph_objects.Figure

   Plot Benford's Law analysis for a numerical column.

   Benford's Law states that in many naturally occurring datasets,
   the leading digit d (d ∈ {1, 2, ..., 9}) occurs with probability:
   P(d) = log10(1 + 1/d)

   This is useful for fraud detection and data quality analysis.

   :param df: DataFrame containing the data
   :param column: Name of the numerical column to analyze
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The Benford's Law figure
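
   The formula above can be checked with a short sketch that computes the expected probabilities and the observed leading-digit shares of a list of numbers. All function names are hypothetical, and skipping zero, NaN and infinite values is an assumption made for this illustration.

   .. code-block:: python

      import math
      from collections import Counter


      def benford_expected():
          """Expected share of each leading digit 1..9 under Benford's Law."""
          return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


      def leading_digit(value):
          """First non-zero digit of a number, or None for zero, NaN, infinity."""
          value = abs(value)
          if value == 0 or not math.isfinite(value):
              return None
          # Scientific notation starts with the leading digit, e.g. '3.14...e+02'.
          return int(f"{value:.15e}"[0])


      def observed_shares(values):
          """Observed share of each leading digit among the usable values."""
          digits = [d for d in map(leading_digit, values) if d is not None]
          if not digits:
              return {d: 0.0 for d in range(1, 10)}
          counts = Counter(digits)
          return {d: counts[d] / len(digits) for d in range(1, 10)}

   Comparing ``observed_shares(data)`` with ``benford_expected()`` digit by digit shows where a column deviates from the expected distribution.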


.. py:function:: _create_gradient_bar_chart(value_counts: pandas.Series, column: str) -> plotly.graph_objects.Figure

   Create a gradient bar chart for category frequencies.

   :param value_counts: Series with category counts
   :param column: Column name for labeling
   :return: plotly.graph_objects.Figure with gradient bars


.. py:function:: plot_category_frequency(df: pandas.DataFrame, column: str, max_categories: int = 20, show: bool = True) -> plotly.graph_objects.Figure

   Create a category frequency visualization for categorical/text data.

   Uses gradient colors. Falls back from a word cloud to a gradient bar chart
   when the WordCloud library is unavailable.

   :param df: DataFrame containing the data
   :param column: Name of the categorical/text column
   :param max_categories: Maximum number of categories to display
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The category frequency figure


.. py:function:: plot_missing_values_matrix(df: pandas.DataFrame, show: bool = True) -> plotly.graph_objects.Figure

   Create a missing values matrix visualization.

   :param df: DataFrame to analyze
   :param show: Whether to display the plot
   :return: plotly.graph_objects.Figure: The missing values matrix figure


.. py:function:: _dashboard_update_plot(plot_type: str, selected_feature: str, df: pandas.DataFrame, numeric_cols: List[str], target_col: str)

   Helper function for dashboard plot updates.

   :param plot_type: Type of plot to create
   :param selected_feature: Selected feature for the plot
   :param df: DataFrame containing the data
   :param numeric_cols: List of numeric column names
   :param target_col: Target column name
   :return: Plotly figure object


.. py:function:: _dashboard_update_summary(selected_feature: str, df: pandas.DataFrame, numeric_cols: List[str], target_col: Optional[str] = None)

   Helper function for dashboard summary updates with dark theme styling.
   Shows statistics for both selected feature and target column.

   :param selected_feature: Selected feature for the summary
   :param df: DataFrame containing the data
   :param numeric_cols: List of numeric column names
   :param target_col: Target column name (optional)
   :return: HTML div with tables or string message


.. py:function:: _apply_pandas_query_filter(df: pandas.DataFrame, query_text: str) -> pandas.DataFrame

   Apply SQL-like filtering using pandas query syntax and operations.

   :param df: DataFrame to filter
   :param query_text: Query text (supports pandas query syntax or simple SQL-like syntax)
   :return: Filtered DataFrame


.. py:function:: _create_outlier_detection_plot(df: pandas.DataFrame, target_col: str, dark_theme_layout: dict, contamination: float = 0.1) -> plotly.graph_objects.Figure

   Create IsolationForest outlier detection plot showing outlier scores and top outliers.

   :param df: DataFrame containing the data
   :param target_col: Target column name
   :param dark_theme_layout: Dark theme layout configuration
   :param contamination: Expected proportion of outliers
   :return: Plotly figure


.. py:function:: _create_benford_plot(selected_feature_x: str, df: pandas.DataFrame, numeric_cols: List[str], dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create Benford's Law analysis plot for regression dashboard.


.. py:function:: _create_category_frequency_plot(selected_feature_x: str, df: pandas.DataFrame, dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create category frequency plot for regression dashboard.


.. py:function:: _create_violin_plot(selected_feature_x: str, df: pandas.DataFrame, target_col: str, dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create violin plot by target bins for regression dashboard.


.. py:function:: _create_theil_u_plot(df: pandas.DataFrame, target_col: str, dark_theme_layout: dict, is_regression: bool = True) -> plotly.graph_objects.Figure

   Create Theil U heatmap for categorical features including the target.


.. py:function:: _create_ecdf_plot(selected_feature_x: str, df: pandas.DataFrame, numeric_cols: List[str], dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create ECDF analysis plot for dashboard.


.. py:function:: _dashboard_update_regression_plot(plot_type: str, selected_feature_x: str, selected_feature_y: str, df: pandas.DataFrame, numeric_cols: List[str], target_col: str)

   Helper function for regression dashboard plot updates with dark theme styling.


.. py:function:: _create_benford_plot_classification(selected_feature_x: str, df: pandas.DataFrame, numeric_cols: List[str], dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create Benford's Law analysis plot for classification dashboard.


.. py:function:: _create_category_frequency_plot_classification(selected_feature_x: str, df: pandas.DataFrame, categorical_cols: List[str], dark_theme_layout: dict) -> plotly.graph_objects.Figure

   Create category frequency plot for classification dashboard.


.. py:function:: _dashboard_update_classification_plot(plot_type: str, selected_feature_x: str, selected_feature_y: str, df: pandas.DataFrame, numeric_cols: List[str], categorical_cols: List[str], target_col: str)

   Helper function for classification dashboard plot updates with dark theme styling.


.. py:function:: create_eda_dashboard_regression(df: pandas.DataFrame, target_col: str, port: int = 8050, run_server: bool = True, jupyter_mode: Optional[str] = None)

   Create a Dash dashboard for regression analysis with enhanced features.

   :param df: DataFrame to analyze
   :param target_col: Target column name (should be numeric for regression)
   :param port: Port number for the dashboard
   :param run_server: Whether to start the server (set to False for testing)
   :param jupyter_mode: Mode for Jupyter environments ("inline", "external", "tab", "jupyterlab")
                       If None, runs as regular server. For Kaggle/Colab use "external"


.. py:function:: create_eda_dashboard_classification(df: pandas.DataFrame, target_col: str, port: int = 8050, run_server: bool = True, jupyter_mode: Optional[str] = None)

   Create a Dash dashboard for classification analysis with enhanced features.

   :param df: DataFrame to analyze
   :param target_col: Target column name (should be categorical for classification)
   :param port: Port number for the dashboard
   :param run_server: Whether to start the server (set to False for testing)
   :param jupyter_mode: Mode for Jupyter environments ("inline", "external", "tab", "jupyterlab")
                       If None, runs as regular server. For Kaggle/Colab use "external"


.. py:function:: create_eda_dashboard(df: pandas.DataFrame, target_col: str, port: int = 8050, run_server: bool = True, jupyter_mode: Optional[str] = None)

   Create a Dash dashboard for exploratory data analysis.

   :param df: DataFrame to analyze
   :param target_col: Target column name
   :param port: Port number for the dashboard
   :param run_server: Whether to start the server (set to False for testing)
   :param jupyter_mode: Mode for Jupyter environments ("inline", "external", "tab", "jupyterlab")
                       If None, runs as regular server. For Kaggle/Colab use "external"