# Explanatory analysis

BlueCast offers a simple way to get a first overview of the data.
Instead of writing many lines of plotting code, you can call a
handful of functions and focus on the data rather than on
implementing visualizations.

<!-- toc -->

* [Explanatory analysis](#explanatory-analysis)
  * [Feature type detection](#feature-type-detection)
  * [Pie chart](#pie-chart)
  * [Nulls per column](#nulls-per-column)
  * [Univariate plots](#univariate-plots)
  * [Empirical cumulative density function (eCDF)](#empirical-cumulative-density-function-ecdf)
  * [Bivariate plots](#bivariate-plots)
  * [Count pairs](#count-pairs)
  * [Plot distribution pairs](#plot-distribution-pairs)
  * [Classification target distribution in categorical features](#classification-target-distribution-in-categorical-features)
  * [Correlation to the target](#correlation-to-the-target)
  * [Correlation to target via scatterplots](#correlation-to-target-via-scatterplots)
  * [Correlation heatmap](#correlation-heatmap)
  * [Andrew Curve](#andrew-curve)
  * [Association of categorical features](#association-of-categorical-features)
  * [Mutual information](#mutual-information)
  * [Principal components analysis (PCA)](#principal-components-analysis-pca)
  * [PCA Biplot](#pca-biplot)
  * [PCA cumulative variance](#pca-cumulative-variance)
  * [t-SNE](#t-sne)
  * [Target leakage](#target-leakage)
  * [Feature distribution over time](#feature-distribution-over-time)

<!-- tocstop -->

## Feature type detection

Many datasets still arrive as CSV files or carry the `object` dtype
after an import from Excel. As the number of columns grows, studying
the features and casting them appropriately takes a lot of time.
BlueCast offers a `FeatureTypeDetector` to automate this task.
The same detector is also used inside the BlueCast ML pipelines,
so users can check the results of this operation outside of the
pipeline.

```python
from bluecast.eda.analyse import (
    plot_andrews_curve,
    bi_variate_plots,
    univariate_plots,
    plot_classification_target_distribution_within_categories,
    plot_count_pairs,
    plot_distribution_by_time,
    correlation_heatmap,
    correlation_to_target,
    plot_ecdf,
    plot_pca,
    plot_pca_biplot,
    plot_pca_cumulative_variance,
    plot_theil_u_heatmap,
    plot_tsne,
    check_unique_values,
    plot_null_percentage,
    mutual_info_to_target,
    plot_pie_chart,
)

from bluecast.preprocessing.feature_types import FeatureTypeDetector

# train_data is assumed to be a pandas DataFrame that was loaded beforehand
# Here we automatically detect the numeric columns
feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train_data)

# detect columns with a very high share of unique values
many_unique_cols = check_unique_values(train_data, train_data.columns.to_list())
```
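
To illustrate the kind of check such a detector automates, the
following minimal sketch tests whether a column of strings could be
cast to numbers. It is plain Python and not BlueCast's implementation;
the helper `looks_numeric` and the sample columns are hypothetical.

```python
def looks_numeric(values):
    """Return True if every non-missing value parses as a float."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return False  # nothing to judge the column by
    for value in present:
        try:
            float(value)
        except (TypeError, ValueError):
            return False
    return True


# Hypothetical columns as they might arrive from a CSV import
columns = {
    "price": ["10.5", "3", "", "7.25"],
    "city": ["Berlin", "Paris", "Rome", "Oslo"],
}

numeric_like = [name for name, values in columns.items() if looks_numeric(values)]
print(numeric_like)
```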

## Pie chart

Pie charts can be a viable option for showing the distribution of
target classes or categories. Our implementation is designed to be
both visually appealing and insightful:

```python
# plot the share of each category as a pie chart
plot_pie_chart(
    synthetic_train_test_data[0],
    "categorical_feature_1",
)
```

![Pie example](pie_chart.png)

## Nulls per column

Even though tree-based models like XGBoost can handle missing values
out of the box, it is still relevant to observe the distribution of
missing values. Here we offer a bar chart showing the percentage of
missing values per column.

```python
# plot the percentage of Nulls for all features
plot_null_percentage(
    train_data.loc[:, feat_type_detector.num_columns],
)
```

![NULLs example](plot_nulls.png)

## Univariate plots

The univariate plots function loops through the data and shows
a histogram and a boxplot for each numerical column.

```python
# show univariate plots
univariate_plots(
    train_data.loc[:, feat_type_detector.num_columns],  # here the target column EC1 is already included
)
```

![Univariate example](univariate_plots.png)

## Empirical cumulative density function (eCDF)

In some cases histograms might be misleading or boxplots cannot
show the distribution conveniently due to outliers. In such cases
eCDFs provide an alternative way to understand univariate
distributions. The plots can either be split by column or can
all be combined into one chart.

```python
# show eCDF plots for all numerical columns in one chart
plot_ecdf(
    train_data, feat_type_detector.num_columns, plot_all_at_once=True
)
```

![ECDF example](ecdf.png)
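
As a rough illustration of what an eCDF contains, the sketch below
computes one by hand in plain Python. It is not BlueCast's code; the
function `ecdf` and the sample values are made up for this example.

```python
def ecdf(values):
    """Return the sorted values and the cumulative share at each sorted position."""
    ordered = sorted(values)
    n = len(ordered)
    shares = [(i + 1) / n for i in range(n)]
    return ordered, shares


# Hypothetical sample with one strong outlier
sample = [3.0, 1.0, 2.0, 100.0]
xs, ys = ecdf(sample)

for x, y in zip(xs, ys):
    print(f"{x:>6.1f} -> {y:.2f}")
```

The outlier only occupies the last step of the curve, which is why the
shape of the remaining distribution stays readable.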

## Bivariate plots

Bivariate plots are useful to understand how features differ
with regard to a discrete target (either classes or bins of
a continuous target).

```python
# show bi-variate plots
bi_variate_plots(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
)
```

![Bivariate example](bivariate_plots.png)

## Count pairs

Count pairs are intended to compare the distribution of categories
between two datasets. This can be useful to check if an evaluation
dataset is representative or if data drift occurs.

```python
# show count pairs for all categorical columns
plot_count_pairs(
    train,
    test,
    cat_cols=feat_type_detector.cat_columns,  # list of column names
)
```

![Count pairs example](pair_countplot.png)
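
The comparison behind such a plot can be sketched in a few lines of
plain Python: compute the share of each category in both datasets and
look at the differences. The helper `category_shares` and the two toy
columns are hypothetical and only illustrate the idea.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: count / total for cat, count in counts.items()}


# Hypothetical categorical column in a train and a test set
train_colors = ["red", "red", "blue", "green"]
test_colors = ["red", "blue", "blue", "blue"]

train_shares = category_shares(train_colors)
test_shares = category_shares(test_colors)

# Categories missing from one dataset count as a share of 0.0
for cat in sorted(set(train_shares) | set(test_shares)):
    train_share = train_shares.get(cat, 0.0)
    test_share = test_shares.get(cat, 0.0)
    print(f"{cat}: train={train_share:.2f} test={test_share:.2f} diff={test_share - train_share:+.2f}")
```

Large differences, or categories that appear in only one dataset, hint
at a non-representative split or at data drift.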

## Plot distribution pairs

To compare the distribution of numerical features between two
datasets we can use the `plot_distribution_pairs` function.

```python
from bluecast.eda.analyse import plot_distribution_pairs

plot_distribution_pairs(
    train,
    test,
    feature="numerical_feature",
)
```

![Plot distribution pairs example](plot_distribution_pairs.png)

## Classification target distribution in categorical features

We might also want to see how target classes are distributed
within categorical features. This can be plotted with:

```python
from bluecast.eda.analyse import plot_classification_target_distribution_within_categories

plot_classification_target_distribution_within_categories(
    train,
    cat_columns=feat_type_detector.cat_columns,  # list of column names
    target_col="target",
)
```

![Target class distro example](class_target_distribution.png)

## Correlation to the target

For feature selection it can be useful to understand how much
each feature explains the target variable. `correlation_to_target`
uses Pearson's r, which captures the linear signal between each
feature and the target.

```python
# show correlation to target
correlation_to_target(train_data.loc[:, feat_type_detector.num_columns], "target")
```

![Corr to target example](correlation_to_target.png)
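
To make the "linear signal" limitation concrete, here is a small
stand-alone sketch of Pearson's r in plain Python. The function
`pearson_r` and the toy data are illustrative assumptions and not part
of BlueCast.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


feature = [-2, -1, 0, 1, 2]
linear_target = [2 * x + 1 for x in feature]   # perfectly linear in the feature
quadratic_target = [x ** 2 for x in feature]   # fully determined, but not linear

r_linear = pearson_r(feature, linear_target)
r_quadratic = pearson_r(feature, quadratic_target)
print(r_linear, r_quadratic)
```

The quadratic target depends entirely on the feature, yet its
correlation is zero because the relationship is symmetric rather than
linear.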

## Correlation to target via scatterplots

For regression tasks we can also use scatterplots to investigate
the relationships of numerical columns to the target variable.

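```python
# assumption: this helper lives alongside the other EDA functions
from bluecast.eda.analyse import plot_against_target_for_regression

# show scatterplots of numerical columns against the target
plot_against_target_for_regression(
    train_data,
    feat_type_detector.num_columns,
    "target",
)
```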

![Corr to target via scatterplots example](scatterplots_against_target.png)

## Correlation heatmap

The correlation heatmap shows the linear relationships between
features and reveals multicollinearity if present.

```python
# show correlation heatmap
correlation_heatmap(train_data.loc[:, feat_type_detector.num_columns])
```

![Corr heatmap example](correlation_heatmap.png)

## Andrew Curve

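Andrews curves represent each sample as a curve whose shape is
determined by its feature values. The mapping retains the relative
distances between samples and keeps the variance similar, so samples
that are alike produce curves that run close together. Coloring the
curves by target shows how similar the samples of each class are.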

```python
from bluecast.eda.analyse import plot_andrews_curve

plot_andrews_curve(
  train_data.loc[:, feat_type_detector.num_columns],
  "target",
  n_samples=20,
  random_state=20
)
```

![Andrew curve example](andrew_curve.png)
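
For reference, the textbook definition of an Andrews curve is a finite
Fourier series whose coefficients are the feature values of a sample.
The sketch below evaluates that formula in plain Python; the function
`andrews_curve` and the sample are hypothetical, and BlueCast's plot
may scale or sample the data differently.

```python
import math


def andrews_curve(sample, t):
    """Evaluate x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    value = sample[0] / math.sqrt(2)
    for i, x in enumerate(sample[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value


# Evaluate one hypothetical sample on a coarse grid over [-pi, pi]
grid = [-math.pi + k * math.pi / 2 for k in range(5)]
sample_a = [1.0, 2.0, 0.5]
curve_a = [andrews_curve(sample_a, t) for t in grid]
print([round(v, 3) for v in curve_a])
```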

## Association of categorical features

The correlation heatmap requires numerical features. For categories
we make use of Theil's U to build an association heatmap.

```python
# show a heatmap of associations between categorical variables
theil_matrix = plot_theil_u_heatmap(train_data, feat_type_detector.cat_columns)
```

![Theil U example](theil_u_matrix.png)
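
Theil's U (the uncertainty coefficient) measures how much of the
uncertainty in one categorical variable is removed by knowing another.
Unlike correlation it is asymmetric. The following plain-Python sketch
shows the standard definition; the helper names and the toy data are
assumptions for illustration and not BlueCast's implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """Conditional entropy H(X | Y) in nats."""
    n = len(xs)
    y_counts = Counter(ys)
    xy_counts = Counter(zip(xs, ys))
    return -sum(
        (c / n) * math.log(c / y_counts[y]) for (_, y), c in xy_counts.items()
    )


def theils_u(xs, ys):
    """U(X | Y): share of the uncertainty in X that is explained by Y."""
    h_x = entropy(xs)
    if h_x == 0:
        return 1.0  # X is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(xs, ys)) / h_x


city = ["Berlin", "Munich", "Paris", "Lyon"]
country = ["DE", "DE", "FR", "FR"]

u_country_given_city = theils_u(country, city)  # city fully determines country
u_city_given_country = theils_u(city, country)  # country only narrows down the city
print(u_country_given_city, u_city_given_country)
```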

## Mutual information

To capture nonlinear information we can use the mutual information
score. The function has a `class_problem` parameter that indicates
whether the score is calculated for classification (`binary` or
`multiclass`) or `regression`.

```python
# show mutual information of features to target
# features are expected to be in numerical format
# class problem can be any of "binary", "multiclass" or "regression"
extra_params = {"random_state": 30}
mutual_info_to_target(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
    class_problem="binary",
    **extra_params,
)
```

![MI example](mutual_information.png)
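
The sketch below computes mutual information for two discrete
variables in plain Python to show that it also picks up non-linear
dependence. It is a simplified illustration under that discreteness
assumption; estimators for continuous features work differently, and
the function `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    xy_counts = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (x_counts[x] * y_counts[y]))
        for (x, y), c in xy_counts.items()
    )


# y = x ** 2: fully dependent on x, although not linearly
feature = [-2, -1, 0, 1, 2]
target = [x ** 2 for x in feature]
mi_dependent = mutual_information(feature, target)

# Every combination occurs exactly once: the variables are independent
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(round(mi_dependent, 4), mi_independent)
```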

## Principal components analysis (PCA)

What does our feature space look like if we condense the data into
a two-dimensional linear space? Can classes be easily separated?
The `plot_pca` function shows exactly that.

```python
# show feature space after principal component analysis
plot_pca(
    train_data.loc[:, feat_type_detector.num_columns],
    "target",
)
```

![PCA example](plot_pca.png)

## PCA Biplot

We might be interested to see which feature contributes to which
principal component and by how much. For this purpose the
`plot_pca_biplot` function can be used:

```python
from bluecast.eda.analyse import plot_pca_biplot

plot_pca_biplot(
  train_data.loc[:, feat_type_detector.num_columns],
  "target"
)
```

![PCA Biplot example](pca_biplot.png)

## PCA cumulative variance

Sometimes we want to know how many principal components we would
need to capture a certain percentage of the dataset's variance.
This can be plotted via:

```python
# show how many components are needed to explain certain variance
plot_pca_cumulative_variance(
    train_data.loc[:, feat_type_detector.num_columns],
    "target",
)
```

![PCA cumulative example](plot_cumulative_pca_variance.png)
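
Reading such a chart boils down to accumulating the explained variance
ratios until a threshold is reached. A minimal sketch in plain Python,
assuming the ratios are already known; the values and the helper
`components_needed` are made up for illustration.

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, threshold):
    """Return the smallest number of components whose cumulative ratio reaches the threshold."""
    for n_components, cumulative in enumerate(
        accumulate(explained_variance_ratio), start=1
    ):
        if cumulative >= threshold:
            return n_components
    return None  # threshold is not reachable with the given components


# Hypothetical explained variance ratios, sorted from largest to smallest
ratios = [0.5, 0.25, 0.125, 0.125]
cumulative = list(accumulate(ratios))

print(cumulative)
print(components_needed(ratios, 0.8))
```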

## t-SNE

While PCA captures linear signals, t-SNE also captures non-linear
information. The `perplexity` parameter needs to be tuned, though.
Be aware that this plot can be very slow to compute, depending on
the data and `perplexity`.

```python
# show feature space after t-SNE
plot_tsne(
    train_data.loc[:, feat_type_detector.num_columns],
    "target",
    perplexity=30,
    random_state=0,
)
```

![TSNE example](t_sne_plot.png)

## Target leakage

With big data and complex pipelines, target leakage can easily sneak in.
To detect leakage, BlueCast offers two functions:

```python
from bluecast.eda.data_leakage_checks import (
    detect_categorical_leakage,
    detect_leakage_via_correlation,
)

# Detect leakage of numeric columns based on correlation
numeric_leakage = detect_leakage_via_correlation(
    train_data.loc[:, feat_type_detector.num_columns], "target", threshold=0.9
)

# Detect leakage of categorical columns based on Theil's U
categorical_leakage = detect_categorical_leakage(
    train_data.loc[:, feat_type_detector.cat_columns], "target", threshold=0.9
)
```

## Feature distribution over time

When timestamps are present, we often want to understand how a
feature's distribution behaves over time. Does it change? Is there a
trend? For this BlueCast offers the `plot_distribution_by_time` function.

```python
from bluecast.eda.analyse import plot_distribution_by_time

plot_distribution_by_time(train_data, "num_column", "created_at")
```

![Distribution over time example](distribution_over_time.png)
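
A rough numerical counterpart to this plot is to group the feature by
time period and compare a summary statistic per period. The sketch
below does this per month in plain Python; it only illustrates the
question being asked and makes no claim about how
`plot_distribution_by_time` works internally. The helper
`monthly_means` and the data are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(dates, values):
    """Return the mean value per (year, month), ordered chronologically."""
    buckets = defaultdict(list)
    for day, value in zip(dates, values):
        buckets[(day.year, day.month)].append(value)
    return {month: mean(vals) for month, vals in sorted(buckets.items())}


# Hypothetical timestamps and feature values
created_at = [
    date(2023, 1, 5),
    date(2023, 1, 20),
    date(2023, 2, 3),
    date(2023, 2, 25),
    date(2023, 3, 10),
]
num_column = [10.0, 12.0, 15.0, 17.0, 21.0]

trend = monthly_means(created_at, num_column)
print(trend)
```

A steadily rising or falling mean across periods is a first hint at a
trend that deserves a closer look in the plot.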