Is Your Training Data Representative? A Guide to Checking with PSI in Python

September 11, 2025


To get the most out of this tutorial, you should have a solid understanding of how to compare two distributions. If you don't, I recommend checking out this excellent article by @matteo-courthoud.

We automate the analysis and export the results to an Excel file using Python. If you already know the basics of Python and how to write to Excel, that will make things even easier.

I would like to thank everyone who took the time to read and engage with my article. Your support and feedback mean a lot.

Whether in an academic or professional setting, the question of data representativeness between two samples arises frequently.

By representativeness, we mean the degree to which two samples resemble each other or share the same characteristics. This concept is essential, as it directly determines the accuracy of statistical conclusions and the performance of a predictive model.

At each stage of a model's life cycle, the question of data representativeness takes specific forms:

  • During the construction phase: this is where it all begins. You gather the data, clean it, split it into training, test, and out-of-time samples, estimate the parameters, and carefully document every decision. You make sure that the test and out-of-time samples are representative of the training data.
  • During the application phase: once the model is built, it must be confronted with reality. And here a crucial question arises: do the new datasets really resemble those used during construction? If not, much of the earlier work may quickly lose its value.
  • During the monitoring phase, or backtesting: over time, populations evolve. The model must therefore be regularly challenged. Do its predictions remain valid? Is the representativeness of the target portfolio still ensured?

Representativeness is therefore not a one-off constraint, but a challenge that accompanies the model throughout its life cycle.

To answer the question of representativeness between two samples, the most common approach is to compare their distributions, proportions, and structures. This involves visual tools such as density functions, histograms, and boxplots, supplemented by statistical tests such as Student's t-test, the Kruskal-Wallis test, the Wilcoxon test, or the Kolmogorov-Smirnov test. On this subject, @matteo-courthoud has published an excellent article, complete with practical code, to which we refer the reader for further details.

In this article, we will focus on two practical tools commonly used in credit risk management to check whether two datasets are comparable:

  • The Population Stability Index (PSI) shows how much a distribution shifts, either over time or between two samples.
  • Cramér's V measures the strength of association between categories, helping us see whether two populations share a similar structure.

We will then explore how these tools can help engineers and decision-makers by turning statistical comparisons into clear information for faster and more reliable decisions.

In Section 1 of this article, we present two concrete examples where questions of representativeness between samples may arise. In Section 2, we evaluate representativeness between two datasets using PSI and Cramér's V. Finally, in Section 3, we demonstrate how to implement and automate these analyses in Python, exporting the results to an Excel file.

1. Two real-world examples of the representativeness problem

The question of representativeness becomes critical when a model is applied to a domain other than the one for which it was developed. Two typical situations illustrate this challenge:

1.1 When a model is applied to a new scope of clients

Imagine a bank developing a scoring model for small businesses. The model performs well and is recognized internally. Encouraged by this success, management decides to extend its use to large corporations. Your supervisor asks for your opinion on the approach. What steps do you take before responding?

Since the development and application populations differ, using the model on the new population extends its scope. It is therefore essential to confirm that this application is valid.

The statistician has several tools to address this question, in particular a representativeness analysis comparing the development population with the application population. This can be carried out by analyzing their characteristics variable by variable, for example through tests of mean equality, tests of distribution equality, or by comparing the distributions of categorical variables.

1.2 When two banks merge and need to align their risk models

Now imagine Bank A, a large institution with a substantial balance sheet and a proven model for assessing client default risk. Bank A is studying the possibility of merging with Bank B. Bank B, however, operates in a weaker economic environment and has not developed its own internal model.

Suppose Bank A's management approaches you, as the statistician responsible for its internal models. The strategic question is: would it be acceptable to apply Bank A's internal models to Bank B's portfolio in the event of a merger?

Before applying Bank A's internal model to Bank B's portfolio, it is essential to compare the distributions of key variables across both portfolios. The model can only be transferred with confidence if the two populations are truly representative of each other.

We have just presented two concrete cases where verifying representativeness is essential for sound decision-making. In the next section, we look at how to analyze representativeness between two portfolios by introducing two statistical tools: the Population Stability Index (PSI) and Cramér's V.

2. Comparing Distributions to Assess Representativeness Between Two Populations Using the Population Stability Index (PSI) and Cramér's V

In practice, the study of representativeness between two datasets consists of comparing the characteristics of the observed variables in both samples. This comparison relies on both statistical measures and visual tools.

From a statistical perspective, analysts typically examine measures of central tendency (mean, median) and dispersion (variance, standard deviation), as well as more granular indicators such as quantiles.

On the visual side, common tools include histograms, boxplots, cumulative distribution functions, density curves, and QQ-plots. These visualizations help detect potential differences in shape, location, or dispersion between two distributions.

Such graphical analyses are an essential first step: they guide the investigation and help formulate hypotheses. However, they must be complemented by statistical tests to confirm observations and reach rigorous conclusions. These tests include:

  • Parametric tests, such as Student's t-test (comparison of means),
  • Nonparametric tests, such as the Kolmogorov–Smirnov test (comparison of distributions), the chi-squared test (for categorical variables), and Welch's test (for unequal variances).
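As a quick illustration, here is a minimal sketch of how two of these tests can be run with scipy.stats; the two samples below are synthetic and purely hypothetical, not taken from the article's data.

import numpy as np
from scipy.stats import ttest_ind, ks_2samp

rng = np.random.default_rng(0)
sample_ref = rng.normal(loc=0.0, scale=1.0, size=1000)  # hypothetical reference sample
sample_new = rng.normal(loc=0.1, scale=1.0, size=1000)  # hypothetical application sample

# Parametric: Student's t-test compares the means of the two samples
t_stat, t_pval = ttest_ind(sample_ref, sample_new)

# Nonparametric: the Kolmogorov-Smirnov test compares the full distributions
ks_stat, ks_pval = ks_2samp(sample_ref, sample_new)

print(f"t-test p-value: {t_pval:.3f} | KS-test p-value: {ks_pval:.3f}")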

These approaches are well presented in the article by @matteo-courthoud. Beyond them, two indicators are particularly relevant in credit risk analysis for assessing distributional drift between populations and supporting decision-making: the Population Stability Index (PSI) and Cramér's V.

2.1. The Population Stability Index (PSI)

The PSI is a fundamental tool in the credit industry. It measures the difference between two distributions of the same variable:

  • for example, between the training dataset and a more recent application dataset,
  • or between a reference dataset at time T0 and another at time T1.

In other words, the PSI quantifies how much a population has drifted over time or across different scopes.

Here is how it works in practice:

  • For a categorical variable, we compute the proportion of observations in each category for both datasets.
  • For a continuous variable, we first discretize it into bins. In practice, deciles are often used to obtain a balanced distribution.

The PSI then compares, bin by bin, the proportions observed in the reference dataset versus the target dataset. The final indicator aggregates these differences using a logarithmic formula:

PSI = Σᵢ (pᵢ − qᵢ) · ln(pᵢ / qᵢ)

Here, pᵢ and qᵢ represent the proportions in bin i for the reference dataset and the target dataset, respectively. The PSI can also be computed easily in an Excel file:

Computation framework for the Population Stability Index (PSI).

The interpretation is highly intuitive:

  • A smaller PSI means the two distributions are closer.
  • A PSI of 0 means the distributions are identical.
  • A very large PSI (tending toward infinity) means the two distributions are fundamentally different.

In practice, industry guidelines often use the following thresholds (illustrated in the short sketch after this list):

  • PSI < 0.1: the population is stable,
  • 0.1 ≤ PSI < 0.25: the shift is noticeable; monitor closely,
  • PSI ≥ 0.25: the shift is significant; the model may no longer be reliable.
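A minimal sketch of the computation, using made-up bin proportions purely for illustration (the function simply mirrors the PSI formula above):

import numpy as np

def psi(p_ref, p_tgt, eps=1e-12):
    """Population Stability Index between two binned distributions."""
    p_ref = np.clip(np.asarray(p_ref, dtype=float), eps, 1.0)
    p_tgt = np.clip(np.asarray(p_tgt, dtype=float), eps, 1.0)
    return float(np.sum((p_ref - p_tgt) * np.log(p_ref / p_tgt)))

# Hypothetical proportions per bin (each set sums to 1)
p_reference = [0.20, 0.20, 0.20, 0.20, 0.20]
p_target    = [0.22, 0.19, 0.21, 0.18, 0.20]

print(f"PSI = {psi(p_reference, p_target):.4f}")  # well below 0.1 -> stable population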

2.2. Cramér’s V

When assessing the representativeness of a categorical variable (or a discretized continuous variable) between two datasets, a natural starting point is the Chi-square test of independence.

We build a contingency table crossing:

  • the categories (modalities) of the variable of interest, and
  • an indicator variable for dataset membership (Dataset 1 / Dataset 2).

The test is based on the following statistic:

χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

where Oᵢⱼ are the observed counts and Eᵢⱼ are the expected counts under the assumption of independence.

  • Null hypothesis H0: the variable has the same distribution in both datasets (independence).
  • Alternative hypothesis H1: the distributions differ.

If H0 is rejected, we conclude that the variable does not follow the same distribution across the two datasets.

However, the Chi-square test has a major limitation: it only provides a binary answer (reject / do not reject), and its power is highly sensitive to sample size. With very large datasets, even tiny differences can appear statistically significant.

To address this limitation, we use Cramér's V, which rescales the Chi-square statistic to provide a normalized measure of association bounded between 0 and 1:

V = √( χ² / (n · min(r − 1, c − 1)) )

where n is the total sample size, r is the number of rows, and c is the number of columns in the contingency table.

The interpretation is intuitive:

  • V ≈ 0 ⇒ the distributions are very similar; representativeness is strong.
  • V → 1 ⇒ the difference between the distributions is large; the datasets are structurally different.

Unlike the Chi-square test, which simply answers "yes" or "no," Cramér's V provides a graded measure of the strength of the difference. This allows us to assess whether the difference is negligible, moderate, or substantial.

We use the same thresholds as those applied for the PSI to draw our conclusions. For both indicators, if the distribution of one or more variables differs significantly between the two datasets, we conclude that they are not representative of each other.
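A minimal sketch of this computation, on a hypothetical 2 × K contingency table (rows = the two datasets, columns = the categories); the counts are made up for illustration:

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(contingency):
    """Cramér's V from a contingency table of counts."""
    chi2, _, _, _ = chi2_contingency(contingency, correction=False)
    n = contingency.sum()
    r, c = contingency.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

# Hypothetical counts: rows = Dataset 1 / Dataset 2, columns = categories
table = np.array([
    [500, 480, 510, 505],   # Dataset 1
    [ 95, 102,  98, 101],   # Dataset 2
])

print(f"Cramér's V = {cramers_v(table):.3f}")  # close to 0 -> similar structures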

3. Measuring Representativeness with PSI and Cramér’s V in Python.

In a previous article, we applied different variable selection methods to reduce the Communities & Crime dataset to just 16 explanatory variables. This step was essential to simplify the model while keeping the most relevant information.
This dataset also includes a variable called fold, which splits the data into 10 subsamples. These folds are commonly used in cross-validation: they allow us to test the robustness of a model by training it on one part of the data and validating it on another. For cross-validation to be reliable, each fold must be representative of the global dataset:

  1. To ensure valid performance estimates.
  2. To prevent bias: a non-representative fold can distort model results.
  3. To support generalization: representative folds give a better indication of how the model will perform on new data.

In this example, we will focus on checking whether fold 1 is representative of the global dataset using our two indicators, PSI and Cramér's V, by comparing the distributions of the 16 variables across the two samples. We will proceed in two steps:

Step 1: Start with the Target Variable

We begin with the target variable. The idea is simple: compare its distribution between fold 1 and the entire dataset. To quantify this difference, we will use two complementary indicators:

  • the Population Stability Index (PSI), which measures distributional shifts,
  • Cramér's V, which measures the strength of association between two categorical variables.

Step 2: Automating the Analysis for All Variables

After illustrating the approach with the target, we extend it to all features. We will build a Python function that computes PSI and Cramér's V for each of the 16 explanatory variables, as well as for the target variable.

To make the results easy to interpret, we will export everything into an Excel file with:

  • one sheet per variable, showing the detailed comparison by segment,
  • a Summary tab, aggregating results across all variables.

3.1 Comparing the target variable ViolentCrimesPerPop between the global dataset (reference) and fold 1 (target)

Before applying statistical tests or building decision indicators, it is essential to conduct a descriptive and graphical analysis. These are not mere formalities; they provide an early intuition about the differences between populations and help in interpreting the results. In practice, a well-chosen chart often reveals the conclusions that indicators like PSI or Cramér's V will later confirm (or challenge).

For visualization, we proceed in three steps:

1. Comparing continuous distributions. We begin with graphical tools such as boxplots, cumulative distribution functions, and probability density plots. These visualizations provide an intuitive way to examine differences in the target variable's distribution between the two datasets.

2. Discretization into quantiles. Next, we discretize the variable in the reference dataset into quintiles, which creates five classes (Q1 through Q5). We then apply the exact same cut-off points to the target dataset, ensuring that every observation is mapped to intervals defined from the reference. This guarantees comparability between the two distributions.

3. Comparing categorical distributions. Finally, once the variable has been discretized, we can use visualization methods suited to categorical data, such as bar charts, to compare how frequencies are distributed across the two datasets.

The approach depends on the type of variable:

For a continuous variable:

  • Start with standard visualizations (boxplots, cumulative distributions, and density plots).
  • Next, split the variable into segments (Q1 to Q5) based on the reference dataset's quantiles.
  • Finally, treat these segments as categories and compare their distributions.

For a categorical variable:

  • No discretization is needed; it is already in categorical form.
  • Go straight to comparing class distributions, for example with a bar chart.

The code below prepares the two datasets we want to compare and then visualizes the target variable with a boxplot, showing its distribution in both the global dataset and in fold 1.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, ks_2samp

data = pd.read_csv("communities_data.csv")

# Reference = global dataset; target = fold 1
data_ref = data
data_target = data[data["fold"] == 1]

# Compare the distribution of "ViolentCrimesPerPop" in the reference and target datasets with boxplots

# Build datasets with a "Group" column
df_ref = pd.DataFrame({
    "ViolentCrimesPerPop": data_ref["ViolentCrimesPerPop"],
    "Group": "Reference"
})

df_target = pd.DataFrame({
    "ViolentCrimesPerPop": data_target["ViolentCrimesPerPop"],
    "Group": "Target"
})

# Merge them
df_all = pd.concat([df_ref, df_target])

plt.figure(figsize=(8, 6))

# Boxplot with both distributions overlayed
sns.boxplot(
    x="Group",
    y="ViolentCrimesPerPop",
    data=df_all,
    palette="Set2",
    width=0.6,
    fliersize=3
)

# Add mean points
means = df_all.groupby("Group")["ViolentCrimesPerPop"].mean()
for i, m in enumerate(means):
    plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")

# Title tells the story
plt.title("Violent Crimes Per Population by Group", fontsize=14, weight="bold")
plt.suptitle("Both groups show nearly identical distributions",
             fontsize=10, color="gray")

plt.ylabel("Violent Crimes (Per Pop)", fontsize=12)
plt.xlabel("")

# Cleaner look
sns.despine()
plt.grid(axis="y", linestyle="--", alpha=0.5, visible=False)
plt.legend()

plt.show()


print(len(data.columns))

The figure above suggests that both groups share similar distributions for the ViolentCrimesPerPop variable. To take a closer look, we can use Kernel Density Estimation (KDE) plots, which provide a smooth view of the underlying distribution and make it easier to spot subtle differences.

plt.figure(figsize=(8, 6))

# KDE plots with better styling
sns.kdeplot(
    data=df_all,
    x="ViolentCrimesPerPop",
    hue="Group",
    fill=True,         # use shading to show overlap
    alpha=0.4,         # transparency to show overlap
    common_norm=False,
    palette="Set2",
    linewidth=2
)

# KS-test for distribution difference
g1 = df_all[df_all["Group"] == df_all["Group"].unique()[0]]["ViolentCrimesPerPop"]
g2 = df_all[df_all["Group"] == df_all["Group"].unique()[1]]["ViolentCrimesPerPop"]
stat, pval = ks_2samp(g1, g2)

# Add annotation
plt.text(df_all["ViolentCrimesPerPop"].mean(),
         plt.ylim()[1]*0.9,
         f"KS-test p-value = {pval:.3f}\nNo significant difference observed",
         ha="center", fontsize=10, color="black")

# Titles with story
plt.title("Kernel Density Estimation of Violent Crimes Per Population", fontsize=14, weight="bold")
plt.suptitle("Distributions overlap almost completely between groups", fontsize=10, color="gray")

plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Density")

sns.despine()
plt.grid(False)
plt.show()

The KDE plot confirms that the two distributions are very similar, showing a high degree of overlap. The Kolmogorov-Smirnov (KS) test, with a p-value of 0.976, also indicates that there is no significant difference between the two groups. To extend the analysis, we can now examine the cumulative distribution of the target variable.

# Cumulative distribution
plt.figure(figsize=(9, 6))
sns.histplot(
    data=df_all,
    x="ViolentCrimesPerPop",
    hue="Group",
    stat="density",
    common_norm=False,
    fill=False,
    element="step",
    bins=len(df_all),
    cumulative=True,
)

# Titles tell the story
plt.title("Cumulative Distribution of Violent Crimes Per Population", fontsize=14, weight="bold")
plt.suptitle("ECDFs overlap extensively; central tendencies are nearly identical", fontsize=10)

# Labels & cleanup
plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Cumulative proportion")
plt.grid(visible=False)
plt.show()

The cumulative distribution plot provides further evidence that the two groups are very similar. The curves overlap almost completely, suggesting that their distributions are nearly identical in both central tendency and spread.

As a next step, we will discretize the variable into quantiles in the reference dataset and then apply the same cut-off points to the target dataset (fold 1). The code below demonstrates how to do this. Finally, we will compare the resulting distributions using a bar chart.

def bin_numeric(ref, tgt, n_bins=5):
    """
    Discretize a numeric variable into quantile bins (e.g., quintiles).
    - Quantile thresholds are computed only on the reference dataset.
    - Extend bins with -inf and +inf to cover all possible values.
    - Returns:
        * ref binned
        * tgt binned
        * bin labels (Q1, Q2, ...)
    """
    edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 3:  # if the variable is almost constant
        edges = np.array([-np.inf, np.inf])
    else:
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i}" for i in range(1, len(edges))]
    return (
        pd.cut(ref, bins=edges, labels=labels, include_lowest=True),
        pd.cut(tgt, bins=edges, labels=labels, include_lowest=True),
        labels
    )

# Apply binning
ref_binned, tgt_binned, bin_labels = bin_numeric(data_ref["ViolentCrimesPerPop"], data_target["ViolentCrimesPerPop"], n_bins=5)


# Counts per segment for Reference and Target
ref_counts = ref_binned.value_counts().reindex(bin_labels, fill_value=0)
tgt_counts = tgt_binned.value_counts().reindex(bin_labels, fill_value=0)

# Convert to proportions
ref_props = ref_counts / ref_counts.sum()
tgt_props = tgt_counts / tgt_counts.sum()

# Build a DataFrame for seaborn
df_props = pd.DataFrame({
    "Segment": bin_labels,
    "Reference": ref_props.values,
    "Target": tgt_props.values
})

# Reshape to long format
df_long = df_props.melt(id_vars="Segment",
                        value_vars=["Reference", "Target"],
                        var_name="Source",
                        value_name="Proportion")

# Clean style
sns.set_theme(style="whitegrid")

# Barplot with proportions
plt.figure(figsize=(8, 6))
sns.barplot(
    x="Segment", y="Proportion", hue="Source",
    data=df_long, palette=["#4C72B0", "#55A868"]  # muted blue & green
)

# Titles with story
plt.title("Proportion Comparison by Segment (ViolentCrimesPerPop)", fontsize=14, weight="bold")
plt.suptitle("Across all quantile segments (Q1–Q5), proportions are nearly identical", fontsize=10, color="gray")

plt.xlabel("Quantile Segment (Q1 - Q5)")
plt.ylabel("Proportion")
plt.legend(title="Dataset", loc="upper right")
plt.grid(False)
plt.show()

As before, we reach the same conclusion: the distributions in the reference and target datasets are very similar. To move beyond visual inspection, we will now compute the Population Stability Index (PSI) and Cramér's V statistic. These metrics allow us to quantify the differences between distributions, both for all variables in general and for the target variable ViolentCrimesPerPop in particular.

3.2 Automating the Analysis for All Variables

As mentioned earlier, the results of the distribution comparisons for each variable between the two datasets, computed using PSI and Cramér's V, are presented in separate sheets within a single Excel file.

To illustrate, we begin by analyzing the results for the target variable ViolentCrimesPerPop when comparing the global dataset (reference) with fold 1 (target). Table 1 below summarizes how both PSI and Cramér's V are computed.

Table 1: PSI and Cramér's V for ViolentCrimesPerPop: Global Dataset (Reference) vs. Fold 1 (Target)

Since both PSI and Cramér's V are below 0.1, we can conclude that the target variable ViolentCrimesPerPop follows the same distribution in both datasets.

The code that generated this table is shown below. The same code is also used to produce results for all variables and export them into an Excel file called representativity.xlsx.

EPS = 1e-12  # A very small constant to avoid division by zero or log(0)

# ============================================================
# 1. Basic functions
# ============================================================

def safe_proportions(counts):
    """
    Convert raw counts into proportions in a safe way.
    - If the total count = 0, return all zeros (to avoid division by zero).
    - Clip values so no proportion is exactly 0 or 1 (numerical stability).
    """
    total = counts.sum()
    if total == 0:
        return np.zeros_like(counts, dtype=float)
    p = counts / total
    return np.clip(p, EPS, 1.0)

def calculate_psi(p_ref, p_tgt):
    """
    Compute the Population Stability Index (PSI) between two distributions.

    PSI = sum( (p_ref - p_tgt) * log(p_ref / p_tgt) )

    Interpretation:
    - PSI < 0.1  → stable
    - 0.1–0.25   → moderate shift
    - > 0.25     → major shift
    """
    p_ref = np.clip(p_ref, EPS, 1.0)
    p_tgt = np.clip(p_tgt, EPS, 1.0)
    return float(np.sum((p_ref - p_tgt) * np.log(p_ref / p_tgt)))

def calculate_cramers_v(contingency):
    """
    Compute Cramér's V statistic for the association between two categorical variables.
    - Input: a 2 x K contingency table (counts).
    - Uses the Chi² test.
    - Normalizes the result to [0, 1].
      * 0   → no association
      * 1   → perfect association
    """
    n = contingency.sum()
    r, c = contingency.shape
    if n == 0 or min(r - 1, c - 1) == 0:
        return 0.0
    chi2, _, _, _ = chi2_contingency(contingency, correction=False)
    return np.sqrt(chi2 / (n * (min(r - 1, c - 1))))

# ============================================================
# 2. Preparing variables
# ============================================================

def bin_numeric(ref, tgt, n_bins=5):
    """
    Discretize a numeric variable into quantile bins (e.g., quintiles).
    - Quantile thresholds are computed only on the reference dataset.
    - Extend bins with -inf and +inf to cover all possible values.
    - Returns:
        * ref binned
        * tgt binned
        * bin labels (Q1, Q2, ...)
    """
    edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 3:  # if the variable is almost constant
        edges = np.array([-np.inf, np.inf])
    else:
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i}" for i in range(1, len(edges))]
    return (
        pd.cut(ref, bins=edges, labels=labels, include_lowest=True),
        pd.cut(tgt, bins=edges, labels=labels, include_lowest=True),
        labels
    )

def prepare_counts(ref, tgt, n_bins=5):
    """
    Prepare frequency counts for one variable.
    - If numeric: discretize into quantile bins.
    - If categorical: take all categories present in either dataset.
    Returns:
      segments, counts in reference, counts in target
    """
    if pd.api.types.is_numeric_dtype(ref) and pd.api.types.is_numeric_dtype(tgt):
        ref_b, tgt_b, labels = bin_numeric(ref, tgt, n_bins)
        segments = labels
    else:
        segments = sorted(set(ref.dropna().unique()) | set(tgt.dropna().unique()))
        ref_b, tgt_b = ref.astype(str), tgt.astype(str)

    ref_counts = ref_b.value_counts().reindex(segments, fill_value=0)
    tgt_counts = tgt_b.value_counts().reindex(segments, fill_value=0)
    return segments, ref_counts, tgt_counts

# ============================================================
# 3. Analysis per variable
# ============================================================

def analyze_variable(ref, tgt, n_bins=5):
    """
    Analyze a single variable between two datasets.
    Steps:
    - Build counts by segment (bin for numeric, category for categorical).
    - Compute PSI by segment and global PSI.
    - Compute Cramér's V from the contingency table.
    - Return:
        DataFrame with details
        Summary dictionary (psi, v_cramer)
    """
    segments, ref_counts, tgt_counts = prepare_counts(ref, tgt, n_bins)
    p_ref, p_tgt = safe_proportions(ref_counts.values), safe_proportions(tgt_counts.values)

    # PSI
    psi_global = calculate_psi(p_ref, p_tgt)
    psi_by_segment = (p_ref - p_tgt) * np.log(p_ref / p_tgt)

    # Cramér's V
    contingency = np.vstack([ref_counts.values, tgt_counts.values])
    v_cramer = calculate_cramers_v(contingency)

    # Build the detailed results table
    df = pd.DataFrame({
        "Segment": segments,
        "Count Reference": ref_counts.values,
        "Count Target": tgt_counts.values,
        "% Reference": p_ref,
        "% Target": p_tgt,
        "PSI by Segment": psi_by_segment
    })

    # Add summary lines at the bottom of the table
    df.loc[len(df)] = ["Global PSI", np.nan, np.nan, np.nan, np.nan, psi_global]
    df.loc[len(df)] = ["Cramer's V", np.nan, np.nan, np.nan, np.nan, v_cramer]

    return df, {"psi": psi_global, "v_cramer": v_cramer}

# ============================================================
# 4. Excel reporting utilities
# ============================================================

def apply_traffic_light(ws, wb, first_row, last_row, col, low, high):
    """
    Apply conditional formatting (traffic-light colors) to a numeric column in Excel:
    - green  if value < low
    - orange if low <= value <= high
    - red    if value > high

    Note: first_row, last_row, and col are zero-based indices (xlsxwriter convention).
    """
    green  = wb.add_format({"bg_color": "#C6EFCE", "font_color": "#006100"})
    orange = wb.add_format({"bg_color": "#FCD5B4", "font_color": "#974706"})
    red    = wb.add_format({"bg_color": "#FFC7CE", "font_color": "#9C0006"})

    if last_row < first_row:
        return  # nothing to color

    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": "<", "value": low, "format": green})
    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": "between", "minimum": low, "maximum": high, "format": orange})
    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": ">", "value": high, "format": red})

def representativity_report(ref_df, tgt_df, variables, output="representativity.xlsx",
                            n_bins=5, psi_thresholds=(0.10, 0.25),
                            v_thresholds=(0.10, 0.25), color_summary=True):
    """
    Build a representativeness report across several variables and export it to Excel.

    For each variable:
      - Create a sheet with detailed PSI by segment, global PSI, and Cramer's V.
      - Apply traffic-light colors for easier interpretation.

    Create one "Summary" sheet with the global PSI and Cramer's V for all variables.
    """
    summary = []

    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        wb = writer.book
        fmt_header = wb.add_format({"bold": True, "bg_color": "#0070C0",
                                    "font_color": "white", "align": "center"})
        fmt_pct   = wb.add_format({"num_format": "0.00%"})
        fmt_ratio = wb.add_format({"num_format": "0.000"})
        fmt_int   = wb.add_format({"num_format": "0"})

        for var in variables:
            # Analyze the variable
            df, meta = analyze_variable(ref_df[var], tgt_df[var], n_bins)
            sheet = var[:31]  # Excel sheet names are limited to 31 characters
            df.to_excel(writer, sheet_name=sheet, index=False)
            ws = writer.sheets[sheet]

            # Format headers and columns
            for j, col in enumerate(df.columns):
                ws.write(0, j, col, fmt_header)
            ws.set_column(0, 0, 18)
            ws.set_column(1, 2, 16, fmt_int)
            ws.set_column(3, 4, 20, fmt_pct)
            ws.set_column(5, 5, 18, fmt_ratio)

            nrows = len(df)   # number of data rows (excluding the header)
            col_psi = 5       # "PSI by Segment" column index

            # PSI by Segment rows
            apply_traffic_light(ws, wb, first_row=1, last_row=max(1, nrows-2),
                                col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])

            # Global PSI row (second to last)
            apply_traffic_light(ws, wb, first_row=nrows-1, last_row=nrows-1,
                                col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])

            # Cramer's V row (last row)
            apply_traffic_light(ws, wb, first_row=nrows, last_row=nrows,
                                col=col_psi, low=v_thresholds[0], high=v_thresholds[1])

            # Add summary info for the Summary sheet
            summary.append({"Variable": var,
                            "Global PSI": meta["psi"],
                            "Cramer's V": meta["v_cramer"]})

        # Summary sheet
        df_sum = pd.DataFrame(summary)
        df_sum.to_excel(writer, sheet_name="Summary", index=False)
        ws = writer.sheets["Summary"]
        for j, col in enumerate(df_sum.columns):
            ws.write(0, j, col, fmt_header)
        ws.set_column(0, 0, 28)
        ws.set_column(1, 2, 16, fmt_ratio)

        # Apply traffic lights to the summary sheet
        if color_summary and len(df_sum) > 0:
            last = len(df_sum)
            # PSI column
            apply_traffic_light(ws, wb, 1, last, 1, psi_thresholds[0], psi_thresholds[1])
            # Cramer's V column
            apply_traffic_light(ws, wb, 1, last, 2, v_thresholds[0], v_thresholds[1])

    return output

# ============================================================
# Example
# ============================================================

if __name__ == "__main__":
    # Column names, excluding "fold"
    columns = [x for x in data.columns if x != "fold"]

    # Generate the report
    path = representativity_report(data_ref, data_target, columns, output="representativity.xlsx")
    print(f"Report generated: {path}")

Finally, Table 2 shows the last sheet of the file, titled Summary, which brings together the results for all variables of interest.

Table 2: PSI and Cramér's V summary for all variables: Global Dataset vs. Fold 1

This synthesis provides an overall view of representativeness between the two datasets, making interpretation and decision-making much easier. Since both PSI and Cramér's V are below 0.1, we can conclude that all variables follow the same distribution in the global dataset and in fold 1. Therefore, fold 1 can be considered representative of the global dataset.

Conclusion

In this post, we explored how to study representativeness between two datasets by comparing the distributions of their variables. We introduced two key indicators, the Population Stability Index (PSI) and Cramér's V, which are both easy to use, easy to interpret, and highly valuable for decision-making.

We also showed how these analyses can be automated, with the results saved directly into an Excel file.

The main takeaway is this: if you build a model and end up with overfitting, one possible reason may be that your training and test sets are not representative of each other. A simple way to prevent this is to always run a representativeness analysis between datasets. Variables that show representativeness issues can then guide you in stratifying your data when splitting it into training and test sets. What about you? In what situations do you study representativeness between two datasets, for what reasons, and using what methods?

References

Yurdakul, B. (2018). Statistical properties of population stability index. Western Michigan University.

Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.

Knowledge & Licensing

The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.

For more details, see the official license text: CC BY 4.0.

Disclaimer

I write to learn, so mistakes are the norm, even though I try my best. Please let me know when you spot them. I also appreciate suggestions for new topics!
