Module 3.3: Real-World Data Challenges

Time: ~20 min · Prerequisites: previous modules

What You'll Learn

  1. Handling missing data properly
  2. Bootstrap confidence intervals
  3. Multiple testing correction (FDR)
  4. Checking stationarity assumptions
  5. Common pitfalls and solutions

Setup

import numpy as np
import matplotlib.pyplot as plt
from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.toymodels import structural_causal_processes as toys

Challenge 1: Missing Data

Real data has gaps. Tigramite handles them directly: flag missing entries with missing_flag and the affected time slices are excluded from each test, so no imputation is needed.

# Create data with missing values
np.random.seed(42)

def lin_f(x): return x
links = {
    0: [((0, -1), 0.7, lin_f)],
    1: [((1, -1), 0.6, lin_f), ((0, -1), 0.5, lin_f)],
    2: [((2, -1), 0.5, lin_f), ((1, -1), 0.4, lin_f)],
}
data, _ = toys.structural_causal_process(links, T=1000, seed=42)

# Introduce 10% missing values randomly
data_missing = data.copy()
missing_mask = np.random.random(data.shape) < 0.10
data_missing[missing_mask] = np.nan

print(f"Missing values: ~10%")
# Method: Use missing_flag parameter
dataframe = pp.DataFrame(
    data_missing,
    var_names=['X', 'Y', 'Z'],
    missing_flag=np.nan  # Tell Tigramite what marks missing values
)

# Run PCMCI - it will automatically handle missing values
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

# Note: time slices with missing values are excluded, reducing the effective sample size
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)
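
An alternative to missing_flag is an explicit boolean mask, useful when you want to exclude samples for reasons other than NaNs (e.g., known sensor outages). A minimal sketch, assuming Tigramite's convention that True entries are excluded and using the tests' mask_type parameter:

# Sketch: masking instead of missing_flag (assumes mask=True means "exclude this sample")
mask = np.isnan(data_missing)
data_filled = np.nan_to_num(data_missing)  # placeholder values; masked entries are ignored by the tests
dataframe_masked = pp.DataFrame(data_filled, var_names=['X', 'Y', 'Z'], mask=mask)
parcorr_masked = ParCorr(significance='analytic', mask_type='y')  # apply mask to the target variable
pcmci_masked = PCMCI(dataframe=dataframe_masked, cond_ind_test=parcorr_masked, verbosity=0)
results_masked = pcmci_masked.run_pcmci(tau_max=3, pc_alpha=0.05)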

Challenge 2: Multiple Testing Correction

With many variables and lags, we perform MANY tests. This inflates false positives.
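
To see the scale of the problem, count the tests. A back-of-envelope sketch (the exact number depends on what the PC phase prunes, so treat this as an order-of-magnitude estimate):

# Rough count of lagged tests for 3 variables and tau_max=5 (illustrative only)
N, tau_max, alpha = 3, 5, 0.05
n_tests = N * N * tau_max  # one test per (source, target, lag) combination
print(f"~{n_tests} tests → expect ~{n_tests * alpha:.1f} false positives at alpha={alpha}")
# ~45 tests → roughly 2 links significant by chance alone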

Solution: False Discovery Rate (FDR) correction

# Create clean data
data_clean, _ = toys.structural_causal_process(links, T=1000, seed=42)
dataframe = pp.DataFrame(data_clean, var_names=['X', 'Y', 'Z'])

parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=5, pc_alpha=0.05)

# Without correction
print("WITHOUT FDR correction (alpha=0.05):")
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)

# With FDR correction
q_matrix = pcmci.get_corrected_pvalues(
    p_matrix=results['p_matrix'],
    tau_max=5,
    fdr_method='fdr_bh'  # Benjamini-Hochberg method
)

print("\nWITH FDR correction (q=0.05):")
pcmci.print_significant_links(
    p_matrix=q_matrix,
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)

# FDR correction controls the expected fraction of false positives among reported links
Tip: Always apply FDR correction when analyzing data with many variables or large tau_max!

Challenge 3: Confidence in Results (Bootstrap)

How confident are we in the discovered links? Use bootstrap!

# Bootstrap confidence intervals
parcorr_boot = ParCorr(
    significance='analytic',
    confidence='bootstrap',  # Enable bootstrap
    conf_lev=0.95,           # 95% confidence interval
    conf_samples=100         # Number of bootstrap samples
)

pcmci_boot = PCMCI(dataframe=dataframe, cond_ind_test=parcorr_boot, verbosity=0)
results_boot = pcmci_boot.run_pcmci(tau_max=3, pc_alpha=0.05)

# The results dict now includes 'conf_matrix' with the lower and upper
# confidence bounds for each link strength, i.e. the estimation uncertainty

print("Bootstrap confidence intervals for link strengths:")
for j in range(3):
    for i in range(3):
        for tau in range(4):
            if results_boot['graph'][j, i, tau] == '-->':
                val = results_boot['val_matrix'][j, i, tau]
                print(f"  {['X','Y','Z'][i]}(t-{tau}) → {['X','Y','Z'][j]}: {val:.3f}")

Challenge 4: Checking Stationarity

PCMCI assumes the causal structure is stationary (doesn't change over time).

Quick check: Does running PCMCI on different time windows give similar results?

# Check stationarity by comparing two halves of the data
data_first_half = data_clean[:500]
data_second_half = data_clean[500:]

# Run on first half
df1 = pp.DataFrame(data_first_half, var_names=['X', 'Y', 'Z'])
pcmci1 = PCMCI(dataframe=df1, cond_ind_test=ParCorr(), verbosity=0)
results1 = pcmci1.run_pcmci(tau_max=3, pc_alpha=0.05)

# Run on second half
df2 = pp.DataFrame(data_second_half, var_names=['X', 'Y', 'Z'])
pcmci2 = PCMCI(dataframe=df2, cond_ind_test=ParCorr(), verbosity=0)
results2 = pcmci2.run_pcmci(tau_max=3, pc_alpha=0.05)

# Compare graphs
print("First half of data:")
pcmci1.print_significant_links(
    p_matrix=results1['p_matrix'],
    val_matrix=results1['val_matrix'],
    alpha_level=0.01
)

print("\nSecond half of data:")
pcmci2.print_significant_links(
    p_matrix=results2['p_matrix'],
    val_matrix=results2['val_matrix'],
    alpha_level=0.01
)

# If results are similar → data is likely stationary
# If results differ → consider RPCMCI for regime changes
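
For a quick quantitative comparison, the two graph arrays can be checked entry by entry (a sketch; graph entries are link-type strings such as '-->' or ''):

# Fraction of (source, target, lag) entries with the same link type in both halves
agreement = (results1['graph'] == results2['graph']).mean()
print(f"Graph agreement between halves: {agreement:.1%}")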

Common Pitfalls & Solutions

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Too few samples | Noisy, unstable results | Typically need T > 500 |
| Wrong test choice | Missed links | Check linearity; use CMIknn if unsure (sketch below) |
| tau_max too small | Missed lagged effects | Choose via lag function plots |
| tau_max too large | Too many tests, false positives | Apply FDR correction |
| Hidden confounders | Spurious links | Try LPCMCI |
| Non-stationarity | Inconsistent results | Use RPCMCI or window analysis |
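
For the nonlinear case flagged in the table, swapping the test is a small change. A sketch using CMIknn (its permutation-based significance makes it far slower than ParCorr):

from tigramite.independence_tests.cmiknn import CMIknn

# Nonparametric test that can detect nonlinear dependencies
cmiknn = CMIknn(significance='shuffle_test')
pcmci_nonlin = PCMCI(dataframe=dataframe, cond_ind_test=cmiknn, verbosity=0)
# results_nonlin = pcmci_nonlin.run_pcmci(tau_max=3, pc_alpha=0.05)  # expect long runtimes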

Pre-Analysis Checklist

Before running PCMCI, always work through these steps (a combined code sketch follows the list):

  1. Plot your data - tp.plot_timeseries()
  2. Check for missing values - Handle with missing_flag
  3. Check linearity - tp.plot_scatterplots()
  4. Choose tau_max - tp.plot_lagfuncs()
  5. Check stationarity - Compare different time windows
  6. Apply FDR correction - pcmci.get_corrected_pvalues()
  7. Validate with toy data - Test on known structure first
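
A combined sketch of steps 1, 3, and 4 on the toy data (steps 2, 5, and 6 are demonstrated above; get_lagged_dependencies is assumed here for computing lag functions, so check it against your Tigramite version):

# Pre-analysis sketch: inspect the data before any causal discovery
dataframe = pp.DataFrame(data_clean, var_names=['X', 'Y', 'Z'])

# 1. Plot the raw series
tp.plot_timeseries(dataframe=dataframe); plt.show()

# 3. Check linearity with pairwise scatterplots
tp.plot_scatterplots(dataframe=dataframe); plt.show()

# 4. Choose tau_max from where lagged dependencies decay
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr(), verbosity=0)
correlations = pcmci.get_lagged_dependencies(tau_max=20, val_only=True)['val_matrix']
tp.plot_lagfuncs(val_matrix=correlations, setup_args={'var_names': ['X', 'Y', 'Z']}); plt.show()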