Module 3.3: Real-World Data Challenges

Time: ~20 min · Prerequisites: previous modules

What You'll Learn

  1. Handling missing data properly
  2. Bootstrap confidence intervals
  3. Multiple testing correction (FDR)
  4. Checking stationarity assumptions
  5. Common pitfalls and solutions

Setup

import numpy as np
import matplotlib.pyplot as plt
from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.toymodels import structural_causal_processes as toys

Challenge 1: Missing Data

Real data has gaps. Tigramite handles them directly: flag missing entries with missing_flag and the affected time slices are excluded from each test, so no imputation is needed.

# Create data with missing values
np.random.seed(42)

def lin_f(x): return x
links = {
    0: [((0, -1), 0.7, lin_f)],
    1: [((1, -1), 0.6, lin_f), ((0, -1), 0.5, lin_f)],
    2: [((2, -1), 0.5, lin_f), ((1, -1), 0.4, lin_f)],
}
data, _ = toys.structural_causal_process(links, T=1000, seed=42)

# Introduce 10% missing values randomly
data_missing = data.copy()
missing_mask = np.random.random(data.shape) < 0.10
data_missing[missing_mask] = np.nan

print(f"Missing values: ~10%")
# Method: Use missing_flag parameter
dataframe = pp.DataFrame(
    data_missing,
    var_names=['X', 'Y', 'Z'],
    missing_flag=np.nan  # Tell Tigramite what marks missing values
)

# Run PCMCI - it will automatically handle missing values
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

# Note: time slices with missing values are excluded, reducing the effective sample size
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)
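
An alternative to missing_flag is an explicit boolean mask, useful when you want to exclude samples for reasons other than NaNs (e.g., known sensor outages). A minimal sketch, assuming Tigramite's convention that True entries are excluded and using the tests' mask_type parameter:

# Sketch: masking instead of missing_flag (assumes mask=True means "exclude this sample")
mask = np.isnan(data_missing)
data_filled = np.nan_to_num(data_missing)  # placeholder values; masked entries are ignored by the tests
dataframe_masked = pp.DataFrame(data_filled, var_names=['X', 'Y', 'Z'], mask=mask)
parcorr_masked = ParCorr(significance='analytic', mask_type='y')  # apply mask to the target variable
pcmci_masked = PCMCI(dataframe=dataframe_masked, cond_ind_test=parcorr_masked, verbosity=0)
results_masked = pcmci_masked.run_pcmci(tau_max=3, pc_alpha=0.05)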

Challenge 2: Multiple Testing Correction

With many variables and lags, we perform MANY tests. This inflates false positives.
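
To see the scale of the problem, count the tests. A back-of-envelope sketch (the exact number depends on what the PC phase prunes, so treat this as an order-of-magnitude estimate):

# Rough count of lagged tests for 3 variables and tau_max=5 (illustrative only)
N, tau_max, alpha = 3, 5, 0.05
n_tests = N * N * tau_max  # one test per (source, target, lag) combination
print(f"~{n_tests} tests → expect ~{n_tests * alpha:.1f} false positives at alpha={alpha}")
# ~45 tests → roughly 2 links significant by chance alone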

Solution: False Discovery Rate (FDR) correction

# Create clean data
data_clean, _ = toys.structural_causal_process(links, T=1000, seed=42)
dataframe = pp.DataFrame(data_clean, var_names=['X', 'Y', 'Z'])

parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=5, pc_alpha=0.05)

# Without correction
print("WITHOUT FDR correction (alpha=0.05):")
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)

# With FDR correction
q_matrix = pcmci.get_corrected_pvalues(
    p_matrix=results['p_matrix'],
    tau_max=5,
    fdr_method='fdr_bh'  # Benjamini-Hochberg method
)

print("\nWITH FDR correction (q=0.05):")
pcmci.print_significant_links(
    p_matrix=q_matrix,
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)

# FDR correction controls the expected fraction of false positives among reported links
Tip: Always apply FDR correction when analyzing data with many variables or large tau_max!

Challenge 3: Confidence in Results (Bootstrap)

How confident are we in the discovered links? Use bootstrap!

# Bootstrap confidence intervals
parcorr_boot = ParCorr(
    significance='analytic',
    confidence='bootstrap',  # Enable bootstrap
    conf_lev=0.95,           # 95% confidence interval
    conf_samples=100         # Number of bootstrap samples
)

pcmci_boot = PCMCI(dataframe=dataframe, cond_ind_test=parcorr_boot, verbosity=0)
results_boot = pcmci_boot.run_pcmci(tau_max=3, pc_alpha=0.05)

# The results dict now includes 'conf_matrix' with the lower and upper
# confidence bounds for each link strength, i.e. the estimation uncertainty

print("Bootstrap confidence intervals for link strengths:")
for j in range(3):
    for i in range(3):
        for tau in range(4):
            if results_boot['graph'][j, i, tau] == '-->':
                val = results_boot['val_matrix'][j, i, tau]
                print(f"  {['X','Y','Z'][i]}(t-{tau}) → {['X','Y','Z'][j]}: {val:.3f}")

Challenge 4: Checking Stationarity

PCMCI assumes the causal structure is stationary (doesn't change over time).

Quick check: Does running PCMCI on different time windows give similar results?

# Check stationarity by comparing two halves of the data
data_first_half = data_clean[:500]
data_second_half = data_clean[500:]

# Run on first half
df1 = pp.DataFrame(data_first_half, var_names=['X', 'Y', 'Z'])
pcmci1 = PCMCI(dataframe=df1, cond_ind_test=ParCorr(), verbosity=0)
results1 = pcmci1.run_pcmci(tau_max=3, pc_alpha=0.05)

# Run on second half
df2 = pp.DataFrame(data_second_half, var_names=['X', 'Y', 'Z'])
pcmci2 = PCMCI(dataframe=df2, cond_ind_test=ParCorr(), verbosity=0)
results2 = pcmci2.run_pcmci(tau_max=3, pc_alpha=0.05)

# Compare graphs
print("First half of data:")
pcmci1.print_significant_links(
    p_matrix=results1['p_matrix'],
    val_matrix=results1['val_matrix'],
    alpha_level=0.01
)

print("\nSecond half of data:")
pcmci2.print_significant_links(
    p_matrix=results2['p_matrix'],
    val_matrix=results2['val_matrix'],
    alpha_level=0.01
)

# If results are similar → data is likely stationary
# If results differ → consider RPCMCI for regime changes
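
For a quick quantitative comparison, the two graph arrays can be checked entry by entry (a sketch; graph entries are link-type strings such as '-->' or ''):

# Fraction of (source, target, lag) entries with the same link type in both halves
agreement = (results1['graph'] == results2['graph']).mean()
print(f"Graph agreement between halves: {agreement:.1%}")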

Common Pitfalls & Solutions

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Too few samples | Noisy, unstable results | Typically need T > 500 |
| Wrong test choice | Missed links | Check linearity; use CMIknn if unsure (sketch below) |
| tau_max too small | Missed lagged effects | Choose via lag function plots |
| tau_max too large | Too many tests, false positives | Apply FDR correction |
| Hidden confounders | Spurious links | Try LPCMCI |
| Non-stationarity | Inconsistent results | Use RPCMCI or window analysis |
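
For the nonlinear case flagged in the table, swapping the test is a small change. A sketch using CMIknn (its permutation-based significance makes it far slower than ParCorr):

from tigramite.independence_tests.cmiknn import CMIknn

# Nonparametric test that can detect nonlinear dependencies
cmiknn = CMIknn(significance='shuffle_test')
pcmci_nonlin = PCMCI(dataframe=dataframe, cond_ind_test=cmiknn, verbosity=0)
# results_nonlin = pcmci_nonlin.run_pcmci(tau_max=3, pc_alpha=0.05)  # expect long runtimes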

Pre-Analysis Checklist

Before running PCMCI, always work through these steps (a combined code sketch follows the list):

  1. Plot your data - tp.plot_timeseries()
  2. Check for missing values - Handle with missing_flag
  3. Check linearity - tp.plot_scatterplots()
  4. Choose tau_max - tp.plot_lagfuncs()
  5. Check stationarity - Compare different time windows
  6. Apply FDR correction - pcmci.get_corrected_pvalues()
  7. Validate with toy data - Test on known structure first
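
A combined sketch of steps 1, 3, and 4 on the toy data (steps 2, 5, and 6 are demonstrated above; get_lagged_dependencies is assumed here for computing lag functions, so check it against your Tigramite version):

# Pre-analysis sketch: inspect the data before any causal discovery
dataframe = pp.DataFrame(data_clean, var_names=['X', 'Y', 'Z'])

# 1. Plot the raw series
tp.plot_timeseries(dataframe=dataframe); plt.show()

# 3. Check linearity with pairwise scatterplots
tp.plot_scatterplots(dataframe=dataframe); plt.show()

# 4. Choose tau_max from where lagged dependencies decay
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr(), verbosity=0)
correlations = pcmci.get_lagged_dependencies(tau_max=20, val_only=True)['val_matrix']
tp.plot_lagfuncs(val_matrix=correlations, setup_args={'var_names': ['X', 'Y', 'Z']}); plt.show()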