Module 3.3: Real-World Data Challenges
What You'll Learn
- Handling missing data properly
- Bootstrap confidence intervals
- Multiple testing correction (FDR)
- Checking stationarity assumptions
- Common pitfalls and solutions
Setup
import numpy as np
import matplotlib.pyplot as plt
from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.toymodels import structural_causal_processes as toys
Challenge 1: Missing Data
Real data has gaps. Tigramite can exclude the affected samples automatically via the missing_flag parameter.
# Create data with missing values
np.random.seed(42)
def lin_f(x): return x
links = {
    0: [((0, -1), 0.7, lin_f)],
    1: [((1, -1), 0.6, lin_f), ((0, -1), 0.5, lin_f)],
    2: [((2, -1), 0.5, lin_f), ((1, -1), 0.4, lin_f)],
}
data, _ = toys.structural_causal_process(links, T=1000, seed=42)
# Introduce 10% missing values randomly
data_missing = data.copy()
missing_mask = np.random.random(data.shape) < 0.10
data_missing[missing_mask] = np.nan
print(f"Missing values: ~10%")
# Method: Use missing_flag parameter
dataframe = pp.DataFrame(
    data_missing,
    var_names=['X', 'Y', 'Z'],
    missing_flag=np.nan  # Tell Tigramite what marks missing values
)
# Run PCMCI - it will automatically handle missing values
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)
# Time slices containing missing values were excluded from each test
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)
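Note the cost: Tigramite drops every time slice where any variable is missing (and, for lagged tests, nearby slices too), so the effective sample size shrinks faster than the raw missing rate suggests. A quick plain-NumPy way to gauge this:
# With 10% of entries missing across 3 variables, roughly 1 - 0.9^3 ≈ 27%
# of time slices contain at least one NaN and are dropped from the tests.
frac_affected = np.isnan(data_missing).any(axis=1).mean()
print(f"Time slices affected by missingness: {frac_affected:.1%}")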
Challenge 2: Multiple Testing Correction
With many variables and lags, we perform MANY tests. This inflates false positives.
Solution: False Discovery Rate (FDR) correction
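To see the scale of the problem, count the tests: with N variables and lags 1 to tau_max there are N * N * tau_max candidate lagged links, each tested at level alpha. A back-of-the-envelope check for the setup below:
# 3 variables and tau_max = 5 give 3 * 3 * 5 = 45 candidate lagged links.
# At alpha = 0.05 that means about 2 expected false positives
# even if NO true links exist.
n_tests = 3 * 3 * 5
print(f"Candidate links: {n_tests}, expected false positives: {n_tests * 0.05:.2f}")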
# Create clean data
data_clean, _ = toys.structural_causal_process(links, T=1000, seed=42)
dataframe = pp.DataFrame(data_clean, var_names=['X', 'Y', 'Z'])
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=5, pc_alpha=0.05)
# Without correction
print("WITHOUT FDR correction (alpha=0.05):")
pcmci.print_significant_links(
    p_matrix=results['p_matrix'],
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)
# With FDR correction
q_matrix = pcmci.get_corrected_pvalues(
    p_matrix=results['p_matrix'],
    tau_max=5,
    fdr_method='fdr_bh'  # Benjamini-Hochberg method
)
print("\nWITH FDR correction (q=0.05):")
pcmci.print_significant_links(
    p_matrix=q_matrix,
    val_matrix=results['val_matrix'],
    alpha_level=0.05
)
# FDR correction controls the expected share of false positives among the reported links
Tip: Always apply FDR correction when analyzing data with many variables or large tau_max!
Challenge 3: Confidence in Results (Bootstrap)
How confident are we in the discovered links? Use bootstrap!
# Bootstrap confidence intervals
parcorr_boot = ParCorr(
    significance='analytic',
    confidence='bootstrap',  # Enable bootstrap
    conf_lev=0.95,           # 95% confidence interval
    conf_samples=100         # Number of bootstrap samples
)
pcmci_boot = PCMCI(dataframe=dataframe, cond_ind_test=parcorr_boot, verbosity=0)
results_boot = pcmci_boot.run_pcmci(tau_max=3, pc_alpha=0.05)
# The results now include a conf_matrix with a (lower, upper) bound per link,
# quantifying the uncertainty in each estimated link strength
print("Bootstrap confidence intervals for link strengths:")
var_names = ['X', 'Y', 'Z']
for i in range(3):            # cause variable
    for j in range(3):        # effect variable
        for tau in range(4):  # lags 0..tau_max
            # graph[i, j, tau] == '-->' marks a link from var i at lag tau to var j
            if results_boot['graph'][i, j, tau] == '-->':
                val = results_boot['val_matrix'][i, j, tau]
                print(f"  {var_names[i]}(t-{tau}) → {var_names[j]}: {val:.3f}")
Challenge 4: Checking Stationarity
PCMCI assumes the causal structure is stationary (doesn't change over time).
Quick check: Does running PCMCI on different time windows give similar results?
# Check stationarity by comparing two halves of the data
data_first_half = data_clean[:500]
data_second_half = data_clean[500:]
# Run on first half
df1 = pp.DataFrame(data_first_half, var_names=['X', 'Y', 'Z'])
pcmci1 = PCMCI(dataframe=df1, cond_ind_test=ParCorr(), verbosity=0)
results1 = pcmci1.run_pcmci(tau_max=3, pc_alpha=0.05)
# Run on second half
df2 = pp.DataFrame(data_second_half, var_names=['X', 'Y', 'Z'])
pcmci2 = PCMCI(dataframe=df2, cond_ind_test=ParCorr(), verbosity=0)
results2 = pcmci2.run_pcmci(tau_max=3, pc_alpha=0.05)
# Compare graphs
print("First half of data:")
pcmci1.print_significant_links(
    p_matrix=results1['p_matrix'],
    val_matrix=results1['val_matrix'],
    alpha_level=0.01
)
print("\nSecond half of data:")
pcmci2.print_significant_links(
    p_matrix=results2['p_matrix'],
    val_matrix=results2['val_matrix'],
    alpha_level=0.01
)
# If results are similar → data is likely stationary
# If results differ → consider RPCMCI for regime changes
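As a complementary univariate check, you can run a standard unit-root test on each series. This sketch assumes statsmodels is installed; note that it checks mean-stationarity of each individual variable, not stability of the causal graph itself:
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test per variable: a small p-value rejects a
# unit root, consistent with the series being stationary in mean.
for i, name in enumerate(['X', 'Y', 'Z']):
    p_value = adfuller(data_clean[:, i])[1]  # second element is the p-value
    print(f"ADF p-value for {name}: {p_value:.3f}")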
Common Pitfalls & Solutions
| Pitfall | Symptom | Solution |
|---|---|---|
| Too few samples | Noisy/unstable results | Need T > 500 typically |
| Wrong test choice | Missing links | Check linearity, use CMIknn if unsure |
| tau_max too small | Missing lagged effects | Use lag function plot to choose |
| tau_max too large | Too many tests, false positives | Use FDR correction |
| Hidden confounders | Spurious links | Try LPCMCI |
| Non-stationarity | Inconsistent results | Use RPCMCI or window analysis |
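For the tau_max rows above, the lag function plot is produced like this (the pattern from the Tigramite tutorials, reusing the pcmci object from Challenge 2); pick tau_max around the lag beyond which the lag functions have decayed:
# Estimate lagged dependencies up to a generous maximum lag and plot them.
correlations = pcmci.get_lagged_dependencies(tau_max=10, val_only=True)['val_matrix']
tp.plot_lagfuncs(val_matrix=correlations, setup_args={'var_names': ['X', 'Y', 'Z']})
plt.show()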
Pre-Analysis Checklist
Before running PCMCI, always:
- Plot your data - tp.plot_timeseries()
- Check for missing values - Handle with missing_flag
- Check linearity - tp.plot_scatterplots()
- Choose tau_max - tp.plot_lagfuncs()
- Check stationarity - Compare different time windows
- Apply FDR correction - pcmci.get_corrected_pvalues()
- Validate with toy data - Test on known structure first
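Two of the plotting steps take only a few lines on the clean dataframe from Challenge 2 (plot_timeseries and plot_scatterplots are Tigramite's built-in plotting helpers):
# Inspect the raw series for trends, jumps, and outliers
tp.plot_timeseries(dataframe)
plt.show()

# Roughly linear scatterplots justify sticking with ParCorr
tp.plot_scatterplots(dataframe=dataframe)
plt.show()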