Module 2.2: Choosing the Right Independence Test
What You'll Learn
- Why the test choice matters
- Decision tree for selecting the right test
- Practical examples of each test
- Speed vs. accuracy tradeoffs
Why Does the Test Matter?
The independence test determines HOW Tigramite decides whether two variables are (conditionally) dependent.
| Wrong Test | Problem |
|---|---|
| Linear test on nonlinear data | Misses relationships |
| Continuous test on categorical | Invalid results |
| Complex test on simple data | Wastes time |
Bottom line: Match your test to your data!
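To make the "continuous test on categorical" row concrete, here is a small sketch in plain NumPy (not Tigramite). It assumes three unordered categories stored as the integer codes 0, 1, 2; the variable names and the effect sizes are made up for illustration. The same data yields a correlation of opposite sign when the categories are relabeled, which is why a correlation-based test is not meaningful on arbitrary category codes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000

# Three unordered categories (say, machine types A, B, C) stored as codes 0, 1, 2
x = rng.integers(0, 3, size=n)

# Hypothetical effect of each category on Y: A -> 0, B -> 2, C -> 1
effect = np.array([0.0, 2.0, 1.0])
y = effect[x] + 0.1 * rng.standard_normal(n)

# Relabel the SAME categories with another, equally arbitrary coding:
# A -> 1, B -> 0, C -> 2
relabel = np.array([1, 0, 2])
x_relabel = relabel[x]

r_original = np.corrcoef(x, y)[0, 1]
r_relabel = np.corrcoef(x_relabel, y)[0, 1]

print(f"correlation with original coding:  {r_original:+.2f}")
print(f"correlation with relabeled coding: {r_relabel:+.2f}")
```

The dependence between category and Y is identical in both cases; only the labels changed. A test designed for discrete data works on category counts and is unaffected by such relabeling.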
The Decision Tree
Is your data CONTINUOUS?
│
├── YES: Are relationships LINEAR?
│   │
│   ├── YES: Is noise Gaussian?
│   │   ├── YES → ParCorr (fastest)
│   │   └── NO → RobustParCorr
│   │
│   └── NO (nonlinear):
│       ├── Additive nonlinear? → GPDC
│       └── General nonlinear? → CMIknn (most flexible)
│
└── NO (discrete/categorical):
    ├── Large sample (analytic chi-square null is reliable)? → Gsquared
    └── Smaller sample (permutation-based test needed)? → CMIsymb

Mixed continuous + discrete? → RegressionCI or CMIknnMixed
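The tree can also be written down as a small lookup function. This is only a sketch: `suggest_tests` and its arguments are hypothetical names, not part of Tigramite, and the function returns the candidate test names from the tree as strings rather than instantiating anything.

```python
def suggest_tests(data_kind, linear=True, gaussian_noise=True, additive=False):
    """Return candidate test names following the decision tree above.

    data_kind: "continuous", "discrete", or "mixed" (hypothetical labels).
    The remaining flags are only consulted for continuous data.
    """
    if data_kind == "mixed":
        return ["RegressionCI", "CMIknnMixed"]
    if data_kind == "discrete":
        # Both discrete tests are returned; the tree decides between them.
        return ["Gsquared", "CMIsymb"]
    if data_kind != "continuous":
        raise ValueError(f"unknown data_kind: {data_kind!r}")

    if linear:
        return ["ParCorr"] if gaussian_noise else ["RobustParCorr"]
    return ["GPDC"] if additive else ["CMIknn"]


print(suggest_tests("continuous"))                        # linear, Gaussian noise
print(suggest_tests("continuous", gaussian_noise=False))  # linear, non-Gaussian
print(suggest_tests("continuous", linear=False))          # general nonlinear
print(suggest_tests("discrete"))
```

In practice the flags come from looking at your data (scatterplots, histograms, domain knowledge), so the function is a memory aid, not an automatic selector.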
Test 1: ParCorr (Partial Correlation)
Use when: Linear relationships, Gaussian noise
Pros: Very fast, well-understood statistics
Cons: Misses nonlinear relationships
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr

# ParCorr is well suited to LINEAR data
np.random.seed(42)
T = 500
# Create linear relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn()  # Autocorrelation
    Y[t] = 0.5 * X[t-1] + np.random.randn()  # X causes Y (linear!)
data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])
# Use ParCorr
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)
print("ParCorr detected:")
pcmci.print_significant_links(p_matrix=results['p_matrix'],
                              val_matrix=results['val_matrix'],
                              alpha_level=0.05)
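To see what a partial correlation measures, the following sketch computes one by hand with NumPy: regress the conditioning variable out of both X and Y, then correlate the residuals. The helper `partial_corr` is an illustrative name, not a Tigramite function, and the data (a common driver Z feeding both X and Y) is synthetic.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out the columns of z."""
    design = np.column_stack([np.ones(len(x)), z])  # intercept + conditions
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)       # common driver
x = z + rng.standard_normal(n)   # X depends on Z only
y = z + rng.standard_normal(n)   # Y depends on Z only

raw = np.corrcoef(x, y)[0, 1]
partial = partial_corr(x, y, z)
print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {partial:.2f}")
```

X and Y look correlated until Z is accounted for. Conditioning away such indirect paths is what lets a causal discovery run separate direct links from spurious ones.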
Test 2: RobustParCorr
Use when: Linear relationships, but NON-Gaussian noise (heavy tails, skewed)
Pros: Handles non-Gaussian marginal distributions (heavy tails, skew)
Cons: Slightly slower than ParCorr
from tigramite.independence_tests.robust_parcorr import RobustParCorr
# RobustParCorr handles non-Gaussian marginals
np.random.seed(42)
T = 500
# Create linear relationships with EXPONENTIAL noise (non-Gaussian!)
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.exponential(1)  # Exponential noise!
    Y[t] = 0.5 * X[t-1] + np.random.exponential(1)
data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])
# Use RobustParCorr
robust = RobustParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=robust, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)
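The idea behind a robust partial correlation is to make the marginals look Gaussian before correlating. The sketch below illustrates this with a rank-based normal-scores transform using only NumPy and the standard library. It is an assumption-laden illustration of the general idea; Tigramite's own transformation may differ in detail, and `to_normal_scores` and `skewness` are hypothetical helper names.

```python
import numpy as np
from statistics import NormalDist

def to_normal_scores(x):
    """Map values to standard-normal quantiles based on their ranks."""
    n = len(x)
    ranks = np.argsort(np.argsort(x))   # ranks 0 .. n-1
    u = (ranks + 0.5) / n               # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(float(p)) for p in u])

def skewness(v):
    """Sample skewness (third standardized moment)."""
    centered = v - v.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=2000)     # strongly right-skewed
z = to_normal_scores(x)

print(f"skewness before: {skewness(x):.2f}")
print(f"skewness after:  {skewness(z):.2f}")
```

After the transform the skew is gone, so a correlation-based test computed on the transformed values is less distorted by a few extreme observations.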
Test 3: CMIknn (Conditional Mutual Information)
Use when: Nonlinear relationships, continuous data
Pros: Can detect general dependencies, including nonlinear ones
Cons: Slower, needs more data (T > 500)
from tigramite.independence_tests.cmiknn import CMIknn
# CMIknn catches NONLINEAR relationships
np.random.seed(42)
T = 1000 # Need more data for nonparametric tests
# Create NONLINEAR relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn() * 0.5
    Y[t] = np.sin(X[t-1]) + np.random.randn() * 0.3  # NONLINEAR: sin(X)!
data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])
# Use CMIknn
cmiknn = CMIknn(significance='shuffle_test', knn=0.1) # knn as fraction
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=cmiknn, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)
# CMIknn will detect the nonlinear relationship that ParCorr might miss!
Important: CMIknn uses significance='shuffle_test', which is slower but needed because the nearest-neighbor estimator has no analytic null distribution.
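The principle of a shuffle test can be sketched in a few lines of NumPy. This is a simplified stand-in, not Tigramite's implementation: it uses a histogram estimate of (unconditional) mutual information and a global shuffle, whereas CMIknn relies on a nearest-neighbor estimator and handles conditioning variables. `binned_mi` and `shuffle_pvalue` are hypothetical names.

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Histogram-based mutual information estimate (in nats)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def shuffle_pvalue(x, y, n_shuffles=200, seed=0):
    """Fraction of shuffled datasets with MI at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = binned_mi(x, y)
    null = np.array([binned_mi(rng.permutation(x), y) for _ in range(n_shuffles)])
    return (1 + np.sum(null >= observed)) / (1 + n_shuffles)

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=1000)
y = x**2 + 0.1 * rng.standard_normal(1000)   # nonlinear dependence

p_value = shuffle_pvalue(x, y)
print(f"linear correlation: {np.corrcoef(x, y)[0, 1]:+.2f}")
print(f"shuffle-test p-value: {p_value:.3f}")
```

Each shuffle requires re-estimating the dependence measure, which is where the extra runtime comes from: the cost grows with the number of shuffles times the cost of one estimate.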
Test 4: Gsquared (for Discrete Data)
Use when: Variables are categories (e.g., Low/Medium/High, On/Off)
Pros: Designed for categorical data
Cons: Only for discrete variables
from tigramite.independence_tests.gsquared import Gsquared
# Gsquared for categorical data
np.random.seed(42)
T = 1000
# Create categorical data (0, 1, 2 representing Low, Medium, High)
X = np.zeros(T, dtype=int)
Y = np.zeros(T, dtype=int)
for t in range(1, T):
    X[t] = np.random.choice([0, 1, 2])
    # Y depends on X from previous time step
    if X[t-1] == 0:
        Y[t] = np.random.choice([0, 1, 2], p=[0.7, 0.2, 0.1])
    elif X[t-1] == 1:
        Y[t] = np.random.choice([0, 1, 2], p=[0.2, 0.6, 0.2])
    else:
        Y[t] = np.random.choice([0, 1, 2], p=[0.1, 0.2, 0.7])
data = np.column_stack([X, Y]).astype(float)
dataframe = pp.DataFrame(data, var_names=['MachineState', 'Quality'])
# Use Gsquared
gsquared = Gsquared(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=gsquared, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)
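For intuition, the G-squared statistic can be computed by hand from a contingency table: G = 2 Σ O · ln(O / E), where O are observed counts and E the counts expected under independence. The sketch below covers only the unconditional case with NumPy; `g_squared` is a hypothetical helper, and the data mimics the example above (Y depends on X at lag 1).

```python
import numpy as np

def g_squared(x, y, n_levels=3):
    """Unconditional G-squared statistic and its degrees of freedom."""
    observed = np.zeros((n_levels, n_levels))
    np.add.at(observed, (x, y), 1)              # fill the contingency table
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True) / observed.sum())
    nonzero = observed > 0                      # empty cells contribute 0
    g = 2.0 * np.sum(observed[nonzero] * np.log(observed[nonzero] / expected[nonzero]))
    dof = (n_levels - 1) ** 2
    return g, dof

rng = np.random.default_rng(42)
T = 1000
x = rng.integers(0, 3, size=T)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7]])
y = np.zeros(T, dtype=int)
for t in range(1, T):
    y[t] = rng.choice(3, p=probs[x[t-1]])       # Y depends on X at lag 1

g, dof = g_squared(x[:-1], y[1:])               # pair X[t-1] with Y[t]
print(f"G-squared = {g:.1f} with {dof} degrees of freedom")
# Under independence G is approximately chi-square distributed;
# the 95% critical value for 4 degrees of freedom is about 9.49.
```

A value far above the critical value indicates dependence. The chi-square approximation is asymptotic, so it becomes unreliable when many cells of the table hold only a few counts.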
Quick Reference
| Test | Data | Relationships | Speed |
|---|---|---|---|
| ParCorr | Continuous | Linear | Fast |
| RobustParCorr | Continuous | Linear, non-Gaussian | Fast |
| CMIknn | Continuous | Nonlinear | Slow |
| GPDC | Continuous | Additive nonlinear | Medium |
| Gsquared | Discrete | Any | Fast |
| CMIsymb | Discrete | Any (permutation-based, suits smaller samples) | Medium |
| RegressionCI | Mixed | Linear + discrete | Medium |
How to Import Each Test
# Linear tests (fast)
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.independence_tests.robust_parcorr import RobustParCorr
# Nonlinear tests (flexible but slower)
from tigramite.independence_tests.cmiknn import CMIknn
from tigramite.independence_tests.gpdc import GPDC
# Discrete/categorical tests
from tigramite.independence_tests.gsquared import Gsquared
from tigramite.independence_tests.cmisymb import CMIsymb
# Mixed data
from tigramite.independence_tests.regressionCI import RegressionCI
Significance Methods
The significance parameter controls how p-values are computed:
- 'analytic' - Fast, uses mathematical formulas (ParCorr, Gsquared)
- 'shuffle_test' - Slower, uses permutation testing (CMIknn) - more flexible but computationally intensive
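As an example of what "uses mathematical formulas" means, a correlation can be turned into a p-value in closed form. The sketch below uses the Fisher z approximation, which is a standard large-sample formula chosen here for simplicity; Tigramite's own analytic computation may differ (for instance by using a t-distribution). `fisher_z_pvalue` is a hypothetical helper.

```python
import math
import numpy as np
from statistics import NormalDist

def fisher_z_pvalue(r, n, n_cond=0):
    """Two-sided p-value for a (partial) correlation r from n samples,
    with n_cond conditioning variables, via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - n_cond - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

rng = np.random.default_rng(42)
T = 500
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + rng.standard_normal()
    Y[t] = 0.5 * X[t-1] + rng.standard_normal()

r = np.corrcoef(X[:-1], Y[1:])[0, 1]      # correlation of X[t-1] with Y[t]
p_value = fisher_z_pvalue(r, n=T - 1)
print(f"r = {r:.2f}, analytic p-value = {p_value:.2g}")
```

No resampling is involved, so the cost is a single evaluation of the statistic. The price is that the formula is only valid when the test's distributional assumptions hold.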
Practical Tips
- Start with ParCorr - It's fast and works well for many real-world datasets
- Check linearity first - Use tp.plot_scatterplots() to visualize relationships
- Use CMIknn when unsure - It's the most flexible (but slowest)
- Need more data for nonparametric tests - CMIknn typically needs T > 500
- Match test to data type - Gsquared for discrete, ParCorr/CMIknn for continuous
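A numeric complement to eyeballing scatterplots is to compare how well a straight line and a low-order polynomial explain a lagged pair. This is a rough heuristic sketched with NumPy, not a Tigramite feature; `r_squared` is a hypothetical helper, and what counts as a "large" gain is a judgment call.

```python
import numpy as np

def r_squared(x, y, degree):
    """Fraction of variance in y explained by a polynomial fit of given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(0)
T = 1000
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + rng.standard_normal()
    Y[t] = X[t-1] ** 2 + 0.3 * rng.standard_normal()   # nonlinear link

x_lagged, y_now = X[:-1], Y[1:]
r2_linear = r_squared(x_lagged, y_now, degree=1)
r2_cubic = r_squared(x_lagged, y_now, degree=3)
print(f"linear R^2: {r2_linear:.2f}, cubic R^2: {r2_cubic:.2f}")
```

If the polynomial explains far more variance than the line, a linear test such as ParCorr is likely to understate or miss the link, and a nonlinear test is the safer choice.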