Module 2.1: Data Preparation Deep Dive

20 min | Prerequisites: Foundations

What You'll Learn

DataFrame creation and options
Handling missing values
Masking data (analyzing subsets)
Working with multiple datasets
Data type specification (continuous vs. discrete)

Basic DataFrame Creation

The DataFrame class is Tigramite's data container. At minimum, it needs a NumPy array of shape (T, N), with T time steps in rows and N variables in columns.

import numpy as np
from tigramite import data_processing as pp

# Method 1: Minimal - just the data
data = np.random.randn(500, 3)  # 500 time points, 3 variables
df = pp.DataFrame(data)

# Method 2: With variable names
df = pp.DataFrame(
    data,
    var_names=['Temperature', 'Pressure', 'Quality']
)
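
A common stumbling block is data stored variables-first, with shape (N, T). The sketch below checks the layout before the array is handed to the DataFrame. The helper `to_time_major` is hypothetical and not part of Tigramite, and it assumes there are more time points than variables.

```python
import numpy as np

def to_time_major(array):
    """Return a 2-D float array with time along axis 0.

    Hypothetical helper, not part of Tigramite. Assumes T > N, so an
    array with more columns than rows is treated as (N, T) and transposed.
    """
    array = np.asarray(array, dtype=float)
    if array.ndim == 1:
        array = array[:, None]  # a single variable becomes shape (T, 1)
    if array.ndim != 2:
        raise ValueError(f"expected a 1-D or 2-D array, got {array.ndim} dimensions")
    rows, cols = array.shape
    return array.T if cols > rows else array

raw = np.random.randn(3, 500)   # variables-first layout
data = to_time_major(raw)
print(data.shape)               # (500, 3)
```

The result can then be passed to pp.DataFrame(data) as in Method 1.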

Handling Missing Values

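Real data often has missing values. Tigramite handles them by dropping the affected samples rather than imputing values, so you only need to tell it how missing entries are marked.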

Option 1: Mark with a flag value

# Data with missing values marked as -999
data_with_missing = np.random.randn(500, 3)
data_with_missing[100:110, 0] = -999  # Missing values

df = pp.DataFrame(
    data_with_missing,
    var_names=['Temp', 'Pressure', 'Quality'],
    missing_flag=-999  # Tell Tigramite what marks missing
)

Option 2: Use NaN

# Using NaN for missing values
data_with_nan = np.random.randn(500, 3)
data_with_nan[100:110, 0] = np.nan

df = pp.DataFrame(
    data_with_nan,
    missing_flag=np.nan
)
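
Before relying on either option, it helps to know how much data is actually missing. The following is a minimal NumPy-only sketch, with illustrative variable names, that counts flagged entries per variable and the number of time steps with at least one gap.

```python
import numpy as np

rng = np.random.default_rng(0)
data_with_missing = rng.standard_normal((500, 3))
data_with_missing[100:110, 0] = -999   # gap in variable 0
data_with_missing[105:120, 2] = -999   # overlapping gap in variable 2

missing_flag = -999
is_missing = data_with_missing == missing_flag

# Flagged entries in each column
missing_per_variable = is_missing.sum(axis=0)
# Time steps where at least one variable is missing
steps_with_any_gap = int(is_missing.any(axis=1).sum())

print(missing_per_variable)   # [10  0 15]
print(steps_with_any_gap)     # 20
```

For NaN-coded data, replace the comparison with np.isnan(data_with_nan). Because lagged values are also needed for each sample, the number of samples that end up usable may be lower than T minus the count above.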

Masking Data: Analyze Subsets

Sometimes you want to analyze only certain time periods:

Only winter months
Only business hours
Exclude anomalies

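Use a mask array of the same shape as the data. The mask only takes effect if the conditional independence test is created with a `mask_type` (for example `mask_type='y'`); with the default `mask_type=None` it is ignored. In the mask: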

0 = use this data point
1 = exclude this data point

# Analyze only first and last 100 points
data = np.random.randn(500, 3)

# Create mask (same shape as data)
mask = np.zeros(data.shape, dtype='int32')
mask[100:400, :] = 1  # Mask out the middle

df = pp.DataFrame(data, mask=mask)

Practical Example: Winter-Only Analysis

# 2 years of daily data
T = 730
data = np.random.randn(T, 3)

# Day of year (1-365)
day_of_year = np.tile(np.arange(1, 366), 2)[:T]

# Winter = Dec-Feb (days 335-365 and 1-59)
is_winter = ((day_of_year >= 335) | (day_of_year <= 59))

mask = np.zeros((T, 3), dtype='int32')
mask[~is_winter, :] = 1  # Mask non-winter

df_winter = pp.DataFrame(data, mask=mask)
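
The list above also mentions business hours and anomalies. These can be combined in one mask, as in the sketch below. It assumes hourly data starting at midnight, a 9:00 to 17:00 working day, and a simple threshold of 6 standard deviations for anomalies; all three are illustrative choices, not Tigramite defaults.

```python
import numpy as np

rng = np.random.default_rng(42)
T, N = 24 * 14, 3                     # two weeks of hourly data
data = rng.standard_normal((T, N))
data[58, 1] = 12.0                    # inject one obvious outlier at 10:00 on day 3

hour_of_day = np.arange(T) % 24       # assumes the series starts at midnight
is_business_hour = (hour_of_day >= 9) & (hour_of_day < 17)

# Flag single entries lying more than 6 standard deviations from their column mean
z_scores = (data - data.mean(axis=0)) / data.std(axis=0)
is_anomaly = np.abs(z_scores) > 6

# 1 = exclude: all variables outside business hours, plus individual anomalous entries
mask = np.zeros((T, N), dtype='int32')
mask[~is_business_hour, :] = 1
mask[is_anomaly] = 1

kept_per_variable = (mask == 0).sum(axis=0)
print(kept_per_variable)
```

The resulting array is passed in the same way as before, pp.DataFrame(data, mask=mask). Checking kept_per_variable first shows whether enough samples remain for a meaningful analysis.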

Multiple Datasets

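Combine data from multiple sources (different experiments, locations) into one analysis. Pass a dictionary of arrays; every array must contain the same N variables, but the number of time points may differ.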

# Data from 3 different sites
site_A = np.random.randn(200, 3)
site_B = np.random.randn(300, 3)  # Different length OK!
site_C = np.random.randn(150, 3)

multi_data = {
    0: site_A,
    1: site_B,
    2: site_C
}

df_multi = pp.DataFrame(
    multi_data,
    var_names=['Temp', 'Pressure', 'Quality'],
    analysis_mode='multiple'  # Important!
)
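
Since every dataset has to share the same variables, a quick consistency check before building the DataFrame can catch a mismatched site early. This is a minimal sketch; `check_multiple_datasets` is a hypothetical helper, not a Tigramite function.

```python
import numpy as np

def check_multiple_datasets(datasets):
    """Validate a dict of (T_m, N) arrays.

    Hypothetical helper, not part of Tigramite. Returns the shared number
    of variables N and the total number of time points across datasets.
    """
    shapes = {key: np.asarray(arr).shape for key, arr in datasets.items()}
    for key, shape in shapes.items():
        if len(shape) != 2:
            raise ValueError(f"dataset {key!r} must be 2-D, got shape {shape}")
    n_vars = {shape[1] for shape in shapes.values()}
    if len(n_vars) != 1:
        raise ValueError(f"datasets disagree on the number of variables: {shapes}")
    return n_vars.pop(), sum(shape[0] for shape in shapes.values())

multi_data = {
    0: np.random.randn(200, 3),
    1: np.random.randn(300, 3),
    2: np.random.randn(150, 3),
}
n_variables, total_points = check_multiple_datasets(multi_data)
print(n_variables, total_points)   # 3 650
```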

Data Types: Continuous vs. Discrete

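Tigramite can handle mixed data, where some variables are continuous and others discrete. Mark each variable's type with a `data_type` array; conditional independence tests designed for mixed data can then treat the two kinds differently.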

# Temperature (continuous) and Machine State (discrete: 0, 1, 2)
temperature = np.random.randn(500)
pressure = np.random.randn(500)
machine_state = np.random.choice([0, 1, 2], size=500)

data = np.column_stack([temperature, pressure, machine_state])

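# Specify data types per sample: 0 = continuous, 1 = discrete.
# data_type must have the same shape as data, not one entry per variable.
data_type = np.zeros(data.shape, dtype='int')
data_type[:, 2] = 1  # machine_state is discrete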

df = pp.DataFrame(data, data_type=data_type)
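
Tigramite's documentation describes `data_type` as an array with the same shape as the data. If it is more convenient to declare types once per variable, the flags can be expanded to that shape, as sketched below. The names `per_variable` and `discrete_columns` are illustrative, and the whole-number check is an extra sanity test, not something Tigramite requires.

```python
import numpy as np

rng = np.random.default_rng(1)
temperature = rng.standard_normal(500)
pressure = rng.standard_normal(500)
machine_state = rng.choice([0, 1, 2], size=500)
data = np.column_stack([temperature, pressure, machine_state])

# One flag per variable: 0 = continuous, 1 = discrete
per_variable = np.array([0, 0, 1])

# Repeat the flags for every time step so the result has shape (T, N)
data_type = np.tile(per_variable, (data.shape[0], 1))

# Sanity check: columns declared discrete should hold only whole numbers
discrete_columns = np.flatnonzero(per_variable == 1)
for col in discrete_columns:
    assert np.all(data[:, col] == np.round(data[:, col])), f"column {col} is not integer-valued"

print(data_type.shape)         # (500, 3)
print(data_type.sum(axis=0))   # [  0   0 500]
```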

Always Plot First!

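ALWAYS visualize your data before running causal discovery. Plots reveal trends, outliers, and nonlinear dependencies that influence which preprocessing steps and independence test are appropriate.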

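import matplotlib.pyplot as plt
from tigramite import plotting as tp

# Plot time series (df is any DataFrame created above)
tp.plot_timeseries(df, figsize=(12, 6))
plt.show()

# Pairwise scatter plots: check whether dependencies look linear
tp.plot_scatterplots(dataframe=df)
plt.show()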

Summary: DataFrame Options

Parameter	Purpose
`data`	Your numpy array (T, N)
`var_names`	Variable names
`datatime`	Time axis
`missing_flag`	Missing value marker
`mask`	Which points to exclude (1 = exclude), same shape as `data`
`data_type`	0=continuous, 1=discrete, same shape as `data`
`analysis_mode`	`'multiple'` for multiple datasets
