Module 2.1: Data Preparation Deep Dive

Duration: 20 min. Prerequisites: Foundations

What You'll Learn

  1. DataFrame creation and options
  2. Handling missing values
  3. Masking data (analyzing subsets)
  4. Working with multiple datasets
  5. Data type specification (continuous vs. discrete)

Basic DataFrame Creation

The DataFrame class is Tigramite's data container. At minimum, it needs a numpy array of shape (T, N): T time points along the first axis and N variables along the second.

import numpy as np
from tigramite import data_processing as pp

# Method 1: Minimal - just the data
data = np.random.randn(500, 3)  # 500 time points, 3 variables
df = pp.DataFrame(data)

# Method 2: With variable names
df = pp.DataFrame(
    data,
    var_names=['Temperature', 'Pressure', 'Quality']
)
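
Because time has to run along the first axis, a quick shape check before wrapping the array can catch transposed inputs. The sketch below uses only numpy; to_time_by_variable is a hypothetical helper (not part of Tigramite), and it assumes you know the variable names in advance.

```python
import numpy as np

def to_time_by_variable(array, var_names):
    """Return a float array of shape (T, N) with N == len(var_names).

    Hypothetical helper, not part of Tigramite. A square array is
    ambiguous and is returned unchanged.
    """
    arr = np.asarray(array, dtype=float)
    if arr.ndim == 1:
        arr = arr[:, np.newaxis]  # a single variable becomes one column
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {arr.ndim} dimensions")
    n_vars = len(var_names)
    if arr.shape[1] == n_vars:
        return arr        # already (T, N)
    if arr.shape[0] == n_vars:
        return arr.T      # was (N, T), so transpose
    raise ValueError(f"shape {arr.shape} does not match {n_vars} variables")

raw = np.random.randn(3, 500)  # variables along the first axis by mistake
data = to_time_by_variable(raw, ['Temperature', 'Pressure', 'Quality'])
print(data.shape)  # (500, 3)
```

The returned array can then be passed to pp.DataFrame together with the same var_names.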

Handling Missing Values

Real data often has missing values. Tigramite can handle them, as long as you tell it how they are marked.

Option 1: Mark with a flag value

# Data with missing values marked as -999
data_with_missing = np.random.randn(500, 3)
data_with_missing[100:110, 0] = -999  # Missing values

df = pp.DataFrame(
    data_with_missing,
    var_names=['Temp', 'Pressure', 'Quality'],
    missing_flag=-999  # Tell Tigramite what marks missing
)

Option 2: Use NaN

# Using NaN for missing values
data_with_nan = np.random.randn(500, 3)
data_with_nan[100:110, 0] = np.nan

df = pp.DataFrame(
    data_with_nan,
    missing_flag=np.nan
)
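
Before choosing between the two options, it can help to count the gaps per variable and to confirm that a numeric flag does not collide with a real measurement. The following is a minimal numpy-only sketch; the sentinel value 999.0 is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 3))
data[100:110, 0] = np.nan  # ten missing values in the first variable

missing_flag = 999.0  # hypothetical sentinel chosen for this example

# A flag is only safe if no genuine measurement equals it
if np.any(data == missing_flag):
    raise ValueError("missing_flag collides with a real data value")

# Number of missing entries in each variable (column)
missing_per_variable = np.isnan(data).sum(axis=0)
print(missing_per_variable.tolist())  # [10, 0, 0]

# Convert from the NaN convention to the flag convention
flagged = np.where(np.isnan(data), missing_flag, data)
```

The flagged array would then be passed with missing_flag=999.0, as in Option 1.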

Masking Data: Analyze Subsets

Sometimes you want to analyze only certain time periods:

  • Only winter months
  • Only business hours
  • Exclude anomalies

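Use a mask array with the same shape as the data, where:

  • 0 = use this data point
  • 1 = exclude this data point

The mask only takes effect if the conditional independence test is created with a mask_type (for example mask_type='y'); with the default mask_type=None it is ignored.

# Analyze only first and last 100 points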
data = np.random.randn(500, 3)

# Create mask (same shape as data)
mask = np.zeros(data.shape, dtype='int32')
mask[100:400, :] = 1  # Mask out the middle

df = pp.DataFrame(data, mask=mask)
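
For the "exclude anomalies" case, the mask can be derived from the data itself. The sketch below flags outliers with a robust z-score (median and median absolute deviation) using numpy only; the cutoff of 6.0 is an assumption for illustration, not a Tigramite default.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 3))
data[250, 1] = 25.0  # inject one obvious anomaly

# Robust z-score per variable: less sensitive to the outliers themselves
median = np.median(data, axis=0)
mad = np.median(np.abs(data - median), axis=0)
robust_z = 0.6745 * (data - median) / mad

threshold = 6.0  # hypothetical cutoff; tune for your data
mask = (np.abs(robust_z) > threshold).astype('int32')  # 1 = exclude

print(mask[250, 1])  # 1
```

The resulting array has the same shape as data and follows the 0 = use, 1 = exclude convention, so it can be passed as mask.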

Practical Example: Winter-Only Analysis

# 2 years of daily data
T = 730
data = np.random.randn(T, 3)

# Day of year (1-365)
day_of_year = np.tile(np.arange(1, 366), 2)[:T]

# Winter = Dec-Feb (days 335-365 and 1-59)
is_winter = ((day_of_year >= 335) | (day_of_year <= 59))

mask = np.zeros((T, 3), dtype='int32')
mask[~is_winter, :] = 1  # Mask non-winter

df_winter = pp.DataFrame(data, mask=mask)
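
If real timestamps are available, the winter mask can be built from calendar months instead of day-of-year arithmetic, which also copes with leap years. This is a numpy-only sketch; the date range is made up for the example.

```python
import numpy as np

# Daily timestamps for two calendar years (hypothetical date range)
dates = np.arange('2021-01-01', '2023-01-01', dtype='datetime64[D]')
T = dates.shape[0]

# Month number 1..12: months since 1970-01, modulo 12
months = dates.astype('datetime64[M]').astype(int) % 12 + 1

# Winter = December, January, February
is_winter = np.isin(months, [12, 1, 2])

mask = np.zeros((T, 3), dtype='int32')
mask[~is_winter, :] = 1  # 1 = exclude non-winter days

print(T, int(is_winter.sum()))  # 730 180
```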

Multiple Datasets

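Combine data from multiple sources (different experiments, locations) by passing a dictionary of arrays. The datasets may differ in length, but they must contain the same variables in the same order.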

# Data from 3 different sites
site_A = np.random.randn(200, 3)
site_B = np.random.randn(300, 3)  # Different length OK!
site_C = np.random.randn(150, 3)

multi_data = {
    0: site_A,
    1: site_B,
    2: site_C
}

df_multi = pp.DataFrame(
    multi_data,
    var_names=['Temp', 'Pressure', 'Quality'],
    analysis_mode='multiple'  # Important!
)
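
When the per-site arrays come from different files, a small check that they all have the same number of variables avoids confusing errors later. build_multi_data below is a hypothetical helper written for this example (not part of Tigramite); it uses only numpy.

```python
import numpy as np

def build_multi_data(arrays):
    """Return {0: arr0, 1: arr1, ...} after checking that every array
    is 2-D and has the same number of columns (variables).

    Hypothetical helper, not part of Tigramite.
    """
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    for i, a in enumerate(arrays):
        if a.ndim != 2:
            raise ValueError(f"dataset {i} is not 2-D: shape {a.shape}")
    n_vars = {a.shape[1] for a in arrays}
    if len(n_vars) != 1:
        raise ValueError(f"datasets disagree on variable count: {sorted(n_vars)}")
    return dict(enumerate(arrays))

site_A = np.random.randn(200, 3)
site_B = np.random.randn(300, 3)  # different length is fine
site_C = np.random.randn(150, 3)

multi_data = build_multi_data([site_A, site_B, site_C])
print({key: arr.shape for key, arr in multi_data.items()})
# {0: (200, 3), 1: (300, 3), 2: (150, 3)}
```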

Data Types: Continuous vs. Discrete

Tigramite can handle datasets that mix continuous and discrete variables. Use data_type to mark which is which.

# Temperature (continuous) and Machine State (discrete: 0, 1, 2)
temperature = np.random.randn(500)
pressure = np.random.randn(500)
machine_state = np.random.choice([0, 1, 2], size=500)

data = np.column_stack([temperature, pressure, machine_state])

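# Specify data types: 0 = continuous, 1 = discrete.
# Recent Tigramite versions expect data_type to have the same shape as data.
data_type = np.zeros(data.shape, dtype='int32')
data_type[:, 2] = 1  # machine_state column is discrete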

df = pp.DataFrame(data, data_type=data_type)
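
With many variables, marking discrete columns by hand gets tedious. The sketch below guesses the type of each column with a simple heuristic (integer-valued with few distinct levels) and returns an array with the same shape as the data, which is assumed here to be the layout Tigramite expects. infer_data_type and max_levels are hypothetical names for this example, and the guess should always be reviewed manually.

```python
import numpy as np

def infer_data_type(data, max_levels=10):
    """Guess 0 (continuous) or 1 (discrete) for each column.

    Hypothetical heuristic, not part of Tigramite: a column counts as
    discrete if all its non-missing values are integers and it has at
    most max_levels distinct values.
    """
    data = np.asarray(data, dtype=float)
    data_type = np.zeros(data.shape, dtype='int32')
    for j in range(data.shape[1]):
        column = data[:, j]
        column = column[~np.isnan(column)]  # ignore missing entries
        is_integer_valued = np.all(column == np.round(column))
        few_levels = np.unique(column).size <= max_levels
        if is_integer_valued and few_levels:
            data_type[:, j] = 1
    return data_type

rng = np.random.default_rng(2)
temperature = rng.standard_normal(500)
pressure = rng.standard_normal(500)
machine_state = rng.choice([0, 1, 2], size=500)
data = np.column_stack([temperature, pressure, machine_state])

data_type = infer_data_type(data)
print(data_type[0].tolist())  # [0, 0, 1]
```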

Always Plot First!

ALWAYS visualize your data before running causal discovery.

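import matplotlib.pyplot as plt
from tigramite import plotting as tp

# Plot time series (df is a pp.DataFrame, such as the ones created above)
tp.plot_timeseries(df, figsize=(12, 6))
plt.show()

# Check pairwise relationships for linearity
tp.plot_scatterplots(dataframe=df)
plt.show()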

Summary: DataFrame Options

Parameter       Purpose
data            Your numpy array (T, N)
var_names       Variable names
datatime        Time axis
missing_flag    Missing value marker
mask            Which points to exclude
data_type       0 = continuous, 1 = discrete
analysis_mode   'multiple' for multiple datasets