Module 2.1: Data Preparation Deep Dive
What You'll Learn
- DataFrame creation and options
- Handling missing values
- Masking data (analyzing subsets)
- Working with multiple datasets
- Data type specification (continuous vs. discrete)
Basic DataFrame Creation
The DataFrame class is Tigramite's data container. At minimum, it needs a numpy array of shape (T, N), where T is the number of time points and N is the number of variables.
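# Imports used throughout this module
import numpy as np
from tigramite import data_processing as pp
# Method 1: Minimal - just the data
data = np.random.randn(500, 3)  # 500 time points, 3 variables
df = pp.DataFrame(data)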
# Method 2: With variable names
df = pp.DataFrame(
data,
var_names=['Temperature', 'Pressure', 'Quality']
)
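Before handing an array to pp.DataFrame, it can help to confirm that time really runs along the first axis. The sketch below uses only NumPy; `check_time_series_array` is a hypothetical helper written for this illustration, not part of Tigramite.

```python
import numpy as np

def check_time_series_array(data, var_names=None):
    """Hypothetical helper (not part of Tigramite): validate a (T, N) array."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2:
        raise ValueError(f"expected a 2-D array of shape (T, N), got ndim={data.ndim}")
    T, N = data.shape
    if var_names is not None and len(var_names) != N:
        raise ValueError(f"{len(var_names)} names given for {N} variables")
    return T, N

rng = np.random.default_rng(0)
wide = rng.standard_normal((3, 500))   # variables in rows: the wrong way round
data = wide.T                          # transpose so that rows are time points

T, N = check_time_series_array(data, ['Temperature', 'Pressure', 'Quality'])
print(T, N)  # 500 3
```

A (3, 500) array would be accepted silently as 3 time points of 500 variables, so checking T and N by eye is a cheap safeguard.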
Handling Missing Values
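Real data often has missing values. Tigramite can handle this: tell it which value marks a gap, and time points where any variable is missing are left out of the analysis.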
Option 1: Mark with a flag value
# Data with missing values marked as -999
data_with_missing = np.random.randn(500, 3)
data_with_missing[100:110, 0] = -999 # Missing values
df = pp.DataFrame(
data_with_missing,
var_names=['Temp', 'Pressure', 'Quality'],
missing_flag=-999 # Tell Tigramite what marks missing
)
Option 2: Use NaN
# Using NaN for missing values
data_with_nan = np.random.randn(500, 3)
data_with_nan[100:110, 0] = np.nan
df = pp.DataFrame(
data_with_nan,
missing_flag=np.nan
)
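Since a gap in one variable removes the whole time point, it is worth counting how many time points are affected before running an analysis. The following is a minimal NumPy-only sketch; `count_missing_rows` is a hypothetical helper, not a Tigramite function.

```python
import numpy as np

def count_missing_rows(data, missing_flag):
    """Hypothetical helper: number of time points with at least one missing entry."""
    if isinstance(missing_flag, float) and np.isnan(missing_flag):
        is_missing = np.isnan(data)      # NaN never equals itself, so use isnan
    else:
        is_missing = data == missing_flag
    return int(is_missing.any(axis=1).sum())

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 3))
data[100:110, 0] = -999      # gap in variable 0
data[105:120, 2] = -999      # overlapping gap in variable 2

n_bad = count_missing_rows(data, missing_flag=-999)
print(n_bad, "of", data.shape[0], "time points contain a missing value")
```

In a lagged analysis the effective loss is typically larger than this count, because samples whose lagged values fall inside a gap are affected too.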
Masking Data: Analyze Subsets
Sometimes you want to analyze only certain time periods:
- Only winter months
- Only business hours
- Exclude anomalies
Use a mask array where:
- 0 = use this data point
- 1 = exclude this data point
# Analyze only first and last 100 points
data = np.random.randn(500, 3)
# Create mask (same shape as data)
mask = np.zeros(data.shape, dtype='int32')
mask[100:400, :] = 1 # Mask out the middle
df = pp.DataFrame(data, mask=mask)
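Masks do not have to be contiguous blocks. As an illustration of the "exclude anomalies" case, the sketch below builds a mask from a z-score rule using only NumPy. The threshold of 5 and the injected outlier are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 3))
data[250, 1] = 25.0                        # inject one obvious outlier

# z-score each variable, then flag time points where any variable is extreme
z = (data - data.mean(axis=0)) / data.std(axis=0)
is_anomaly = (np.abs(z) > 5).any(axis=1)   # threshold of 5 is an arbitrary choice

mask = np.zeros(data.shape, dtype='int32')
mask[is_anomaly, :] = 1                    # 1 = exclude, 0 = use

print("masked time points:", int(is_anomaly.sum()))
```

Note that in Tigramite the mask only takes effect when the conditional independence test is created with a mask_type (for example ParCorr(mask_type='y')); with the default mask_type=None the mask is ignored.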
Practical Example: Winter-Only Analysis
# 2 years of daily data
T = 730
data = np.random.randn(T, 3)
# Day of year (1-365)
day_of_year = np.tile(np.arange(1, 366), 2)[:T]
# Winter = Dec-Feb (days 335-365 and 1-59)
is_winter = ((day_of_year >= 335) | (day_of_year <= 59))
mask = np.zeros((T, 3), dtype='int32')
mask[~is_winter, :] = 1 # Mask non-winter
df_winter = pp.DataFrame(data, mask=mask)
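The day-of-year approach above assumes 365-day years. When real timestamps are available, the same mask can be derived from calendar months, which also copes with leap years. This is a sketch using NumPy's datetime64 type; the date range is an assumption chosen for illustration.

```python
import numpy as np

# Two years of daily timestamps; 2024 is a leap year, so T = 366 + 365
dates = np.arange('2024-01-01', '2026-01-01', dtype='datetime64[D]')
T = dates.size

# Extract the calendar month (1-12) from each date
months = dates.astype('datetime64[M]').astype(int) % 12 + 1
is_winter = np.isin(months, [12, 1, 2])

mask = np.zeros((T, 3), dtype='int32')
mask[~is_winter, :] = 1                  # mask everything outside Dec-Feb

print(T, int(is_winter.sum()))  # 731 181
```

The resulting mask can be passed to pp.DataFrame in the same way as in the example above.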
Multiple Datasets
Combine data from multiple sources (for example, different experiments or locations) by passing a dictionary of arrays that share the same variables.
# Data from 3 different sites
site_A = np.random.randn(200, 3)
site_B = np.random.randn(300, 3) # Different length OK!
site_C = np.random.randn(150, 3)
multi_data = {
0: site_A,
1: site_B,
2: site_C
}
df_multi = pp.DataFrame(
multi_data,
var_names=['Temp', 'Pressure', 'Quality'],
analysis_mode='multiple' # Important!
)
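The datasets may differ in length, but every array needs the same number of variables in the same column order. A quick consistency check can catch mismatches early; `summarize_datasets` below is a hypothetical helper for illustration and uses only NumPy.

```python
import numpy as np

def summarize_datasets(multi_data):
    """Hypothetical helper: check that every dataset has the same number of variables."""
    n_vars = {key: arr.shape[1] for key, arr in multi_data.items()}
    if len(set(n_vars.values())) != 1:
        raise ValueError(f"datasets disagree on the number of variables: {n_vars}")
    lengths = {key: arr.shape[0] for key, arr in multi_data.items()}
    return lengths, next(iter(n_vars.values()))

rng = np.random.default_rng(2)
multi_data = {
    0: rng.standard_normal((200, 3)),
    1: rng.standard_normal((300, 3)),
    2: rng.standard_normal((150, 3)),
}

lengths, N = summarize_datasets(multi_data)
print(lengths, N)  # {0: 200, 1: 300, 2: 150} 3
```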
Data Types: Continuous vs. Discrete
Tigramite can handle datasets that mix continuous and discrete variables.
# Temperature (continuous) and Machine State (discrete: 0, 1, 2)
temperature = np.random.randn(500)
pressure = np.random.randn(500)
machine_state = np.random.choice([0, 1, 2], size=500)
data = np.column_stack([temperature, pressure, machine_state])
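# Specify data types per sample (same shape as data): 0 = continuous, 1 = discrete
data_type = np.zeros(data.shape, dtype='int32')
data_type[:, 2] = 1  # Machine state (column 2) is discrete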
df = pp.DataFrame(data, data_type=data_type)
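Tigramite's documentation describes data_type as a binary array with the same shape as the data, so the type is given per sample rather than once per variable. The sketch below builds such an array with NumPy; `make_data_type` is a hypothetical convenience helper, not part of Tigramite.

```python
import numpy as np

def make_data_type(data, discrete_columns):
    """Hypothetical helper: 0/1 array shaped like data, with 1 marking discrete variables."""
    data_type = np.zeros(data.shape, dtype='int32')
    data_type[:, list(discrete_columns)] = 1
    return data_type

rng = np.random.default_rng(3)
temperature = rng.standard_normal(500)
pressure = rng.standard_normal(500)
machine_state = rng.choice([0, 1, 2], size=500)
data = np.column_stack([temperature, pressure, machine_state])

data_type = make_data_type(data, discrete_columns=[2])
print(data_type.shape, data_type[0])  # (500, 3) [0 0 1]
```

Keep in mind that np.column_stack converts the integer machine states to floats, so the data array itself no longer shows which columns are discrete; the data_type array carries that information.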
Always Plot First!
ALWAYS visualize your data before running causal discovery.
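import matplotlib.pyplot as plt
from tigramite import plotting as tp
# Plot time series (df is a DataFrame created as in the sections above)
tp.plot_timeseries(df, figsize=(12, 6))
plt.show()
# Scatter plots of variable pairs help check for linearity
tp.plot_scatterplots(dataframe=df)
plt.show()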
Summary: DataFrame Options
| Parameter | Purpose |
|---|---|
| data | Your numpy array (T, N) |
| var_names | Variable names |
| datatime | Time axis |
| missing_flag | Missing value marker |
| mask | Which points to exclude |
| data_type | 0=continuous, 1=discrete |
| analysis_mode | 'multiple' for multiple datasets |