Tutorials and Getting Started
AI-generated content
This page was generated by AI and has not been fully reviewed by a human. Content may be inaccurate or incomplete. If you find any issues, please create an issue on the GitHub repository.
Step-by-step guides to master CORR-Vars for clinical research
Learning Path Overview
This section provides hands-on tutorials to help you get started with CORR-Vars and accomplish common research tasks. Each tutorial builds on the previous one, taking you from basic concepts to advanced research workflows.
Getting Started (⏱️ 30 minutes): Create your first cohort, add variables, and generate summary statistics
Advanced Variable Creation (⏱️ 45 minutes): Custom aggregations, time-specific windows, and derived calculations
Multi-Source Analysis (⏱️ 20 minutes): Combine CUB-HDP and reprodICU for comprehensive international studies
Working with Time-Series Data (⏱️ 35 minutes): Advanced time-series analysis and dynamic variable patterns
Getting Started - Your First Analysis
🎯 Tutorial Goals
What you'll learn: Create your first cohort, add variables, apply inclusion criteria, and generate publication-ready summary tables.
Time required: ~30 minutes
Prerequisites: Access to the IMI server or a local CORR-Vars installation
Your First Research Journey
In this tutorial, we'll build an ICU cohort studying hypernatremia (elevated sodium levels) and generate a summary table comparing patient characteristics by sex. This mirrors real clinical research workflows used in critical care studies.
Step 1: Environment Setup
Connect to the server:
Install the VS Code Remote-SSH extension
Connect to: s-c01-imi-app01.charite.de
Activate the CORR-Vars environment:
# Activate the pre-installed CORR-Vars environment
conda activate /data02/projects/icurepo/.pkg/env10
Access Issues?
If conda activation fails, ask Patrick Heeren to add you to the miniconda-users group.
Install CORR-Vars locally:
# Clone and install
git clone https://github.com/cub-corr/corr-vars.git
cd corr-vars
pip install -e .
⚠️ Note
Local installation requires GitHub access and won't have access to the CUB-HDP database.
Start Jupyter: In VS Code, create a new .ipynb file and select the CORR-Vars kernel.
Step 2: Basic Cohort Creation
Understanding the Code
We'll start with a minimal cohort to understand the basics before adding complexity.
# Import the main CORR-Vars class
from corr_vars import Cohort
# Create your first cohort - start small for learning
cohort = Cohort(
    obs_level="icu_stay",  # Each row = one ICU admission
    load_default_vars=False,  # Start clean (we'll add variables manually)
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,  # Uses ~/password.txt for authentication
        }
    },
)
print(f"Created cohort with {len(cohort.obs):,} ICU stays")
print(f"Available columns: {list(cohort.obs.columns)}")
Step 3: Exploring Your Cohort
Data Exploration Best Practice
Always explore your data first. Understanding the scope and characteristics of your cohort is crucial before analysis.
# Look at the first few rows to understand the data structure
print("First 5 rows:")
print(cohort.obs.head())
print("\n" + "=" * 50)
# Check the temporal scope of your data
print("📅 Temporal coverage:")
print(f"   Earliest admission: {cohort.obs['icu_admission'].min()}")
print(f"   Latest admission: {cohort.obs['icu_admission'].max()}")
# Basic cohort statistics
print("\nCohort statistics:")
print(f"   Total ICU stays: {len(cohort.obs):,}")
print(f"   Unique patients: {cohort.obs['patient_id'].n_unique():,}")
print(f"   Unique hospital cases: {cohort.obs['case_id'].n_unique():,}")
# Calculate a rough readmission rate:
# ICU stays beyond each patient's first, per 100 unique patients
total_stays = len(cohort.obs)
unique_patients = cohort.obs['patient_id'].n_unique()
readmission_rate = (total_stays - unique_patients) / unique_patients * 100
print(f"   Readmission rate: {readmission_rate:.1f}%")
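The rate computed above divides the stays beyond each patient's first by the number of unique patients. That is not the same as the share of patients who were readmitted at least once. The following plain-Python sketch contrasts the two on made-up patient IDs (hypothetical data, not CORR-Vars output):

```python
from collections import Counter

# Hypothetical patient ID per ICU stay: 5 stays, 3 unique patients.
stay_patient_ids = ["p1", "p1", "p1", "p2", "p3"]

total_stays = len(stay_patient_ids)
unique_patients = len(set(stay_patient_ids))

# Formula used in the tutorial: extra stays per 100 unique patients.
extra_stays_per_100_patients = (total_stays - unique_patients) / unique_patients * 100

# Alternative definition: percentage of patients with more than one stay.
stays_per_patient = Counter(stay_patient_ids)
readmitted_patients = sum(1 for n in stays_per_patient.values() if n > 1)
share_readmitted = readmitted_patients / unique_patients * 100

print(f"Extra stays per 100 patients: {extra_stays_per_100_patients:.1f}")
print(f"Patients readmitted at least once: {share_readmitted:.1f}%")
```

When a few patients account for many stays, the two numbers diverge, so it is worth stating which definition a report uses.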
Step 4: Adding Your First Variables
🧩 Variable Types in CORR-Vars
Static variables (cohort.obs): One value per ICU stay (age, sex, outcome)
Dynamic variables (cohort.obsm): Time-series data (lab values, vital signs)
# Add static (single value per ICU stay) variables
print("Adding static variables...")
cohort.add_variable("hospital_length_of_stay")  # Days in hospital
cohort.add_variable("campus_id")  # Which hospital campus
# Add a dynamic (time-series) variable - this will be key for our analysis!
print("Adding dynamic variables...")
cohort.add_variable("blood_sodium")  # Sodium lab values over time
print("\n✅ Variables added successfully!")
print(f"Static variables in cohort.obs: {len(cohort.obs.columns)} columns")
print(f"Dynamic variables in cohort.obsm: {list(cohort.obsm.keys())}")
# Quick look at our sodium data
sodium_data = cohort.obsm['blood_sodium']
print(f"\n🧪 Sodium measurements: {len(sodium_data):,} total records")
print(f"   Patients with sodium data: {sodium_data['icu_stay_id'].n_unique():,}")
print(f"   Value range: {sodium_data['value'].min():.1f} - {sodium_data['value'].max():.1f} mmol/L")
Understanding Dynamic Variables
Dynamic variables contain multiple measurements per patient over time:
# Peek at sodium data structure
print(sodium_data.head())
# Shows: icu_stay_id, recordtime, value columns
This time-series structure enables powerful temporal analyses like calculating trends, clearances, and time-to-event outcomes.
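To make the long-format layout concrete, here is a small plain-Python sketch. It is not CORR-Vars code: it only reuses the column names shown above (icu_stay_id, recordtime, value), and the IDs, timestamps, and sodium values are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical long-format records: one row per measurement, several rows per stay.
records = [
    {"icu_stay_id": 1, "recordtime": datetime(2024, 1, 1, 8, 0), "value": 141.0},
    {"icu_stay_id": 1, "recordtime": datetime(2024, 1, 1, 14, 0), "value": 147.5},
    {"icu_stay_id": 2, "recordtime": datetime(2024, 1, 2, 9, 30), "value": 138.0},
]

# Group measurements by stay to see the "many rows per ICU stay" shape.
by_stay = defaultdict(list)
for row in records:
    by_stay[row["icu_stay_id"]].append((row["recordtime"], row["value"]))

measurements_per_stay = {stay: len(rows) for stay, rows in by_stay.items()}
print(measurements_per_stay)
```

A static variable would hold exactly one value per key of `by_stay`; a dynamic variable keeps the full list.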
Step 5: Examining the Data
Data Quality Check
Always examine your data before analysis to understand completeness and identify potential issues.
# Examine static data (one row per ICU stay)
# Note: Basic demographics are automatically included
print("Static data sample (first 5 patients):")
static_sample = cohort.obs.select([
    "icu_stay_id", "age_on_admission", "sex",
    "inhospital_death", "hospital_length_of_stay", "campus_id"
]).head()
print(static_sample)
print("\n" + "=" * 60)
# Examine dynamic data (time-series)
print("🧪 Sodium measurements overview:")
sodium_data = cohort.obsm['blood_sodium']
print(f"   Total measurements: {len(sodium_data):,}")
print(f"   Date range: {sodium_data['recordtime'].min()} to {sodium_data['recordtime'].max()}")
print("\nSample sodium data:")
print(sodium_data.head())
print("\n" + "=" * 60)
# Data completeness assessment
print("Data completeness check:")
total_patients = len(cohort.obs)
for label, column in [
    ("Age", "age_on_admission"),
    ("Sex", "sex"),
    ("Hospital LOS", "hospital_length_of_stay"),
]:
    n_missing = cohort.obs[column].null_count()
    print(f"   {label} missing: {n_missing:,}/{total_patients:,} ({n_missing/total_patients*100:.1f}%)")
# Sodium data availability
patients_with_sodium = sodium_data['icu_stay_id'].n_unique()
print(f"   Patients with sodium: {patients_with_sodium:,}/{total_patients:,} ({patients_with_sodium/total_patients*100:.1f}%)")
Step 6: Creating Aggregated Variables
🎯 Aggregated Variables Concept
Transform time-series → single values: Convert multiple sodium measurements per patient into clinically meaningful summary statistics (first value, maximum, trends, etc.)
# Import the aggregation tools
from corr_vars.sources.aggregation import NativeStatic
print("Creating aggregated sodium variables...")
# Variable 1: First sodium measurement (baseline value)
first_sodium = NativeStatic(
    var_name="first_sodium_value",
    select="!first value",  # Get the first measurement
    base_var="blood_sodium",  # From the sodium time-series
    tmin="icu_admission",  # Start from ICU admission
    tmax="icu_discharge",  # End at ICU discharge
)
cohort.add_variable(first_sodium)
# Variable 2: Maximum sodium during ICU stay (peak value)
max_sodium = NativeStatic(
    var_name="max_sodium_icu",
    select="!max value",  # Get the maximum measurement
    base_var="blood_sodium",
    tmin="icu_admission",
    tmax="icu_discharge",
)
cohort.add_variable(max_sodium)
print("✅ Aggregated variables created successfully!")
# Examine the results
print("\nSample of aggregated sodium data:")
sodium_summary = cohort.obs.select([
    "icu_stay_id", "first_sodium_value", "max_sodium_icu"
]).head()
print(sodium_summary)
# Quick statistics
print("\nSummary statistics:")
print(f"   First sodium - Mean: {cohort.obs['first_sodium_value'].mean():.1f} mmol/L")
print(f"   Max sodium - Mean: {cohort.obs['max_sodium_icu'].mean():.1f} mmol/L")
print(f"   Patients with data: {cohort.obs['first_sodium_value'].drop_nulls().len():,}")
🧪 Understanding Aggregation Syntax
CORR-Vars aggregation syntax follows this pattern:
select="!first value" → Get first measurement
select="!max value" → Get maximum measurement
select="!mean value" → Get average
tmin/tmax → Define time window for aggregation
This syntax lets you express clinical variable definitions in short, readable code.
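As a mental model of what these selectors compute, the sketch below reduces a list of (recordtime, value) pairs to a single number inside a time window. It is a plain-Python illustration, not the CORR-Vars implementation: the `aggregate` helper, the inclusive window boundaries, and the sample data are all assumptions made for this example.

```python
from datetime import datetime
from statistics import mean


def aggregate(measurements, how, tmin, tmax):
    """Reduce (recordtime, value) pairs inside [tmin, tmax] to one value."""
    # Keep only measurements inside the window, ordered by time.
    in_window = sorted((t, v) for t, v in measurements if tmin <= t <= tmax)
    if not in_window:
        return None  # no measurement in the window -> missing value
    values = [v for _, v in in_window]
    if how == "first":
        return values[0]
    if how == "max":
        return max(values)
    if how == "mean":
        return mean(values)
    raise ValueError(f"unsupported aggregation: {how}")


# Hypothetical sodium series for one ICU stay.
icu_admission = datetime(2024, 1, 1, 8, 0)
icu_discharge = datetime(2024, 1, 3, 8, 0)
sodium = [
    (datetime(2024, 1, 1, 7, 0), 150.0),   # before admission, ignored
    (datetime(2024, 1, 1, 9, 0), 140.0),
    (datetime(2024, 1, 2, 9, 0), 148.0),
    (datetime(2024, 1, 2, 21, 0), 144.0),
]

first_value = aggregate(sodium, "first", icu_admission, icu_discharge)
max_value = aggregate(sodium, "max", icu_admission, icu_discharge)
mean_value = aggregate(sodium, "mean", icu_admission, icu_discharge)
print(first_value, max_value, mean_value)
```

Note how the pre-admission value of 150.0 never reaches the result: the window filter runs before the aggregation.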
Step 7: Setting Eligibility and Inclusion Criteria
🎯 Clinical Research Focus: Hypernatremia
Hypernatremia (sodium >145 mmol/L) is a common electrolyte disorder in ICU patients, associated with increased mortality. We'll study patients who develop this condition.
# Create a time anchor: when did hypernatremia first occur?
print("Identifying hypernatremia onset...")
hypernatremia_onset = NativeStatic(
    var_name="hypernatremia_onset_time",
    select="!first recordtime",  # Get the TIME of first occurrence
    base_var="blood_sodium",
    where="value > 145",  # Clinical threshold for hypernatremia
    tmin="icu_admission",
    tmax="icu_discharge",
)
cohort.add_variable(hypernatremia_onset)
# Set this as our eligibility timepoint (study entry)
cohort.set_t_eligible("hypernatremia_onset_time")
# Analyze the results
total_patients = len(cohort.obs)
patients_with_hypernatremia = cohort.obs['hypernatremia_onset_time'].drop_nulls().len()
patients_without = total_patients - patients_with_hypernatremia
print("\nHypernatremia analysis:")
print(f"   Total patients: {total_patients:,}")
print(f"   Developed hypernatremia: {patients_with_hypernatremia:,} ({patients_with_hypernatremia/total_patients*100:.1f}%)")
print(f"   Never hypernatremic: {patients_without:,} ({patients_without/total_patients*100:.1f}%)")
# Show some example onset times
print("\nSample hypernatremia onset times:")
onset_examples = cohort.obs.filter(
    cohort.obs['hypernatremia_onset_time'].is_not_null()
).select(['icu_stay_id', 'icu_admission', 'hypernatremia_onset_time']).head()
print(onset_examples)
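Conceptually, the onset variable combines a row filter (value > 145) with a "first recordtime" selection inside the stay window. A minimal plain-Python sketch of that idea follows; the `first_time_where` helper and the sample values are hypothetical and only illustrate the logic, not how CORR-Vars evaluates it internally.

```python
from datetime import datetime


def first_time_where(measurements, condition, tmin, tmax):
    """Earliest recordtime in [tmin, tmax] whose value satisfies `condition`."""
    candidates = [t for t, v in measurements if tmin <= t <= tmax and condition(v)]
    return min(candidates) if candidates else None


icu_admission = datetime(2024, 1, 1, 8, 0)
icu_discharge = datetime(2024, 1, 3, 8, 0)

# Hypothetical sodium series: crosses 145 mmol/L on the second day.
hypernatremic_stay = [
    (datetime(2024, 1, 1, 9, 0), 140.0),
    (datetime(2024, 1, 2, 9, 0), 148.0),
    (datetime(2024, 1, 2, 21, 0), 151.0),
]
# Hypothetical sodium series that never exceeds the threshold.
normal_stay = [
    (datetime(2024, 1, 1, 9, 0), 139.0),
    (datetime(2024, 1, 2, 9, 0), 142.0),
]


def is_hypernatremic(value):
    return value > 145


onset = first_time_where(hypernatremic_stay, is_hypernatremic, icu_admission, icu_discharge)
no_onset = first_time_where(normal_stay, is_hypernatremic, icu_admission, icu_discharge)
print(onset, no_onset)
```

Stays that never meet the condition end up with a missing onset time, which is what the null check in the inclusion step relies on.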
Step 8: Applying Inclusion Criteria
Inclusion/Exclusion Criteria
Systematic filtering ensures your study population is appropriate for your research question and meets clinical/regulatory standards.
# Apply systematic inclusion criteria
print("Applying inclusion criteria...")
# Track cohort size at each step
initial_size = len(cohort.obs)
print(f"\nStarting cohort size: {initial_size:,} patients")
ct = cohort.include_list([
    {
        "variable": "age_on_admission",
        "operation": ">= 18",
        "label": "Adult patients",
        "operations_done": "Include only patients >= 18 years old"
    },
    {
        "variable": "hypernatremia_onset_time",
        "operation": "is_not_null",  # Has hypernatremia
        "label": "Developed hypernatremia",
        "operations_done": "Include only patients with hypernatremia (Na > 145 mmol/L)"
    }
])
final_size = len(cohort.obs)
print(f"\n✅ Final study cohort: {final_size:,} patients")
print(f"Excluded: {initial_size - final_size:,} patients ({(initial_size-final_size)/initial_size*100:.1f}%)")
# Show the inclusion flow
print("\nInclusion criteria results:")
print(ct.to_pandas())  # Shows detailed inclusion flow table
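The inclusion flow table records how many patients remain after each criterion. The sketch below reproduces that bookkeeping in plain Python on four made-up patients; the record layout and the `flow` structure are assumptions for illustration and do not mirror the object returned by include_list.

```python
# Hypothetical patients; None marks a stay that never became hypernatremic.
patients = [
    {"id": 1, "age_on_admission": 67, "hypernatremia_onset_time": "2024-01-02T09:00"},
    {"id": 2, "age_on_admission": 16, "hypernatremia_onset_time": "2024-01-05T11:00"},
    {"id": 3, "age_on_admission": 54, "hypernatremia_onset_time": None},
    {"id": 4, "age_on_admission": 81, "hypernatremia_onset_time": "2024-02-01T03:00"},
]

# Criteria are applied in order; each one sees only the survivors of the previous step.
criteria = [
    ("Adult patients", lambda p: p["age_on_admission"] >= 18),
    ("Developed hypernatremia", lambda p: p["hypernatremia_onset_time"] is not None),
]

flow = []
remaining = patients
for label, keep in criteria:
    n_before = len(remaining)
    remaining = [p for p in remaining if keep(p)]
    flow.append({
        "label": label,
        "n_before": n_before,
        "n_after": len(remaining),
        "n_excluded": n_before - len(remaining),
    })

for step in flow:
    print(step)
print("Included IDs:", [p["id"] for p in remaining])
```

Because the criteria are sequential, the per-step exclusion counts depend on their order even though the final cohort does not.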
Step 9: Generate Summary Statistics
Table 1: Baseline Characteristics
Table 1 is the standard first table in medical research papers, showing baseline characteristics of your study population, often stratified by key variables.
# Generate a publication-ready Table 1
print("Generating baseline characteristics table...")
table1 = cohort.tableone(
    groupby="sex",  # Compare males vs females
    ignore_cols=[  # Exclude non-clinical variables
        "hypernatremia_onset_time",
        "icu_stay_id",
        "patient_id",
        "case_id"
    ],
    pval=True  # Include statistical tests
)
print("\nBaseline Characteristics by Sex:")
print("=" * 60)
print(table1)
# Save the results
output_filename = "hypernatremia_study_table1.csv"
table1.to_csv(output_filename)
print(f"\nTable saved to: {output_filename}")
# Generate additional summary statistics
print("\nQuick summary statistics:")
print(f"   Mean age: {cohort.obs['age_on_admission'].mean():.1f} ± {cohort.obs['age_on_admission'].std():.1f} years")
print(f"   Male patients: {(cohort.obs['sex'] == 'M').sum():,} ({(cohort.obs['sex'] == 'M').mean()*100:.1f}%)")
print(f"   Hospital mortality: {cohort.obs['inhospital_death'].mean()*100:.1f}%")
print(f"   Median hospital LOS: {cohort.obs['hospital_length_of_stay'].median():.1f} days")
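For readers new to stratified baseline tables, the following plain-Python sketch shows the core computation behind one Table 1 row: count, mean, and sample standard deviation of a variable per group. The ages are invented, and the sketch deliberately leaves out the statistical tests that pval=True adds.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (sex, age_on_admission) pairs.
rows = [("M", 60), ("M", 70), ("F", 50), ("F", 58), ("F", 66)]

# Collect ages per stratum.
ages_by_sex = defaultdict(list)
for sex, age in rows:
    ages_by_sex[sex].append(age)

# One summary tuple per stratum: (n, mean, sample SD).
summary = {
    sex: (len(ages), mean(ages), stdev(ages))
    for sex, ages in ages_by_sex.items()
}

for sex, (n, avg, sd) in sorted(summary.items()):
    print(f"{sex}: n={n}, age {avg:.1f} ± {sd:.1f}")
```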
Step 10: Save Your Work
Reproducible Research
Always save your work! CORR-Vars provides multiple formats for different downstream analyses.
import os
# Save the cohort in CORR-Vars native format (fastest for future loading)
cohort_filename = f"hypernatremia_study_{len(cohort.obs)}patients.corr2"
cohort.save(cohort_filename)
print(f"Cohort saved to: {cohort_filename}")
# Export to CSV files for external analysis (R, SPSS, Excel, etc.)
csv_directory = "hypernatremia_study_exports"
cohort.to_csv(csv_directory)
print(f"CSV exports saved to: {csv_directory}/")
print(f"   - obs.csv: Static variables ({len(cohort.obs)} rows)")
print(f"   - obsm_blood_sodium.csv: Sodium time-series ({len(cohort.obsm['blood_sodium'])} measurements)")
# Quick verification of saved files
print("\n✅ Verification - files created:")
if os.path.exists(cohort_filename):
    file_size = os.path.getsize(cohort_filename) / (1024 * 1024)  # MB
    print(f"   {cohort_filename} ({file_size:.1f} MB)")
if os.path.exists(csv_directory):
    print(f"   {csv_directory}/ directory with CSV files")
if os.path.exists("hypernatremia_study_table1.csv"):
    print("   hypernatremia_study_table1.csv")
Congratulations!
You've completed your first CORR-Vars analysis! You now have:
✅ A filtered cohort of adult ICU patients with hypernatremia
✅ Baseline characteristics and clinical variables
✅ Publication-ready summary statistics
✅ Saved data for further analysis
Ready for the next tutorial?
Tutorial 2: Advanced Variable Creation
🎯 Tutorial Goals
What you'll learn: Create sophisticated clinical variables using time-specific windows, custom aggregations, and mathematical expressions.
Time required: ~45 minutes
Prerequisites: Completion of Tutorial 1
Advanced Clinical Variables
In this tutorial, we'll create variables used in real clinical research: admission vital signs, derived indices like shock index, and time-based calculations. These custom variables and data transformations mirror what ICU outcomes research typically needs.
Custom Aggregation Variables
from corr_vars import Cohort
from corr_vars.sources.aggregation import NativeStatic, DerivedStatic
# Initialize cohort
cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
# Add base variables first
cohort.add_variable("systolic_blood_pressure")
cohort.add_variable("heart_rate")
cohort.add_variable("body_temperature")
Time-Specific Aggregations:
# Variables from specific time windows
admission_vitals = [
    NativeStatic(
        var_name="admission_sbp",
        select="!closest(icu_admission, 0, 2h) value",
        base_var="systolic_blood_pressure"
    ),
    NativeStatic(
        var_name="admission_hr",
        select="!closest(icu_admission, 0, 2h) value",
        base_var="heart_rate"
    ),
    NativeStatic(
        var_name="max_temp_24h",
        select="!max value",
        base_var="body_temperature",
    )
]
# Add all variables
for var in admission_vitals:
    # By setting tmin and tmax, we are filtering the data that is considered for the aggregation
    # Note that !closest still only considers ±2h around the time of interest as specified in the select clause
    var.tmin = ("icu_admission", "-2h")
    var.tmax = ("icu_admission", "+24h")
    cohort.add_variable(var)
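The idea behind a "closest to an anchor time" selector can be sketched in plain Python as follows. This is an illustration only: the `closest_value` helper is hypothetical, and reading the two numeric arguments of !closest as an allowed offset before and after the anchor is an assumption that should be checked against the API documentation.

```python
from datetime import datetime, timedelta


def closest_value(measurements, anchor, before, after):
    """Value measured closest to `anchor` within [anchor - before, anchor + after]."""
    lo, hi = anchor - before, anchor + after
    # Pair each in-window value with its absolute distance from the anchor.
    in_window = [(abs(t - anchor), v) for t, v in measurements if lo <= t <= hi]
    if not in_window:
        return None
    return min(in_window, key=lambda pair: pair[0])[1]


icu_admission = datetime(2024, 1, 1, 10, 0)
# Hypothetical heart-rate measurements around admission.
heart_rate = [
    (datetime(2024, 1, 1, 9, 40), 110),   # 20 min before admission
    (datetime(2024, 1, 1, 10, 45), 95),   # 45 min after admission
    (datetime(2024, 1, 1, 11, 50), 88),
    (datetime(2024, 1, 1, 13, 0), 80),    # outside a 2h window
]

# Only look forward from admission (0 before, 2h after).
hr_forward = closest_value(heart_rate, icu_admission, timedelta(0), timedelta(hours=2))
# Symmetric window (2h before, 2h after).
hr_symmetric = closest_value(heart_rate, icu_admission, timedelta(hours=2), timedelta(hours=2))
print(hr_forward, hr_symmetric)
```

The two calls return different values, which shows why the window definition matters for "admission" vital signs.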
Derived Variables with Expressions:
# Calculate shock index (HR/SBP)
shock_index = DerivedStatic(
    var_name="shock_index_admission",
    requires=["admission_hr", "admission_sbp"],
    expression="admission_hr / admission_sbp"
)
cohort.add_variable(shock_index)
# Classify fever
fever_status = DerivedStatic(
    var_name="fever_in_24h",
    requires=["max_temp_24h"],
    expression="max_temp_24h >= 38.0"
)
cohort.add_variable(fever_status)
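Derived variables are row-wise formulas over existing static variables. The plain-Python sketch below spells out the two expressions above for single patients and makes the handling of missing inputs explicit. Returning None for missing or zero inputs is a choice made for this illustration; how CORR-Vars itself treats such rows is not stated here.

```python
def shock_index(heart_rate, systolic_bp):
    """Heart rate divided by systolic blood pressure; None if not computable."""
    if heart_rate is None or systolic_bp is None or systolic_bp == 0:
        return None
    return heart_rate / systolic_bp


def has_fever(max_temp, threshold=38.0):
    """True if the maximum temperature reaches the threshold; None if missing."""
    if max_temp is None:
        return None
    return max_temp >= threshold


# Hypothetical admission values.
si_shock = shock_index(110, 100)      # elevated shock index
si_missing = shock_index(110, None)   # missing blood pressure
fever = has_fever(38.4)
no_fever = has_fever(37.2)
print(si_shock, si_missing, fever, no_fever)
```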
Tutorial 3: Multi-Source Analysis
🎯 Tutorial Goals
What you'll learn: Combine local CUB-HDP data with international reprodICU datasets for comprehensive, multi-center analysis.
Time required: ~20 minutes
Prerequisites: Access to reprodICU data (contact Finn Fassbender)
International Multi-Center Research
Combine local Berlin data (CUB-HDP) with international critical care data (reprodICU: 469K+ admissions from US/Europe) for larger sample sizes and external validation.
Access Requirements
reprodICU path verification: Check that the path /data02/projects/reprodicubility/reprodICU/reprodICU_files exists on your server
Database access: Contact Finn Fassbender for reprodICU access permissions
Working with multiple data sources enables comprehensive analysis combining local expertise with international generalizability.
import polars as pl
from corr_vars import Cohort
# Configure multiple sources
cohort = Cohort(
    obs_level="icu_stay",
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
        }
    }
)
# Check source distribution
source_counts = cohort.obs["data_source"].value_counts()
print("Patients by data source:")
print(source_counts)
# Add variables (will be available from applicable sources)
cohort.add_variable("age_on_admission")
cohort.add_variable("apache_ii_score")  # May not be available from all sources
# Analyze by source
summary_by_source = cohort.obs.group_by("data_source").agg([
    pl.len().alias("n_patients"),  # pl.count() on older Polars versions
    pl.col("age_on_admission").mean().alias("mean_age"),
    pl.col("apache_ii_score").mean().alias("mean_apache")
])
print(summary_by_source)
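When a variable is missing from one source, a per-source mean has to skip the missing values, and a source with no values at all has no mean. The plain-Python sketch below shows that behavior on invented rows; the numbers and the `mean_or_none` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: the second source provides no APACHE II scores.
rows = [
    {"data_source": "cub_hdp", "age_on_admission": 64, "apache_ii_score": 18},
    {"data_source": "cub_hdp", "age_on_admission": 71, "apache_ii_score": None},
    {"data_source": "reprodicu", "age_on_admission": 58, "apache_ii_score": None},
    {"data_source": "reprodicu", "age_on_admission": 66, "apache_ii_score": None},
]


def mean_or_none(values):
    """Mean over non-missing values; None when every value is missing."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


by_source = defaultdict(list)
for row in rows:
    by_source[row["data_source"]].append(row)

summary = {
    source: {
        "n_patients": len(group),
        "mean_age": mean_or_none([r["age_on_admission"] for r in group]),
        "mean_apache": mean_or_none([r["apache_ii_score"] for r in group]),
    }
    for source, group in by_source.items()
}
print(summary)
```

A missing mean for one source is a signal to check variable availability before pooling or comparing sources.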
Tutorial 4: Working with Time-Series Data
🎯 Tutorial Goals
What you'll learn: Master time-series analysis in CORR-Vars: trend detection, rolling windows, time-to-event analysis, and temporal pattern recognition.
Time required: ~35 minutes
Prerequisites: Basic understanding of Polars DataFrames
Temporal Clinical Patterns
Time-series analysis reveals dynamic clinical patterns that static variables miss: deteriorating lactate trends, blood pressure trajectories, medication responses, and early warning signals.
The example below analyzes dynamic variables to extract clinically meaningful temporal patterns.
import polars as pl
# Assumes `cohort` was created as in the previous tutorials
# Add time-series variables
cohort.add_variable("blood_lactate")
cohort.add_variable("blood_pressure_mean")
cohort.add_variable("spo2")
# Examine time-series patterns
lactate_data = cohort.obsm["blood_lactate"]
# Find patients with rising lactate
# Sort by time first so that first()/last() follow the measurement order
lactate_trends = lactate_data.sort("recordtime").group_by("icu_stay_id").agg([
    pl.col("value").first().alias("first_lactate"),
    pl.col("value").last().alias("last_lactate"),
    pl.col("value").max().alias("max_lactate"),
    pl.len().alias("n_measurements")  # pl.count() on older Polars versions
])
# Calculate lactate change
lactate_trends = lactate_trends.with_columns([
    (pl.col("last_lactate") - pl.col("first_lactate")).alias("lactate_change"),
    (pl.col("max_lactate") / pl.col("first_lactate")).alias("lactate_ratio")
])
print("Lactate trends analysis:")
print(lactate_trends.head())
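First-versus-last differences ignore everything in between. One simple alternative for trend detection is the least-squares slope of the value over time. The sketch below computes it in plain Python for a single stay; the `slope_per_hour` helper and the lactate values are hypothetical and are not part of CORR-Vars.

```python
from datetime import datetime


def slope_per_hour(measurements):
    """Least-squares slope of value over time (units per hour); None if undefined."""
    points = sorted(measurements)  # order by recordtime
    if len(points) < 2:
        return None
    t0 = points[0][0]
    hours = [(t - t0).total_seconds() / 3600 for t, _ in points]
    values = [v for _, v in points]
    h_mean = sum(hours) / len(hours)
    v_mean = sum(values) / len(values)
    denom = sum((h - h_mean) ** 2 for h in hours)
    if denom == 0:
        return None  # all measurements share one timestamp
    numer = sum((h - h_mean) * (v - v_mean) for h, v in zip(hours, values))
    return numer / denom


# Hypothetical lactate series (mmol/L) for two stays.
rising = [
    (datetime(2024, 1, 1, 0, 0), 2.0),
    (datetime(2024, 1, 1, 6, 0), 3.5),
    (datetime(2024, 1, 1, 12, 0), 5.0),
]
falling = [
    (datetime(2024, 1, 1, 0, 0), 4.0),
    (datetime(2024, 1, 1, 4, 0), 2.0),
]

rising_slope = slope_per_hour(rising)
falling_slope = slope_per_hour(falling)
print(rising_slope, falling_slope)
```

A positive slope flags rising lactate even when the last measurement happens to dip, which a plain last-minus-first difference would miss.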
Tutorial 5: Complete Research Workflow Template
🎯 Tutorial Goals
What you'll learn: End-to-end research workflow from hypothesis to publication-ready results using CORR-Vars best practices.
Time required: ~60 minutes
Use case: ICU outcomes study template
Complete ICU Outcomes Study
This template demonstrates a complete research workflow, from initial cohort definition through statistical analysis to publication-ready outputs. It is suited to ICU outcomes research, quality improvement studies, and clinical epidemiology.
from corr_vars import Cohort
from corr_vars.sources.aggregation import NativeStatic, DerivedStatic
import polars as pl
# 1. Initialize cohort with appropriate filters
cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": "c_aufnahme >= '2020-01-01'"  # Study period
        }
    }
)
print(f"Initial cohort: {len(cohort.obs)} ICU stays")
# 2. Add core variables
# Weight/height names must match those required by the BMI variable in step 5;
# confirm the exact variable names in the Variable Explorer
core_vars = [
    "age_on_admission", "sex", "body_weight", "body_height",
    "apache_ii_score", "sofa_score_admission",
    "icu_length_of_stay", "inhospital_death",
    "any_mechanical_ventilation", "any_vasopressors"
]
for var in core_vars:
    try:
        cohort.add_variable(var)
        print(f"Added {var}")
    except Exception as e:
        print(f"Failed to add {var}: {e}")
# 3. Add exposure variable (example: early goal-directed therapy)
egdt_indicator = NativeStatic(
    var_name="received_egdt",
    select="!any",
    base_var="any_vasopressors",  # Simplified example
    tmin="icu_admission",
    tmax=("icu_admission", "+6h")
)
cohort.add_variable(egdt_indicator)
# 4. Apply inclusion/exclusion criteria
ct = cohort.include_list([
    {
        "variable": "age_on_admission",
        "operation": ">= 18",
        "label": "Adult patients"
    },
    {
        "variable": "icu_length_of_stay",
        "operation": "> 1",
        "label": "ICU stay > 24h"
    }
])
cohort.exclude_list([
    {
        "variable": "apache_ii_score",
        "operation": "> 40",
        "label": "Exclude APACHE II > 40"
    }
])
print(f"Final study cohort: {len(cohort.obs)} patients")
# 5. Create analysis variables
bmi = DerivedStatic(
    var_name="bmi",
    requires=["body_weight", "body_height"],
    expression="body_weight / (body_height / 100) ** 2"  # kg / m^2, height in cm
)
cohort.add_variable(bmi)
# 6. Generate descriptive statistics
baseline_table = cohort.tableone(
    groupby="received_egdt",
    ignore_cols=["icu_stay_id"],
    pval=True
)
print("Baseline characteristics by exposure:")
print(baseline_table)
# 7. Save everything
cohort.save(f"icu_outcomes_study_{len(cohort.obs)}patients.corr2")
baseline_table.to_csv("baseline_characteristics.csv")
cohort.to_csv("study_data_export")
print("Study setup complete!")
Next Steps in Your CORR-Vars Journey
Continue Your Learning Journey
You've mastered the fundamentals! Ready to become a CORR-Vars expert?
Variable Explorer
Browse 300+ pre-defined clinical variables with interactive filtering and documentation
API Documentation
Comprehensive technical documentation for advanced features and customization
Variable Creation Guide
Learn to create your own clinical variables and contribute to the community
Problem Solving
Solutions for common issues, performance tips, and debugging strategies
🎯 Suggested Learning Path
Master the Variable Explorer → Understand available clinical variables
🧪 Try Custom Variables → Create variables specific to your research
Contribute Back → Share useful variables with the community
Read Advanced Docs → Explore power-user features
Join the Community → Connect with other CORR-Vars researchers
Happy Coding with CORR-Vars!
"Advancing clinical research through innovative data science"