Contributing New Variables


AI-generated content

This page was generated by AI and has not been fully reviewed by a human. Content may be inaccurate or incomplete. If you find any issues, please create an issue on the GitHub repository.

🚀 Contribute to CORR-Vars

Help expand the CORR-Vars variable catalog! Follow this step-by-step guide to add new clinical variables that benefit the entire research community.

Overview

Contributing new variables to CORR-Vars involves a structured development workflow that ensures quality, reproducibility, and proper integration. This tutorial walks you through the complete process from idea to merged pull request.

📋 Prerequisites

Access to the CORR-Vars GitHub repository
Development environment set up (see dev-environment-setup)
Basic familiarity with Git and GitHub workflows
Understanding of the variable types (see Custom Variables Guide)

Development Workflow Overview

Step 1: Create a GitHub Issue

🎯 Why Start with an Issue?

GitHub issues help track variable requests, avoid duplicate work, and provide a place for community discussion about the clinical relevance and implementation approach.

Check for Existing Issues

Before creating a new issue, search existing issues to avoid duplicates:

Visit the CORR-Vars GitHub repository
Click on the Issues tab
Search for keywords related to your variable (e.g., “lactate”, “SOFA”, “mechanical ventilation”)

Create a New Issue

If no existing issue covers your variable, create a new one:

Title: Add [Variable Name] variable
Example: "Add blood lactate clearance variable"

Issue Template:

## Variable Request: [Variable Name]

### Clinical Context
- **Purpose**: Brief description of clinical use case
- **Population**: Target patient population (ICU, hospital, specific conditions)
- **Evidence**: Reference to literature or clinical guidelines if available

### Technical Requirements
- **Variable Type**: Static/Dynamic, Derived/Native
- **Data Sources**: Which database tables/sources contain the data
- **Dependencies**: Any variables this depends on
- **Time Constraints**: Relevant time windows (admission, 24h, etc.)

### Expected Output
- **Data Type**: Numeric/Boolean/Categorical
- **Units**: Expected units of measurement
- **Range**: Expected value ranges
- **Example**: Sample output for a few patients

### Additional Information
- **Priority**: High/Medium/Low
- **Complexity**: Simple aggregation/Complex calculation/Database extraction
- **Timeline**: When this variable is needed

Example Issue:

Step 2: Set Up Development Environment and Create Feature Branch

📚 Development Environment Setup

If you haven’t set up your development environment yet, follow the detailed setup guide:

Written Guide: dev-environment-setup
Video Tutorial: YouTube: CORR-Vars Development Setup

Clone and Setup (if first time)

# Clone the repository
git clone https://github.com/CUB-CORR/corr-vars.git
cd corr-vars

# Install in development mode
pip install -e .

# Install development dependencies
pip install pytest jupyter black ruff

Create Feature Branch

# Ensure you're on main and up to date
git checkout main
git pull origin main

# Create feature branch (use descriptive name referencing issue)
git checkout -b feature/add-lactate-clearance-variable

# Alternative naming patterns:
# git checkout -b feature/issue-123-lactate-clearance
# git checkout -b variable/blood-lactate-clearance

🌿 Branch Naming Conventions

Use descriptive branch names that reference the GitHub issue:

feature/add-[variable-name]
variable/[clinical-concept]
feature/issue-[number]-[description]
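
The naming patterns above can be generated mechanically. The following is a minimal sketch, not part of CORR-Vars: the helper name `branch_name` and its parameters are hypothetical, and it only assumes that branch names should be lowercase with hyphen-separated words.

```python
import re
from typing import Optional


def branch_name(description: str, issue: Optional[int] = None, prefix: str = "feature") -> str:
    """Build a branch name such as feature/issue-123-lactate-clearance."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    if issue is not None:
        return f"{prefix}/issue-{issue}-{slug}"
    return f"{prefix}/{slug}"


print(branch_name("Lactate clearance", issue=123))                # feature/issue-123-lactate-clearance
print(branch_name("Blood lactate clearance", prefix="variable"))  # variable/blood-lactate-clearance
```

The result can be passed directly to `git checkout -b`.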

Step 3: Explore and Prototype in Jupyter Notebook

🔬 Exploration Phase Goals

Understand the data structure and quality
Test different calculation approaches
Validate results against clinical expectations
Document any edge cases or limitations

Start Jupyter for Exploration

# Navigate to your development directory
cd /path/to/corr-vars

# Start Jupyter notebook
jupyter lab

Create Exploration Notebook

Create a new notebook: exploration/[variable-name]-development.ipynb

Exploration Template:

# Variable Development: Blood Lactate Clearance
# GitHub Issue: #123
# Developer: [Your Name]
# Date: [Today's Date]

import polars as pl
import pandas as pd
from corr_vars import Cohort
from corr_vars.sources.aggregation import DerivedStatic
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load a test cohort
cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": "_d1"  # Small dataset for testing
        }
    }
)

print(f"Test cohort size: {len(cohort.obs)} patients")

Data Exploration:

# 2. Add required base variables
cohort.add_variable("blood_lactate")

# 3. Explore the data structure
lactate_data = cohort.obsm["blood_lactate"]
print(f"Lactate measurements: {len(lactate_data)} records")
print(f"Patients with lactate: {lactate_data['icu_stay_id'].n_unique()}")
print(f"Value range: {lactate_data['value'].min()} - {lactate_data['value'].max()}")

# 4. Check data quality
print(f"Missing values: {lactate_data['value'].null_count()}")
print(f"Negative values: {len(lactate_data.filter(pl.col('value') < 0))}")

# 5. Sample data inspection
sample_patient = lactate_data['icu_stay_id'].unique()[0]
patient_data = lactate_data.filter(pl.col('icu_stay_id') == sample_patient)
print(f"Sample patient {sample_patient}:")
print(patient_data.to_pandas())

Prototype the Calculation:

# 6. Prototype lactate clearance calculation
def calculate_lactate_clearance_prototype(lactate_df):
    """
    Prototype function to calculate lactate clearance.

    Clearance = (Initial - Final) / Initial * 100
    """

    # Sort chronologically so first()/last() return the earliest and latest values,
    # then keep measurements within 24 hours of each patient's first record
    result = lactate_df.sort("recordtime").filter(
        pl.col("recordtime") <= pl.col("recordtime").min().over("icu_stay_id") + pl.duration(hours=24)
    ).group_by("icu_stay_id").agg([
        pl.col("value").first().alias("initial_lactate"),
        pl.col("value").last().alias("final_lactate"),
        pl.col("recordtime").first().alias("initial_time"),
        pl.col("recordtime").last().alias("final_time"),
        pl.len().alias("n_measurements")  # pl.len() replaces the deprecated pl.count()
    ]).with_columns([
        # Calculate clearance percentage
        ((pl.col("initial_lactate") - pl.col("final_lactate")) / pl.col("initial_lactate") * 100)
        .alias("lactate_clearance_24h")
    ])

    return result

# Test the prototype
clearance_result = calculate_lactate_clearance_prototype(lactate_data)
print("Lactate clearance results:")
print(clearance_result.to_pandas().head())
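
As a quick sanity check of the formula in the docstring, the arithmetic can be worked through on two made-up value pairs. This is only an illustration; the function name `lactate_clearance` and the sample values are hypothetical and not taken from any cohort.

```python
def lactate_clearance(initial: float, final: float) -> float:
    """Return lactate clearance in percent: (initial - final) / initial * 100."""
    return (initial - final) / initial * 100


# Lactate fell from 4.0 to 2.0 mmol/L: half of it was cleared
print(lactate_clearance(4.0, 2.0))  # 50.0
# Lactate rose from 2.0 to 3.0 mmol/L: clearance is negative
print(lactate_clearance(2.0, 3.0))  # -50.0
```

A positive result means lactate decreased over the window, and a negative result means it increased. An initial value of zero would raise `ZeroDivisionError` in this plain-Python form, which is why the baseline needs a guard.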

Validate Results:

# 7. Validate results
# Check for reasonable ranges
clearance_stats = clearance_result["lactate_clearance_24h"].describe()
print("Clearance statistics:")
print(clearance_stats)

# Plot distribution
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
clearance_result.to_pandas()["lactate_clearance_24h"].hist(bins=30)
plt.title("Lactate Clearance Distribution")
plt.xlabel("Clearance (%)")

plt.subplot(1, 2, 2)
plt.scatter(clearance_result.to_pandas()["initial_lactate"],
            clearance_result.to_pandas()["lactate_clearance_24h"])
plt.xlabel("Initial Lactate (mmol/L)")
plt.ylabel("Clearance (%)")
plt.title("Clearance vs Initial Lactate")
plt.tight_layout()
plt.show()

# Identify potential edge cases
extreme_cases = clearance_result.filter(
    (pl.col("lactate_clearance_24h") < -50) | (pl.col("lactate_clearance_24h") > 100)
)
print(f"Extreme clearance values: {len(extreme_cases)} cases")
if len(extreme_cases) > 0:
    print(extreme_cases.to_pandas())

Refine the Implementation:

# 8. Refine based on exploration findings
def calculate_lactate_clearance_v2(var, cohort):
    """
    Refined lactate clearance calculation for CORR-Vars.
    """

    lactate_var = var.required_vars["blood_lactate"]
    lactate_data = lactate_var.data

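    # Calculate clearance with improved logic: sort chronologically and keep
    # only the first 24h after each patient's first record
    result = lactate_data.sort("recordtime").filter(
        pl.col("recordtime") <= pl.col("recordtime").min().over("icu_stay_id") + pl.duration(hours=24)
    ).group_by("icu_stay_id").agg([
        # Earliest and latest measurements within the window
        pl.col("value").first().alias("initial_lactate"),
        pl.col("value").last().alias("final_lactate"),
        pl.len().alias("n_measurements")  # pl.len() replaces the deprecated pl.count()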
    ]).filter(
        # Only include patients with at least 2 measurements
        pl.col("n_measurements") >= 2
    ).with_columns([
        # Calculate clearance with bounds checking
        pl.when(pl.col("initial_lactate") > 0)
        .then(
            ((pl.col("initial_lactate") - pl.col("final_lactate")) / pl.col("initial_lactate") * 100)
            .clip(-200, 200)  # Reasonable bounds
        )
        .otherwise(None)
        .alias("lactate_clearance_24h")
    ]).select(["icu_stay_id", "lactate_clearance_24h"])

    return result

# Test refined version
print("Testing refined calculation...")
# [Include test code here]

Step 4: Implement Clean Code in Configuration Files

🧹 Clean Implementation Goals

Minimal, production-ready code
Proper error handling
Clear documentation
Follows project conventions

Based on your exploration, implement the clean version in the appropriate configuration files.

For Simple Variables (No Custom Function Needed)

Most variables can be implemented using only vars.json without custom Python code:

Simple Aggregation

File: src/corr_vars/sources/cub_hdp/mapping/vars.json

{
  "variables": {
    "first_lactate_value": {
      "type": "aggregation",
      "base_var": "blood_lactate",
      "select": "!first value",
      "dynamic": false,
      "description": "First blood lactate measurement during ICU stay"
    },
    "max_lactate_24h": {
      "type": "aggregation",
      "base_var": "blood_lactate",
      "select": "!max value",
      "tmin": "icu_admission",
      "tmax": ["icu_admission", "+24h"],
      "dynamic": false,
      "description": "Maximum blood lactate in first 24 hours of ICU"
    }
  }
}
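
One way to read the `max_lactate_24h` entry is sketched below in plain Python. It assumes that `tmin` and `tmax` describe an inclusive window that starts at ICU admission and ends 24 hours later; how CORR-Vars actually treats the window boundaries is not stated here. The function name `max_in_window` and the sample measurements are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def max_in_window(measurements, start: datetime, hours: int = 24) -> Optional[float]:
    """Maximum value among (time, value) pairs with start <= time <= start + hours."""
    end = start + timedelta(hours=hours)
    values = [value for time, value in measurements if start <= time <= end]
    return max(values) if values else None


admission = datetime(2024, 1, 1, 8, 0)
measurements = [
    (datetime(2024, 1, 1, 9, 0), 2.1),
    (datetime(2024, 1, 1, 20, 0), 3.4),
    (datetime(2024, 1, 2, 10, 0), 5.0),  # 26 hours after admission, outside the window
]
print(max_in_window(measurements, admission))  # 3.4
```

A stay with no measurement inside the window yields `None` in this sketch, which corresponds to a missing value for that patient.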

Expression-Based

File: src/corr_vars/sources/aggregation/vars.json

{
  "variables": {
    "shock_index_admission": {
      "type": "derived_static",
      "requires": ["admission_heart_rate", "admission_sbp"],
      "expression": "admission_heart_rate / admission_sbp",
      "cleaning": {"value": {"low": 0.1, "high": 5.0}},
      "dynamic": false,
      "description": "Shock index calculated from admission vital signs"
    }
  }
}
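
The `expression` and `cleaning` fields of `shock_index_admission` can be illustrated with a small plain-Python sketch. It assumes that a value outside the `low`/`high` bounds is treated as missing. That is an assumption for illustration, since the library might clip or drop such values instead. The function name `shock_index` is hypothetical.

```python
from typing import Optional


def shock_index(heart_rate: Optional[float], sbp: Optional[float],
                low: float = 0.1, high: float = 5.0) -> Optional[float]:
    """Heart rate divided by systolic blood pressure, or None if unusable."""
    if heart_rate is None or sbp is None or sbp <= 0:
        return None
    value = heart_rate / sbp
    # Assumed cleaning behaviour: values outside [low, high] become missing
    return value if low <= value <= high else None


print(shock_index(90, 120))   # 0.75
print(shock_index(120, 20))   # None, because 6.0 exceeds the upper bound
print(shock_index(80, None))  # None, because a required input is missing
```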

For Complex Variables (Custom Function Required)

For complex calculations that cannot be expressed as simple aggregations or expressions:

Step 4a: Add to vars.json

{
  "variables": {
    "lactate_clearance_24h": {
      "type": "complex",
      "requires": ["blood_lactate"],
      "dynamic": false,
      "py_ready_polars": true,
      "description": "Blood lactate clearance percentage over first 24 hours of ICU stay"
    }
  }
}

Step 4b: Implement Function in variables.py

File: src/corr_vars/sources/cub_hdp/mapping/variables.py

def lactate_clearance_24h(var, cohort):
    """
    Calculate blood lactate clearance over the first 24 hours of ICU stay.

    Clearance is calculated as: (Initial - Final) / Initial * 100

    Args:
        var: Variable object containing metadata and required variables
        cohort: Cohort object with access to patient data

    Returns:
        polars.DataFrame: DataFrame with icu_stay_id and lactate_clearance_24h columns

    Notes:
        - Requires at least 2 lactate measurements within 24h of ICU admission
        - Initial lactate must be > 0 to calculate meaningful clearance
        - Results are bounded between -200% and +200% to handle outliers
        - Missing or insufficient data results in null values
    """

    try:
        # Get lactate data
        lactate_var = var.required_vars["blood_lactate"]
        if lactate_var.data is None:
            raise ValueError("No lactate data available")

        lactate_data = lactate_var.data

        # Calculate 24h window from ICU admission for each patient
        cohort_times = cohort.obs.select(["icu_stay_id", "icu_admission"])

        # Join with lactate data and filter to 24h window
        windowed_data = lactate_data.join(
            cohort_times, on="icu_stay_id", how="inner"
        ).filter(
            (pl.col("recordtime") >= pl.col("icu_admission")) &
            (pl.col("recordtime") <= pl.col("icu_admission") + pl.duration(hours=24))
        )

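        # Calculate clearance; sort by time so first()/last() are chronological
        result = windowed_data.sort("recordtime").group_by("icu_stay_id").agg([
            pl.col("value").first().alias("initial_lactate"),
            pl.col("value").last().alias("final_lactate"),
            pl.len().alias("n_measurements")  # pl.len() replaces the deprecated pl.count()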
        ]).filter(
            # Require at least 2 measurements
            pl.col("n_measurements") >= 2
        ).with_columns([
            # Calculate clearance with proper bounds
            pl.when(pl.col("initial_lactate") > 0)
            .then(
                ((pl.col("initial_lactate") - pl.col("final_lactate")) /
                 pl.col("initial_lactate") * 100).clip(-200, 200)
            )
            .otherwise(None)
            .alias("lactate_clearance_24h")
        ])

        # Return only required columns
        return result.select(["icu_stay_id", "lactate_clearance_24h"])

    except Exception as e:
        # Log error and return empty result with correct schema
        print(f"Error calculating lactate clearance: {e}")
        return pl.DataFrame({
            "icu_stay_id": [],
            "lactate_clearance_24h": []
        })

🔧 When to Use variables.py vs vars.json Only

Use vars.json alone for:

Simple aggregations (first, last, max, min, mean, count)
Basic expressions involving arithmetic operations
Standard time window filtering

Use variables.py for:

Complex multi-step calculations
Custom business logic or clinical rules
Advanced data transformations
Error handling for edge cases
Integration of multiple data sources

Step 5: Verify Everything Works

✅ Testing Checklist

Thoroughly test your implementation before creating a pull request.

Create a Test Script

Create test_[variable_name].py in your development directory:

"""
Test script for lactate clearance variable
Run this before submitting PR to ensure everything works
"""

from corr_vars import Cohort
import polars as pl

def test_lactate_clearance():
    """Test the new lactate clearance variable"""

    print("🧪 Testing lactate clearance variable...")

    # 1. Create test cohort
    cohort = Cohort(
        obs_level="icu_stay",
        load_default_vars=False,
        sources={
            "cub_hdp": {
                "database": "db_hypercapnia_prepared",
                "password_file": True,
                "filters": "_d1"  # Small test dataset
            }
        }
    )

    print(f"✓ Test cohort created: {len(cohort.obs)} patients")

    # 2. Add the new variable
    try:
        cohort.add_variable("lactate_clearance_24h")
        print("✓ Variable added successfully")
    except Exception as e:
        print(f"✗ Error adding variable: {e}")
        return False

    # 3. Check results
    if "lactate_clearance_24h" not in cohort.obs.columns:
        print("✗ Variable not found in cohort.obs")
        return False

    clearance_data = cohort.obs["lactate_clearance_24h"]
    n_valid = clearance_data.drop_nulls().len()
    n_total = len(clearance_data)

    print(f"✓ Results: {n_valid}/{n_total} patients have clearance values")

    # 4. Validate data quality
    if n_valid > 0:
        valid = clearance_data.drop_nulls()
        print(f"✓ Value range: {valid.min():.1f}% to {valid.max():.1f}%")

        # Check for reasonable values (a Series is filtered with a boolean mask,
        # not with a pl.col() expression)
        extreme_count = valid.filter((valid < -200) | (valid > 200)).len()

        if extreme_count > 0:
            print(f"⚠️  Warning: {extreme_count} extreme values found")
        else:
            print("✓ All values within reasonable range")

    print("🎉 Testing completed successfully!")
    return True

if __name__ == "__main__":
    success = test_lactate_clearance()
    if success:
        print("\n✅ Ready to create pull request!")
    else:
        print("\n❌ Fix issues before creating pull request")

Run the Test

python test_lactate_clearance.py

Manual Verification

# Quick manual check in Jupyter or Python shell
from corr_vars import Cohort

cohort = Cohort(obs_level="icu_stay", sources={"cub_hdp": {"filters": "_d1"}})
cohort.add_variable("lactate_clearance_24h")

# Check a few patients manually
print(cohort.obs.select(["icu_stay_id", "lactate_clearance_24h"]).head())

Step 6: Create Pull Request

🔄 Pull Request Best Practices

A well-structured pull request makes review faster and increases the likelihood of acceptance.

Commit Your Changes

# Add your changes
git add src/corr_vars/sources/cub_hdp/mapping/vars.json
git add src/corr_vars/sources/cub_hdp/mapping/variables.py  # if needed

# Commit with descriptive message
git commit -m "Add lactate clearance variable (closes #123)

- Implements 24-hour lactate clearance calculation
- Requires minimum 2 measurements for reliability
- Handles edge cases with proper bounds (-200% to +200%)
- Tested on sample cohort with good data quality"

# Push to your feature branch
git push origin feature/add-lactate-clearance-variable

Create Pull Request on GitHub

Visit the CORR-Vars repository
Click “Compare & pull request” (should appear after pushing)
Fill out the pull request template:

PR Template:

## Add [Variable Name] Variable

Closes #[issue-number]

### Summary
Brief description of what this variable calculates and its clinical relevance.

### Changes Made
- [ ] Added variable definition to `vars.json`
- [ ] Implemented calculation function in `variables.py` (if needed)
- [ ] Tested on sample cohort
- [ ] Verified data quality and reasonable value ranges

### Variable Details
- **Type**: Static/Dynamic, Derived/Native
- **Dependencies**: List of required variables
- **Output**: Data type and expected range
- **Clinical Use**: Brief clinical context

### Testing
- [ ] Manual testing completed
- [ ] Edge cases handled
- [ ] Performance acceptable on large cohorts
- [ ] Documentation/comments added

### Review Checklist
- [ ] Code follows project style guidelines
- [ ] Variable name is descriptive and follows naming conventions
- [ ] Function includes proper docstring
- [ ] Error handling implemented
- [ ] No breaking changes to existing functionality

Example Pull Request:

Step 7: Trigger Unit Tests with Interactive Auth

🔗 Automated Testing System

CORR-Vars uses automated unit tests to ensure new variables don’t break existing functionality and work correctly across different scenarios.

Wait for Bot Comment

After creating your pull request, an automated bot will post a comment with an interactive authentication link. This typically appears within 1-2 minutes:

🤖 **CORR-Vars Test Bot**

Thanks for your contribution! To run the automated tests, please click the link below to authenticate:

🔗 **[Click here to start unit tests](https://auth.corr-vars.charite.de/pr/123/auth)**

This will run the full test suite including:
✅ Unit tests for all existing variables
✅ Integration tests with your new variable
✅ Performance benchmarks
✅ Data quality checks

Click the Authentication Link

Click the authentication link in the bot comment
Log in with your Charité credentials
Authorize the test run for your pull request
Tests will start automatically (typically take 10-15 minutes)

Monitor Test Progress

The bot will update the PR with test progress:

🏃‍♂️ **Tests Running...**

Current Status:
✅ Code style checks (passed)
✅ Unit tests (passed)
🔄 Integration tests (running...)
⏳ Performance benchmarks (queued)

Test Results

Tests will complete with one of these outcomes:

✅ Tests Passed

🎉 **All Tests Passed!**

✅ Code style: All checks passed
✅ Unit tests: 847/847 passed
✅ Integration tests: All variables work correctly
✅ Performance: New variable adds <1s overhead
✅ Data quality: No issues detected

Your PR is ready for review! 🚀

❌ Tests Failed

❌ **Some Tests Failed**

✅ Code style: All checks passed
❌ Unit tests: 2/847 failed
✅ Integration tests: All variables work correctly
⚠️  Performance: New variable adds 15s overhead (threshold: 10s)

Please review the failed tests and update your code.

**Failed Tests:**
- test_lactate_clearance_edge_cases: Division by zero error
- test_lactate_clearance_performance: Timeout on large cohort

Fix Test Failures (if needed)

If tests fail, examine the error messages and update your code:

# Make fixes based on test feedback
git add -u
git commit -m "Fix edge case handling for zero lactate values"
git push origin feature/add-lactate-clearance-variable

# Tests will automatically re-run on the updated PR

Step 8: Request Review and Merge

👥 Code Review Process

Code review ensures quality, shares knowledge, and catches issues before merge. Be patient and responsive to feedback!

Tag Reviewers

Once tests pass, tag appropriate reviewers in a comment:

@mthiele @nkronenberg Tests are passing! This lactate clearance variable is ready for review.

Key points for review:
- Clinical validation: Formula matches literature standard
- Edge case handling: Tested with zero/negative lactate values
- Performance: <2s overhead on 10k patient cohorts
- Documentation: Full docstring with clinical context

Respond to Review Feedback

Reviewers may request changes or ask questions:

Address Feedback:

# Make requested changes
# Edit variables.py to add validation and rename variable

git add -u
git commit -m "Address review feedback:

- Add validation for lactate >20 mmol/L
- Rename to lactate_clearance_24h for clarity
- Add TODO for 6-hour clearance variant"

git push origin feature/add-lactate-clearance-variable

Final Approval and Merge

Once reviewers approve:

Maintainer merges your pull request
Variable becomes available in the next release
GitHub issue closes automatically
Your contribution is live! 🎉

🎊 Congratulations!

Your variable is now part of CORR-Vars and available to researchers worldwide! You’ve contributed to advancing clinical research with real-world data.

Post-Merge Follow-up

Clean Up Your Local Environment

# Switch back to main and update
git checkout main
git pull origin main

# Delete your feature branch (optional)
git branch -d feature/add-lactate-clearance-variable

Monitor Usage and Feedback

Watch for any issues reported with your variable
Consider contributing documentation or examples
Think about related variables that could be added

Share Your Success

Add your contribution to your CV/portfolio
Share with your research team
Consider presenting at department meetings

Common Pitfalls and Tips

⚠️ Common Mistakes to Avoid

Learn from others’ experiences to avoid these common issues:

🐛 Data Quality Issues

Problem: Not handling missing/invalid data
Solution: Always validate inputs and handle edge cases
Example: Check for null values, negative measurements, extreme outliers

⏱️ Performance Problems

Problem: Slow calculations on large cohorts
Solution: Use efficient Polars operations, avoid loops
Example: Use .group_by() instead of patient-by-patient processing

📝 Poor Documentation

Problem: Unclear variable purpose or calculation
Solution: Write comprehensive docstrings with clinical context
Example: Include formula, units, expected ranges, clinical use

🧪 Insufficient Testing

Problem: Edge cases not discovered until production
Solution: Test with diverse patient populations and data scenarios
Example: Test with single measurements, extreme values, missing data
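
The scenarios listed above can be captured as small assertions before any database is involved. The following sketch tests a plain-Python stand-in for the clearance logic. `clearance_or_none` is a hypothetical helper written for this example, not a CORR-Vars function, and it assumes its input is already in chronological order.

```python
from typing import Optional, Sequence


def clearance_or_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Clearance in percent from chronologically ordered values, or None if undefined."""
    valid = [v for v in values if v is not None]
    if len(valid) < 2 or valid[0] <= 0:
        return None  # a single measurement or a non-positive baseline
    clearance = (valid[0] - valid[-1]) / valid[0] * 100
    return max(-200.0, min(200.0, clearance))  # same bounds as in the tutorial


def test_edge_cases():
    assert clearance_or_none([4.0]) is None             # single measurement
    assert clearance_or_none([0.0, 2.0]) is None        # zero baseline, no division by zero
    assert clearance_or_none([None, None]) is None      # missing data only
    assert clearance_or_none([4.0, None, 2.0]) == 50.0  # nulls are skipped
    assert clearance_or_none([1.0, 10.0]) == -200.0     # extreme rise is bounded


test_edge_cases()
print("all edge cases passed")
```

The same cases can be run with pytest by placing the function and `test_edge_cases` in a test module.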

Pro Tips for Success:

💡 Expert Tips

Start Simple: Begin with basic aggregations before complex calculations
Clinical Validation: Verify results match clinical expectations
Performance First: Optimize for speed from the beginning
Document Everything: Future users (including yourself) will thank you
Ask for Help: Engage with the community early and often
Iterative Development: Get feedback on design before full implementation

Additional Resources

📚 Helpful Links

Custom Variables Guide - Detailed guide to variable types and patterns
Troubleshooting Guide - Solutions for common development issues
GitHub Repository - Source code and issues
Variable Explorer - Browse existing variables
Development Chat - Get help from developers

Development Tools:

Code Style: Follow PEP 8, use black for formatting
Testing: Write unit tests for complex functions
Documentation: Use clear docstrings and type hints
Version Control: Make atomic commits with descriptive messages

🤝 Join the Community

Become an active contributor to the CORR-Vars ecosystem:

Contribute Variables: Start with this tutorial
Improve Documentation: Help other researchers learn
Report Issues: Identify bugs and suggest improvements
Share Experience: Present your work at conferences
Mentor Others: Help new contributors get started

—

Happy contributing! Your clinical expertise combined with this development workflow will help advance medical research worldwide. 🚀
