Troubleshooting Guide#

AI-generated content

This page was generated by AI and has not been fully reviewed by a human. Content may be inaccurate or incomplete. If you find any issues, please create an issue on the GitHub repository.

This guide covers common issues you might encounter while using CORR-Vars and their solutions.

Database Connection Issues#

Problem: Cannot connect to the database#

Error messages you might see:

  • ConnectionError: Failed to connect to hdl-edge01.charite.de
  • AuthenticationError: Invalid credentials
  • TimeoutError: Database connection timed out

Solutions:

  1. Check your password file:

# Verify password file exists and has correct permissions
ls -la ~/password.txt
# Should show: -rw------- (600 permissions)

# If file doesn't exist, create it:
echo "your_password_here" > ~/password.txt
chmod 600 ~/password.txt
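The same check can be done from Python using only the standard library; this is a small sketch that mirrors the shell commands above:

```python
import os
import stat

def password_file_ok(path: str) -> bool:
    """True if the file exists and is readable/writable by the owner only (mode 600)."""
    expanded = os.path.expanduser(path)
    if not os.path.isfile(expanded):
        return False
    return stat.S_IMODE(os.stat(expanded).st_mode) == 0o600

print(password_file_ok("~/password.txt"))
```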
  2. Test database connectivity:

# Test with explicit connection arguments
cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {
                "remote_hostname": "hdl-edge01.charite.de",
                "username": "your_username"  # Replace with your username
            }
        }
    }
)
  3. Check VPN connection:

# Test if you can reach the database server
ping hdl-edge01.charite.de

# Check if you're able to reach the database server on port 8443
curl -I https://hdl-edge01.charite.de:8443
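If curl is not available, a plain TCP connection attempt from Python performs the same reachability test. This is a generic sketch; the hostname and port are the ones from the error message above:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True if the port accepts within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Example: can_reach("hdl-edge01.charite.de", 8443)
```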
  4. Use alternative password methods:

# Method 1: Custom password file location
cohort = Cohort(
    sources={
        "cub_hdp": {
            "password_file": "/path/to/your/password.txt"
        }
    }
)

# Method 2: Interactive password prompt (password_file=False)
cohort = Cohort(
    sources={
        "cub_hdp": {
            "password_file": False  # Will prompt for password
        }
    }
)

Variable Loading Issues#

Problem: Variable not found#

Error: ValueError: Variable 'my_variable' not found

Solutions:

  1. Check variable name spelling:

# Use the variable explorer to find correct names
# Visit: http://10.32.42.190:8080/

# Or check what variables are available
from corr_vars.sources.cub_hdp.mapping import VARS
available_vars = list(VARS["variables"].keys())
print(f"Available variables: {len(available_vars)}")

# Search for similar variable names
search_term = "sodium"
matching = [v for v in available_vars if search_term.lower() in v.lower()]
print(f"Variables containing '{search_term}': {matching}")
  2. Check if variable exists for your observation level:

# Some variables are only available for certain observation levels
cohort_icu = Cohort(obs_level="icu_stay", load_default_vars=False)
cohort_hospital = Cohort(obs_level="hospital_stay", load_default_vars=False)

# Try adding the variable to different observation levels
try:
    cohort_icu.add_variable("your_variable")
    print("Variable available for ICU stays")
except ValueError:
    print("Variable not available for ICU stays")

Problem: Variable extraction takes too long#

Solutions:

  1. Use time constraints to limit data:

# Instead of loading all data
cohort.add_variable("blood_sodium")  # Might be slow

# Load only recent data
cohort.add_variable(
    "blood_sodium",
    tmin=("icu_admission", "-24h"),
    tmax=("icu_admission", "+48h")
)
  2. Filter your cohort first:

# Filter cohort before adding expensive variables
cohort = Cohort(
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": "_d2"  # Only 2 months
        }
    }
)

# Apply inclusion criteria early
cohort.include_list([
    {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])

# Then add expensive variables
cohort.add_variable("complex_variable")
  3. Use load_default_vars=False for large cohorts:

# Don't load all default variables automatically
cohort = Cohort(obs_level="icu_stay", load_default_vars=False)

# Add only the variables you need
needed_vars = ["age_on_admission", "sex", "inhospital_death"]
for var in needed_vars:
    cohort.add_variable(var)

Memory Issues#

Problem: Out of memory errors#

Error messages:

  • MemoryError: Unable to allocate array
  • Kernel died (in Jupyter)

Solutions:

  1. Work with smaller cohorts:

# Use date filters to reduce cohort size
cohort = Cohort(
    sources={
        "cub_hdp": {
            "filters": "c_aufnahme >= '2023-01-01' AND c_aufnahme < '2023-07-01'"
        }
    }
)
  2. Process variables in batches:

# Instead of loading all variables at once
all_vars = ["var1", "var2", "var3", "var4", "var5"]

# Load in smaller batches
batch_size = 2
for i in range(0, len(all_vars), batch_size):
    batch = all_vars[i:i+batch_size]
    for var in batch:
        cohort.add_variable(var)

    # Optionally save intermediate results
    cohort.save(f"cohort_batch_{i//batch_size}.corr2")
  3. Use Polars instead of Pandas (new API):

# Use the new Polars-based API (more memory efficient)
from corr_vars import Cohort  # New API

# Instead of legacy pandas API
# from corr_vars.legacy_v1 import Cohort  # Legacy API
  4. Monitor memory usage:

import psutil
import os

def check_memory():
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Memory usage: {memory_mb:.1f} MB")

# Check memory before/after operations
check_memory()
cohort.add_variable("large_variable")
check_memory()
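If psutil is not installed, the standard library's tracemalloc offers a rough alternative. Note that it only tracks memory allocated by Python itself, not buffers held by native libraries:

```python
import tracemalloc

tracemalloc.start()
data = [0.0] * 1_000_000  # allocate roughly 8 MB of list slots
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024**2:.1f} MB, peak: {peak / 1024**2:.1f} MB")
tracemalloc.stop()
```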

Data Quality Issues#

Problem: Unexpected or missing values#

Solutions:

  1. Inspect data quality:

# Check for missing data
missing_counts = cohort.obs.null_count()
print("Missing data per column:")
print(missing_counts)

# Check data ranges (pl.NUMERIC_DTYPES was removed in recent Polars; use selectors)
import polars.selectors as cs

numeric_cols = cohort.obs.select(cs.numeric()).columns
for col in numeric_cols:
    stats = cohort.obs[col].describe()
    print(f"\n{col} statistics:")
    print(stats)
  2. Validate time-series data:

# Check dynamic variable data quality
import polars as pl

sodium_data = cohort.obsm["blood_sodium"]

print(f"Total sodium measurements: {len(sodium_data)}")
print(f"Patients with sodium data: {sodium_data['icu_stay_id'].n_unique()}")
print(f"Value range: {sodium_data['value'].min()} - {sodium_data['value'].max()}")

# Check for impossible values
impossible = sodium_data.filter(
    (pl.col("value") < 100) | (pl.col("value") > 200)
)
print(f"Impossible sodium values: {len(impossible)}")
  3. Apply data cleaning:

from corr_vars.sources.cub_hdp import Variable

# Add cleaning rules when extracting variables
clean_sodium = Variable(
    var_name="blood_sodium_clean",
    table="it_ishmed_labor",
    where="c_katalog_leistungtext LIKE '%atrium%'",
    value_dtype="DOUBLE",
    cleaning={"value": {"low": 120, "high": 180}},  # Physiologically plausible
    dynamic=True
)
cohort.add_variable(clean_sodium)

Custom Variable Issues#

Problem: Custom function not working#

Error: AttributeError: module has no attribute 'my_function'

Solutions:

  1. Check function definition location:

# Functions must be defined in the appropriate variables.py file
# For CUB-HDP: src/corr_vars/sources/cub_hdp/mapping/variables.py

# Check if function exists
import corr_vars.sources.cub_hdp.mapping.variables as var_funcs
print(dir(var_funcs))  # List all available functions
  2. Test custom function separately:

# Test your function with sample data
def my_custom_function(var, cohort):
    # Your function logic here
    print(f"Processing variable: {var.var_name}")
    print(f"Required variables: {var.requires}")
    print(f"Cohort size: {len(cohort.obs)}")

    # Return test data
    import polars as pl
    return pl.DataFrame({
        "icu_stay_id": cohort.obs["icu_stay_id"],
        "test_value": [1.0] * len(cohort.obs)
    })

# Test with a simple variable
from corr_vars.sources.cub_hdp import Variable
test_var = Variable(
    var_name="test_custom",
    requires=["age_on_admission"],
    complex=True,
    dynamic=False,
    py=my_custom_function
)
cohort.add_variable(test_var)

Performance Issues#

Problem: Queries are very slow#

Solutions:

  1. Optimize database queries:

# Use specific filters to reduce data volume
cohort = Cohort(
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": """
                c_aufnahme >= '2023-01-01'
                AND c_aufnahme < '2024-01-01'
            """
        }
    }
)
  2. Load variables strategically:

# Load cheap variables first for filtering
cohort.add_variable("age_on_admission")  # Fast
cohort.add_variable("sex")               # Fast

# Apply filters
cohort.include_list([
    {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])

# Then load expensive variables on smaller cohort
cohort.add_variable("complex_time_series_variable")  # Slower
  3. Use caching:

# Save intermediate results
cohort.save("cohort_after_filtering.corr2")

# Load saved cohort instead of re-extracting
cohort = Cohort.load("cohort_after_filtering.corr2")

File and Save/Load Issues#

Problem: Cannot save or load cohort files#

Solutions:

  1. Check file permissions and paths:

# Check if directory is writable
touch test_file.txt
rm test_file.txt

# Use absolute paths
import os
save_path = os.path.abspath("my_cohort.corr2")
print(f"Saving to: {save_path}")
  2. Handle large file sizes:

# For very large cohorts, save without some variables
large_vars = ["variable_with_millions_of_rows"]

# Temporarily remove large variables
saved_data = {}
for var in large_vars:
    if var in cohort.obsm:
        saved_data[var] = cohort._obsm.pop(var)

# Save smaller cohort
cohort.save("cohort_without_large_vars.corr2")

# Restore variables
cohort._obsm.update(saved_data)
  3. Migrate from old file formats:

# Load old format and save as new
try:
    old_cohort = Cohort.load("old_file.corr")  # Legacy format
    old_cohort.save("new_file.corr2")          # New format
    print("Successfully migrated file format")
except Exception as e:
    print(f"Migration failed: {e}")

Getting Help#

When you encounter issues not covered here:

  1. Check the logs:

# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Or configure CORR-Vars logging
cohort = Cohort(
    logger_args={
        "level": logging.DEBUG,
        "colored_output": True
    }
)
  2. Create a minimal reproducible example:

# Simplify your code to the minimal case that reproduces the error
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "filters": "_d1"}}
)

# Try the operation that's failing
cohort.add_variable("problematic_variable")
  3. Gather system information:

# Get debug information
cohort.debug_print()

# Check versions
import corr_vars
import polars as pl
import pandas as pd

print(f"CORR-Vars version: {corr_vars.__version__}")
print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")
  4. Contact support:

  • Create an issue on GitHub: cub-corr/corr-vars#issues

  • Include your minimal reproducible example

  • Include the error message and debug information

  • Describe what you expected to happen vs. what actually happened

Remember: Most issues can be resolved by starting with a smaller dataset and gradually increasing complexity!