# Troubleshooting Guide
> **AI-generated content**
>
> This page was generated by AI and has not been fully reviewed by a human. Content may be inaccurate or incomplete. If you find any issues, please create an issue on the GitHub repository.
This guide covers common issues you might encounter while using CORR-Vars and their solutions.
## Database Connection Issues

### Problem: Cannot connect to the database

**Error messages you might see:**

- `ConnectionError: Failed to connect to hdl-edge01.charite.de`
- `AuthenticationError: Invalid credentials`
- `TimeoutError: Database connection timed out`

**Solutions:**
**Check your password file:**

```bash
# Verify the password file exists and has the correct permissions
ls -la ~/password.txt
# Should show: -rw------- (600 permissions)

# If the file doesn't exist, create it:
echo "your_password_here" > ~/password.txt
chmod 600 ~/password.txt
```
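To run the same check from Python, here is a minimal sketch using only the standard library (the `~/password.txt` path matches the default location shown above):

```python
import os
import stat

path = os.path.expanduser("~/password.txt")
if not os.path.exists(path):
    print("Password file is missing")
else:
    # 0o600 means read/write for the owner only
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        print(f"Unexpected permissions: {oct(mode)} (expected 0o600)")
```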
**Test database connectivity:**

```python
# Test with explicit connection arguments
cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {
                "remote_hostname": "hdl-edge01.charite.de",
                "username": "your_username",  # Replace with your username
            },
        }
    },
)
```
**Check your VPN connection:**

```bash
# Test whether you can reach the database server
ping hdl-edge01.charite.de

# Check whether the database server responds on port 8443
curl -I https://hdl-edge01.charite.de:8443
```
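If you prefer to run this check from Python, a socket-level probe (host and port taken from the commands above) helps distinguish network problems from credential problems:

```python
import socket

HOST, PORT = "hdl-edge01.charite.de", 8443

try:
    # A successful TCP connection means the host is reachable and the
    # port is open; remaining failures are likely credential-related.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"{HOST}:{PORT} is reachable")
except OSError as e:
    print(f"Cannot reach {HOST}:{PORT}: {e} (check your VPN/network)")
```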
**Use an alternative password method:**

```python
# Method 1: Custom password file location
cohort = Cohort(
    sources={
        "cub_hdp": {
            "password_file": "/path/to/your/password.txt"
        }
    }
)

# Method 2: Interactive password prompt
cohort = Cohort(
    sources={
        "cub_hdp": {
            "password_file": False  # Will prompt for the password
        }
    }
)
```
## Variable Loading Issues

### Problem: Variable not found

**Error:** `ValueError: Variable 'my_variable' not found`

**Solutions:**
**Check the variable name's spelling:**

```python
# Use the variable explorer to find correct names: http://10.32.42.190:8080/
# Or list the variables that are available:
from corr_vars.sources.cub_hdp.mapping import VARS

available_vars = list(VARS["variables"].keys())
print(f"Available variables: {len(available_vars)}")

# Search for similar variable names
search_term = "sodium"
matching = [v for v in available_vars if search_term.lower() in v.lower()]
print(f"Variables containing '{search_term}': {matching}")
```
**Check whether the variable exists for your observation level:**

```python
# Some variables are only available for certain observation levels
cohort_icu = Cohort(obs_level="icu_stay", load_default_vars=False)
cohort_hospital = Cohort(obs_level="hospital_stay", load_default_vars=False)

# Try adding the variable at different observation levels
try:
    cohort_icu.add_variable("your_variable")
    print("Variable available for ICU stays")
except Exception:
    print("Variable not available for ICU stays")
```
### Problem: Variable extraction takes too long

**Solutions:**
**Use time constraints to limit the data:**

```python
# Instead of loading all data (might be slow):
cohort.add_variable("blood_sodium")

# Load only a window around ICU admission:
cohort.add_variable(
    "blood_sodium",
    tmin=("icu_admission", "-24h"),
    tmax=("icu_admission", "+48h"),
)
```
**Filter your cohort first:**

```python
# Filter the cohort before adding expensive variables
cohort = Cohort(
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": "_d2",  # Only 2 months
        }
    }
)

# Apply inclusion criteria early
cohort.include_list([
    {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])

# Then add expensive variables
cohort.add_variable("complex_variable")
```
**Use `load_default_vars=False` for large cohorts:**

```python
# Don't load all default variables automatically
cohort = Cohort(obs_level="icu_stay", load_default_vars=False)

# Add only the variables you need
needed_vars = ["age_on_admission", "sex", "inhospital_death"]
for var in needed_vars:
    cohort.add_variable(var)
```
## Memory Issues

### Problem: Out of memory errors

**Error messages you might see:**

- `MemoryError: Unable to allocate array`
- Kernel died (in Jupyter)

**Solutions:**
**Work with smaller cohorts:**

```python
# Use date filters to reduce the cohort size
cohort = Cohort(
    sources={
        "cub_hdp": {
            "filters": "c_aufnahme >= '2023-01-01' AND c_aufnahme < '2023-07-01'"
        }
    }
)
```
**Process variables in batches:**

```python
# Instead of loading all variables at once, load them in smaller batches
all_vars = ["var1", "var2", "var3", "var4", "var5"]

batch_size = 2
for i in range(0, len(all_vars), batch_size):
    batch = all_vars[i:i + batch_size]
    for var in batch:
        cohort.add_variable(var)
    # Optionally save intermediate results
    cohort.save(f"cohort_batch_{i // batch_size}.corr2")
```
**Use Polars instead of pandas (new API):**

```python
# The new Polars-based API is more memory efficient
from corr_vars import Cohort

# Instead of the legacy pandas API:
# from corr_vars.legacy_v1 import Cohort
```
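If downstream code still expects pandas, convert only the frame you need rather than falling back to the legacy API. This assumes `cohort.obs` is a Polars `DataFrame`, as in the other examples in this guide (Polars' `to_pandas()` requires `pyarrow` to be installed):

```python
# Convert a single Polars frame to pandas for downstream tooling
df_pandas = cohort.obs.to_pandas()
```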
**Monitor memory usage:**

```python
import os

import psutil

def check_memory():
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Memory usage: {memory_mb:.1f} MB")

# Check memory before and after operations
check_memory()
cohort.add_variable("large_variable")
check_memory()
```
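To see which loaded frames dominate memory, Polars can report each frame's estimated in-memory size. This sketch assumes `cohort.obsm` behaves like a dict (with `.items()`), as the save/load examples elsewhere in this guide suggest:

```python
# Estimated in-memory size of the observation table and each dynamic frame
print(f"obs: {cohort.obs.estimated_size('mb'):.1f} MB")
for name, frame in cohort.obsm.items():
    print(f"obsm['{name}']: {frame.estimated_size('mb'):.1f} MB")
```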
## Data Quality Issues

### Problem: Unexpected or missing values

**Solutions:**
**Inspect data quality:**

```python
import polars as pl

# Check for missing data
missing_counts = cohort.obs.null_count()
print("Missing data per column:")
print(missing_counts)

# Check data ranges
numeric_cols = cohort.obs.select(pl.col(pl.NUMERIC_DTYPES)).columns
for col in numeric_cols:
    stats = cohort.obs[col].describe()
    print(f"\n{col} statistics:")
    print(stats)
```
**Validate time-series data:**

```python
import polars as pl

# Check dynamic variable data quality
sodium_data = cohort.obsm["blood_sodium"]
print(f"Total sodium measurements: {len(sodium_data)}")
print(f"Patients with sodium data: {sodium_data['icu_stay_id'].n_unique()}")
print(f"Value range: {sodium_data['value'].min()} - {sodium_data['value'].max()}")

# Check for physiologically implausible values
implausible = sodium_data.filter(
    (pl.col("value") < 100) | (pl.col("value") > 200)
)
print(f"Implausible sodium values: {len(implausible)}")
```
**Apply data cleaning:**

```python
from corr_vars.sources.cub_hdp import Variable

# Add cleaning rules when extracting a variable
clean_sodium = Variable(
    var_name="blood_sodium_clean",
    table="it_ishmed_labor",
    where="c_katalog_leistungtext LIKE '%atrium%'",
    value_dtype="DOUBLE",
    cleaning={"value": {"low": 120, "high": 180}},  # Physiologically plausible range
    dynamic=True,
)
cohort.add_variable(clean_sodium)
```
## Custom Variable Issues

### Problem: Custom function not working

**Error:** `AttributeError: module has no attribute 'my_function'`

**Solutions:**
**Check the function's definition location:**

```python
# Functions must be defined in the appropriate variables.py file.
# For CUB-HDP: src/corr_vars/sources/cub_hdp/mapping/variables.py

# Check whether the function exists
import corr_vars.sources.cub_hdp.mapping.variables as var_funcs
print(dir(var_funcs))  # List all available functions
```
**Test your custom function separately:**

```python
import polars as pl

from corr_vars.sources.cub_hdp import Variable

# Test your function with sample data
def my_custom_function(var, cohort):
    # Your function logic here
    print(f"Processing variable: {var.var_name}")
    print(f"Required variables: {var.requires}")
    print(f"Cohort size: {len(cohort.obs)}")
    # Return test data
    return pl.DataFrame({
        "icu_stay_id": cohort.obs["icu_stay_id"],
        "test_value": [1.0] * len(cohort.obs),
    })

# Test with a simple variable
test_var = Variable(
    var_name="test_custom",
    requires=["age_on_admission"],
    complex=True,
    dynamic=False,
    py=my_custom_function,
)
cohort.add_variable(test_var)
```
## Performance Issues

### Problem: Queries are very slow

**Solutions:**
**Optimize database queries:**

```python
# Use specific filters to reduce the data volume
cohort = Cohort(
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
            "filters": """
                c_aufnahme >= '2023-01-01'
                AND c_aufnahme < '2024-01-01'
            """,
        }
    }
)
```
**Load variables strategically:**

```python
# Load cheap variables first for filtering
cohort.add_variable("age_on_admission")  # Fast
cohort.add_variable("sex")  # Fast

# Apply filters
cohort.include_list([
    {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])

# Then load expensive variables on the smaller cohort
cohort.add_variable("complex_time_series_variable")  # Slower
```
**Use caching:**

```python
# Save intermediate results
cohort.save("cohort_after_filtering.corr2")

# Load the saved cohort instead of re-extracting
cohort = Cohort.load("cohort_after_filtering.corr2")
```
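A common pattern is to rebuild the cohort only when no cached file exists, combining the `save`/`load` calls above with a standard-library existence check:

```python
import os

cache_path = "cohort_after_filtering.corr2"
if os.path.exists(cache_path):
    # Reuse the cached cohort
    cohort = Cohort.load(cache_path)
else:
    # Build it once, then cache it for the next run
    cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
    cohort.add_variable("age_on_admission")
    cohort.include_list([
        {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
    ])
    cohort.save(cache_path)
```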
## File and Save/Load Issues

### Problem: Cannot save or load cohort files

**Solutions:**
**Check file permissions and paths:**

```bash
# Check whether the current directory is writable
touch test_file.txt
rm test_file.txt
```

```python
# Use absolute paths
import os

save_path = os.path.abspath("my_cohort.corr2")
print(f"Saving to: {save_path}")
```
**Handle large file sizes:**

```python
# For very large cohorts, save without some variables
large_vars = ["variable_with_millions_of_rows"]

# Temporarily remove the large variables
saved_data = {}
for var in large_vars:
    if var in cohort.obsm:
        saved_data[var] = cohort._obsm.pop(var)

# Save the smaller cohort
cohort.save("cohort_without_large_vars.corr2")

# Restore the variables
cohort._obsm.update(saved_data)
```
**Migrate from old file formats:**

```python
# Load the old format and save it in the new one
try:
    old_cohort = Cohort.load("old_file.corr")  # Legacy format
    old_cohort.save("new_file.corr2")  # New format
    print("Successfully migrated file format")
except Exception as e:
    print(f"Migration failed: {e}")
```
## Getting Help

When you encounter an issue not covered here:
**Check the logs:**

```python
import logging

# Enable verbose logging
logging.basicConfig(level=logging.DEBUG)

# Or configure CORR-Vars logging directly
cohort = Cohort(
    logger_args={
        "level": logging.DEBUG,
        "colored_output": True,
    }
)
```
**Create a minimal reproducible example:**

```python
# Simplify your code to the minimal case that reproduces the error
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "filters": "_d1"}},
)

# Try the operation that is failing
cohort.add_variable("problematic_variable")
```
**Gather system information:**

```python
# Print debug information
cohort.debug_print()

# Check versions
import corr_vars
import pandas as pd
import polars as pl

print(f"CORR-Vars version: {corr_vars.__version__}")
print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")
```
**Contact support:**

- Create an issue on GitHub: cub-corr/corr-vars#issues
- Include your minimal reproducible example
- Include the error message and debug information
- Describe what you expected to happen vs. what actually happened
**Remember:** Most issues can be resolved by starting with a smaller dataset and gradually increasing complexity!