Data Sources#

CORR-Vars supports multiple data sources for extracting clinical data. Each source provides specialized access to different types of healthcare data.

Available Sources#

Overview#

Data sources in CORR-Vars are modular components that handle the extraction and preprocessing of clinical data from different healthcare systems. Each source implements a standardized interface while providing source-specific optimizations and features.
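To illustrate what a standardized source interface could look like, here is a minimal sketch. The names `DataSource`, `InMemorySource`, `has_variable`, and `extract` are hypothetical and chosen for illustration only; they are not taken from the CORR-Vars codebase.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataSource(ABC):
    """Hypothetical base class; NOT the actual CORR-Vars interface."""

    name: str

    @abstractmethod
    def has_variable(self, var_name: str) -> bool:
        """Return True if this source can provide the variable."""

    @abstractmethod
    def extract(self, var_name: str) -> List[Dict[str, Any]]:
        """Return the rows for one variable."""


class InMemorySource(DataSource):
    """Toy source backed by a dict, standing in for a real backend."""

    def __init__(self, name: str, tables: Dict[str, List[Dict[str, Any]]]):
        self.name = name
        self._tables = tables

    def has_variable(self, var_name: str) -> bool:
        return var_name in self._tables

    def extract(self, var_name: str) -> List[Dict[str, Any]]:
        # Tag every row with the name of the source it came from.
        return [{**row, "data_source": self.name} for row in self._tables[var_name]]


demo_source = InMemorySource("demo", {"heart_rate": [{"case_id": 1, "value": 72}]})
demo_rows = demo_source.extract("heart_rate")
```

Because every source exposes the same two methods in this sketch, calling code can treat a Hadoop-backed source and a file-backed source identically.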

CUB-HDP (Charité University Berlin - Health Data Platform)

Primary data source providing access to Charité’s clinical data warehouse through Hadoop/Impala infrastructure.

ReprodICU (Reproductive ICU)

Specialized source for reproductive and perinatal intensive care data.

Multi-Source Configuration#

You can combine multiple data sources in a single cohort:

from corr_vars import Cohort

# Configure multiple sources
cohort = Cohort(
    obs_level="icu_stay",
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
        }
    }
)

# Data source tracking
print(cohort.obs["data_source"].value_counts())
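Conceptually, the `data_source` column records which source each observation came from. The following standard-library sketch mimics that tagging and the resulting per-source counts. The helper `combine_observations` and the sample IDs are made up for illustration, and `collections.Counter` stands in for the `value_counts()` call above; none of this reflects how CORR-Vars merges sources internally.

```python
from collections import Counter


def combine_observations(per_source):
    """Concatenate per-source rows, tagging each row with its source name."""
    combined = []
    for source_name, rows in per_source.items():
        for row in rows:
            combined.append({**row, "data_source": source_name})
    return combined


# Hypothetical observations keyed by the source names used in the config above.
per_source = {
    "cub_hdp": [{"icu_stay_id": "a1"}, {"icu_stay_id": "a2"}],
    "reprodicu": [{"icu_stay_id": "b1"}],
}

combined = combine_observations(per_source)
source_counts = Counter(row["data_source"] for row in combined)
```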

Source-Specific Variables#

Different sources may provide different variables. The variable loader automatically handles source selection:

# This variable might be available from CUB-HDP only
cohort.add_variable("apache_ii_score")

# Add the variable before inspecting it in cohort.obsm
cohort.add_variable("blood_pressure")

# Variables are automatically tagged with their source
print("Available sources for this variable:")
print(cohort.obsm["blood_pressure"]["data_source"].unique())

Variable Loader#

The variable loader provides the core functionality for loading variables from configured sources:

corr_vars.sources.var_loader.load_variable(var_name, cohort, tmin=None, tmax=None)

Iterates through the cohort's configured sources and retrieves the variable from each one.

Parameters:
  • var_name (str) – The name of the variable to load.

  • cohort (Cohort) – The cohort to load the variable for.

  • tmin (TimeBoundColumn | None) – Minimum time for extraction (default: None).

  • tmax (TimeBoundColumn | None) – Maximum time for extraction (default: None).

Returns:

The loaded variable ready for extraction.

Return type:

MultiSourceVariable
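The "iterate through sources" behavior described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the library's implementation: `MultiSourceResult` is a hypothetical stand-in for `MultiSourceVariable`, each source is modeled as a plain function, and `tmin`/`tmax` are absolute `datetime` values rather than `TimeBoundColumn` objects.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[datetime, float]
Fetcher = Callable[[str], Optional[List[Record]]]


@dataclass
class MultiSourceResult:
    """Hypothetical stand-in for MultiSourceVariable."""

    var_name: str
    by_source: Dict[str, List[Record]] = field(default_factory=dict)


def load_from_sources(var_name, sources: Dict[str, Fetcher], tmin=None, tmax=None):
    """Ask every source for the variable and keep records inside [tmin, tmax]."""
    result = MultiSourceResult(var_name)
    for source_name, fetch in sources.items():
        records = fetch(var_name)
        if records is None:  # this source does not provide the variable
            continue
        result.by_source[source_name] = [
            (ts, value)
            for ts, value in records
            if (tmin is None or ts >= tmin) and (tmax is None or ts <= tmax)
        ]
    return result


def source_a(var_name):
    # Toy source that only knows the variable "ph".
    if var_name == "ph":
        return [(datetime(2024, 1, 1, 8), 7.31), (datetime(2024, 1, 2, 8), 7.40)]
    return None


def source_b(var_name):
    # Toy source that provides no variables at all.
    return None


result = load_from_sources(
    "ph", {"source_a": source_a, "source_b": source_b}, tmin=datetime(2024, 1, 2)
)
```

In this sketch, sources that cannot provide the variable are skipped, so the result contains entries only for the sources that actually returned data.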

Base Sources Module#