Data Sources#

CORR-Vars supports multiple data sources for extracting clinical data. Each source provides specialized access to different types of healthcare data.

Overview#

Data sources in CORR-Vars are modular components that handle the extraction and preprocessing of clinical data from different healthcare systems. Each source implements a standardized interface while providing source-specific optimizations and features.

CUB-HDP (Charité University Berlin - Health Data Platform): Primary data source providing access to Charité’s clinical data warehouse through Hadoop/Impala infrastructure.
CUB-HDP Dummy: Fully synthetic data source that mirrors CUB-HDP’s variable catalogue. It requires no database connection, which makes it well suited to development, testing, and demonstrations.
ReprodicU (Reproductive ICU): Specialized source for reproductive and perinatal intensive care data.
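
The idea of a shared interface with source-specific behaviour can be sketched as follows. This is an illustrative model only, not CORR-Vars code: `DataSource`, `DummySource`, `extract`, and the `heart_rate` catalogue entry are hypothetical names invented for the example.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical common interface; not the actual CORR-Vars base class."""

    name: str

    @abstractmethod
    def extract(self, var_name: str) -> list[dict]:
        """Return rows for var_name, each tagged with the source name."""


class DummySource(DataSource):
    """Synthetic source: serves fixed in-memory rows, no database needed."""

    name = "dummy"

    def __init__(self, catalogue: dict[str, list[float]]):
        self._catalogue = catalogue

    def extract(self, var_name: str) -> list[dict]:
        # Unknown variables yield an empty result rather than an error.
        values = self._catalogue.get(var_name, [])
        return [{"value": v, "data_source": self.name} for v in values]


source = DummySource({"heart_rate": [72.0, 88.0]})
rows = source.extract("heart_rate")
```

Because every source exposes the same method in this sketch, calling code does not need to know whether the rows came from a database or from synthetic data.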

Multi-Source Configuration#

You can combine multiple data sources in a single cohort:

from corr_vars import Cohort

# Configure multiple sources
cohort = Cohort(
    obs_level="icu_stay",
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
        }
    }
)

# Data source tracking
print(cohort.obs["data_source"].value_counts())
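
The last line above counts observations per source. The type of `cohort.obs` is not stated on this page, so the following stand-alone sketch reproduces the same tally with plain Python records. The `obs` list and the `icu_stay_id` field are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for cohort.obs: one record per ICU stay.
obs = [
    {"icu_stay_id": 1, "data_source": "cub_hdp"},
    {"icu_stay_id": 2, "data_source": "cub_hdp"},
    {"icu_stay_id": 3, "data_source": "reprodicu"},
]

# Tally how many observations each configured source contributed.
counts = Counter(row["data_source"] for row in obs)
for source_name, n in counts.most_common():
    print(f"{source_name}: {n}")
```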

Source-Specific Variables#

Different sources may provide different variables. The variable loader automatically handles source selection:

# This variable might be available from CUB-HDP only
cohort.add_variable("apache_ii_score")

# Add a variable before inspecting it in cohort.obsm
cohort.add_variable("blood_pressure")

# Variables are automatically tagged with their source
print("Available sources for this variable:")
print(cohort.obsm["blood_pressure"]["data_source"].unique())

Variable Loader#

The variable loader provides the core functionality for loading variables from configured sources:

corr_vars.sources.var_loader.load_variable(var_name, cohort, time_window, include_sources=None)

Load var_name from all configured sources and wrap it in a MultiSourceVariable.

Orchestrates config loading, project-override merging, time-window transfer, compatibility checks, and per-source variable instantiation.

Parameters:

var_name (str) – Name of the variable to load.
cohort (Cohort) – Cohort that supplies source configs and project overrides.
time_window (TimeWindow) – Time window for variable extraction.
include_sources (Iterable[str] | None) – Restrict to these sources. All cohort sources are used when None.

Returns:

Variable ready for extraction.

Return type:

MultiSourceVariable

Raises:

VariableNotFoundError – If var_name is not found in any source config, or if no source variable could be instantiated.
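
The effect of `include_sources` and the error condition can be modelled with a small stand-alone sketch. It is a simplified assumption about the selection step only, not the library's implementation: `select_sources`, the `configs` mapping, and the local `VariableNotFoundError` class are hypothetical stand-ins, and the real loader also handles overrides, time windows, and compatibility checks.

```python
from typing import Iterable, Optional


class VariableNotFoundError(KeyError):
    """Stand-in for the exception named above; its real base class is not documented here."""


def select_sources(
    var_name: str,
    source_configs: dict,
    include_sources: Optional[Iterable[str]] = None,
) -> list:
    """Return the names of the sources whose config defines var_name."""
    # None means "use every configured source".
    allowed = None if include_sources is None else set(include_sources)
    selected = [
        name
        for name, variables in source_configs.items()
        if (allowed is None or name in allowed) and var_name in variables
    ]
    if not selected:
        raise VariableNotFoundError(var_name)
    return selected


# Hypothetical per-source variable catalogues.
configs = {
    "cub_hdp": {"apache_ii_score", "blood_pressure"},
    "reprodicu": {"blood_pressure"},
}

all_sources = select_sources("blood_pressure", configs)
only_reprodicu = select_sources(
    "blood_pressure", configs, include_sources=["reprodicu"]
)
```

In this model, requesting a variable that no permitted source defines raises the error instead of returning an empty result, which mirrors the documented behaviour of failing loudly when nothing can be instantiated.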
