Data Sources
CORR-Vars supports multiple data sources for extracting clinical data. Each source provides specialized access to different types of healthcare data.
Available Sources
Overview
Data sources in CORR-Vars are modular components that handle the extraction and preprocessing of clinical data from different healthcare systems. Each source implements a standardized interface while providing source-specific optimizations and features.
- CUB-HDP (Charité University Berlin - Health Data Platform)
Primary data source providing access to Charité’s clinical data warehouse through Hadoop/Impala infrastructure.
- CUB-HDP Dummy
Fully synthetic data source that mirrors CUB-HDP’s variable catalogue. No database connection is required, which makes it ideal for development, testing, and demonstrations.
- ReprodicU (Reproductive ICU)
Specialized source for reproductive and perinatal intensive care data.
Multi-Source Configuration
You can combine multiple data sources in a single cohort:
from corr_vars import Cohort

# Configure multiple sources
cohort = Cohort(
    obs_level="icu_stay",
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
        },
    },
)

# Data source tracking
print(cohort.obs["data_source"].value_counts())
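To make the idea of data source tracking concrete without a database connection, the following is a minimal standalone sketch, not the library's implementation. It assumes that combining sources amounts to concatenating per-source rows and tagging each row with its origin. The row contents, the `icu_stay_id` field, and the `combine_sources` helper are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical per-source observation rows; the IDs are made up for illustration.
rows_by_source = {
    "cub_hdp": [{"icu_stay_id": 1}, {"icu_stay_id": 2}, {"icu_stay_id": 3}],
    "reprodicu": [{"icu_stay_id": 101}, {"icu_stay_id": 102}],
}


def combine_sources(rows_by_source):
    """Concatenate rows from every source, tagging each row with its origin."""
    combined = []
    for source_name, rows in rows_by_source.items():
        for row in rows:
            # Copy the row so the input data is left untouched.
            combined.append({**row, "data_source": source_name})
    return combined


obs = combine_sources(rows_by_source)

# Rough equivalent of counting values in a "data_source" column.
source_counts = Counter(row["data_source"] for row in obs)
print(source_counts)
```

Keeping the origin as an ordinary column means any later filtering or grouping by source needs no special machinery.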
Source-Specific Variables
Different sources may provide different variables. The variable loader automatically handles source selection:
# This variable might be available from CUB-HDP only
cohort.add_variable("apache_ii_score")

# Variables are automatically tagged with their source.
# Add "blood_pressure" first so that it can be inspected in obsm.
cohort.add_variable("blood_pressure")
print("Available sources for blood_pressure:")
print(cohort.obsm["blood_pressure"]["data_source"].unique())
Variable Loader
The variable loader provides the core functionality for loading variables from configured sources:
- corr_vars.sources.var_loader.load_variable(var_name, cohort, time_window, include_sources=None)
Load var_name from all configured sources and wrap it in a MultiSourceVariable. Orchestrates config loading, project-override merging, time-window transfer, compatibility checks, and per-source variable instantiation.
- Parameters:
var_name (str) – Name of the variable to load.
cohort (Cohort) – Cohort that supplies source configs and project overrides.
time_window (TimeWindow) – Time window for variable extraction.
include_sources (Iterable[str] | None) – Restrict to these sources. All cohort sources are used when None.
- Returns:
Variable ready for extraction.
- Return type:
MultiSourceVariable
- Raises:
VariableNotFoundError – If var_name is not found in any source config, or if no source variable could be instantiated.
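The source-selection behaviour described for include_sources and VariableNotFoundError can be illustrated with a small standalone sketch. This is not the code of load_variable. It only mimics the documented contract under stated assumptions: the `SOURCE_CATALOGUE` mapping, the `resolve_sources` helper, and the local `VariableNotFoundError` class are hypothetical stand-ins so the example runs without corr_vars installed.

```python
from typing import Iterable, List, Optional


class VariableNotFoundError(KeyError):
    """Local stand-in for the library's exception so the sketch runs alone."""


# Hypothetical catalogue of which variables each source defines.
SOURCE_CATALOGUE = {
    "cub_hdp": {"apache_ii_score", "blood_pressure"},
    "reprodicu": {"blood_pressure"},
}


def resolve_sources(
    var_name: str,
    cohort_sources: Iterable[str],
    include_sources: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the sources that would supply var_name.

    All cohort sources are considered when include_sources is None;
    otherwise the search is restricted to the listed sources.
    """
    cohort_sources = list(cohort_sources)
    if include_sources is None:
        selected = cohort_sources
    else:
        wanted = set(include_sources)
        # Preserve the cohort's source order while filtering.
        selected = [s for s in cohort_sources if s in wanted]

    matches = [s for s in selected if var_name in SOURCE_CATALOGUE.get(s, set())]
    if not matches:
        raise VariableNotFoundError(
            f"{var_name!r} is not defined in any of: {selected}"
        )
    return matches


configured = ["cub_hdp", "reprodicu"]
print(resolve_sources("blood_pressure", configured))
print(resolve_sources("blood_pressure", configured, include_sources=["reprodicu"]))
```

In this sketch, restricting a CUB-HDP-only variable such as `apache_ii_score` to `reprodicu` raises the error, matching the documented case where no source variable can be instantiated.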