Data Sources#
CORR-Vars supports multiple data sources for extracting clinical data. Each source provides specialized access to different types of healthcare data.
Available Sources#
Overview#
Data sources in CORR-Vars are modular components that handle the extraction and preprocessing of clinical data from different healthcare systems. Each source implements a standardized interface while providing source-specific optimizations and features.
- CUB-HDP (Charité University Berlin - Health Data Platform)
Primary data source providing access to Charité’s clinical data warehouse through Hadoop/Impala infrastructure.
- ReprodicU (Reproductive ICU)
Specialized source for reproductive and perinatal intensive care data.
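As an illustration of what a "standardized interface" across sources could look like, here is a minimal sketch. The class names `DataSource` and `InMemorySource`, the methods `available_variables` and `extract`, and the `heart_rate` variable are all hypothetical and chosen for this example; they are not the actual CORR-Vars classes.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical common interface; not the actual CORR-Vars base class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def available_variables(self) -> set:
        """Return the names of the variables this source can provide."""

    @abstractmethod
    def extract(self, var_name: str) -> list:
        """Return rows for var_name, each tagged with the source name."""


class InMemorySource(DataSource):
    """Toy source backed by a dict, standing in for a real backend."""

    def __init__(self, name: str, records: dict) -> None:
        super().__init__(name)
        self._records = records  # maps variable name -> list of row dicts

    def available_variables(self) -> set:
        return set(self._records)

    def extract(self, var_name: str) -> list:
        if var_name not in self._records:
            raise KeyError(f"{self.name} does not provide {var_name!r}")
        # Copy each row and tag it so its origin stays traceable downstream.
        return [{**row, "data_source": self.name} for row in self._records[var_name]]


# Made-up data, for illustration only.
demo = InMemorySource("cub_hdp", {"heart_rate": [{"stay_id": 1, "value": 72}]})
rows = demo.extract("heart_rate")
```

Because every source exposes the same two methods, calling code can loop over sources without knowing which backend sits behind each one.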
Multi-Source Configuration#
You can combine multiple data sources in a single cohort:
from corr_vars import Cohort

# Configure multiple sources
cohort = Cohort(
    obs_level="icu_stay",
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "password_file": True,
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files",
        },
    },
)

# Data source tracking
print(cohort.obs["data_source"].value_counts())
Source-Specific Variables#
Different sources may provide different variables. The variable loader automatically handles source selection:
# This variable might be available from CUB-HDP only
cohort.add_variable("apache_ii_score")

# Variables are automatically tagged with their source
cohort.add_variable("blood_pressure")
print("Available sources for this variable:")
print(cohort.obsm["blood_pressure"]["data_source"].unique())
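The source selection and tagging described above can be pictured with a small, self-contained sketch. It uses plain dictionaries rather than CORR-Vars objects; the helpers `collect_variable` and `unique_sources` and all values are made up for illustration and do not reflect how the library is implemented.

```python
def collect_variable(var_name, sources):
    """Gather rows for var_name from every source that provides it."""
    rows = []
    for source_name, variables in sources.items():
        if var_name not in variables:
            continue  # this source does not offer the variable; skip it
        for row in variables[var_name]:
            # Tag each copied row with the source it came from.
            rows.append({**row, "data_source": source_name})
    return rows


def unique_sources(rows):
    """Return the sorted, de-duplicated source names found in rows."""
    return sorted({row["data_source"] for row in rows})


# Toy stand-in for two configured sources (values are invented).
toy_sources = {
    "cub_hdp": {
        "apache_ii_score": [{"stay_id": 1, "value": 18}],
        "blood_pressure": [{"stay_id": 1, "value": 120}],
    },
    "reprodicu": {
        "blood_pressure": [{"stay_id": 7, "value": 110}],
    },
}

apache_sources = unique_sources(collect_variable("apache_ii_score", toy_sources))
bp_sources = unique_sources(collect_variable("blood_pressure", toy_sources))
```

In this toy setup, a variable offered by only one source yields rows from that source alone, while a shared variable yields rows from both, each carrying its `data_source` tag.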
Variable Loader#
The variable loader provides the core functionality for loading variables from configured sources:
- corr_vars.sources.var_loader.load_variable(var_name, cohort, tmin=None, tmax=None)
  Iterates through the configured sources and retrieves the variable from each one.
  - Parameters:
    var_name (str) – The name of the variable to load.
    cohort (Cohort) – The cohort to load the variable for.
    tmin (TimeBoundColumn | None) – Minimum time for extraction (default: None).
    tmax (TimeBoundColumn | None) – Maximum time for extraction (default: None).
  - Returns:
    The loaded variable, ready for extraction.
  - Return type:
    MultiSourceVariable
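To make the role of `tmin` and `tmax` more concrete, the sketch below filters toy measurements to a per-stay time window. It assumes that a time bound names a column holding one timestamp per observation (suggested by the type name `TimeBoundColumn`, but not confirmed here) and that both bounds are inclusive. The function `within_bounds`, the keys `icu_admission` and `icu_discharge`, and all values are hypothetical.

```python
from datetime import datetime


def within_bounds(measurements, stays, tmin=None, tmax=None):
    """Keep measurements whose time lies inside each stay's [tmin, tmax] window.

    tmin and tmax name keys in the stay records (an assumption about how a
    time-bound column might be resolved); None leaves that side unbounded.
    """
    kept = []
    for m in measurements:
        stay = stays[m["stay_id"]]
        lower = stay[tmin] if tmin is not None else None
        upper = stay[tmax] if tmax is not None else None
        if lower is not None and m["time"] < lower:
            continue  # recorded before the window opens
        if upper is not None and m["time"] > upper:
            continue  # recorded after the window closes
        kept.append(m)
    return kept


# Invented example data: one stay and three measurements around it.
stays = {
    1: {
        "icu_admission": datetime(2024, 1, 1, 8),
        "icu_discharge": datetime(2024, 1, 3, 8),
    }
}
measurements = [
    {"stay_id": 1, "time": datetime(2023, 12, 31, 20), "value": 130},
    {"stay_id": 1, "time": datetime(2024, 1, 2, 9), "value": 118},
    {"stay_id": 1, "time": datetime(2024, 1, 4, 9), "value": 125},
]

in_icu = within_bounds(measurements, stays, tmin="icu_admission", tmax="icu_discharge")
unbounded = within_bounds(measurements, stays)
```

Leaving both bounds at `None` mirrors the documented defaults: no time restriction is applied and every measurement is kept.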