CUB-HDP Dummy Data Source
The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real CUB-HDP data source. No database connection is required, which makes it well suited to development, testing, and demonstration.
Overview
cub_hdp_dummy produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against cub_hdp_dummy runs unchanged against the real source once a database connection is available.
When to use it:
- Local development without VPN or database access
- Unit and integration testing
- Demonstrating CORR-Vars features to new users
- Benchmarking and profiling pipelines
Configuration
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp_dummy": {
            "size": 1000,  # approximate number of rows to generate
            "seed": 42,  # random seed for reproducibility
        }
    },
)
Both keys are optional:
| Key | Default | Description |
|---|---|---|
| size | … | Approximate number of observations. Accepts an … |
| seed | … | Integer random seed. Fix it for reproducible datasets; vary it to generate independent synthetic datasets. |
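The role of the seed can be illustrated without the library. The following is a minimal standard-library sketch, not the generator used by cub_hdp_dummy: `make_fake_ages` is a hypothetical helper, and the age distribution is an arbitrary assumption chosen only for the demonstration.

```python
import random


def make_fake_ages(size: int, seed: int) -> list[int]:
    """Draw `size` synthetic ages from a private, seeded generator."""
    rng = random.Random(seed)  # isolated RNG, global random state is untouched
    # Assumed distribution: normal around 65 years, clipped to 18-100.
    return [min(100, max(18, round(rng.gauss(65, 15)))) for _ in range(size)]


same_a = make_fake_ages(size=5, seed=42)
same_b = make_fake_ages(size=5, seed=42)
other = make_fake_ages(size=5, seed=43)

print(same_a == same_b)  # same seed: the two lists are identical
print(same_a == other)   # different seed: an independent draw
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the result independent of whatever other code does with the global random state.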
Generated Cohort Data
The columns available in cohort.obs depend on obs_level:
| obs_level | Columns in cohort.obs |
|---|---|
| Patient level | … |
| Hospital-stay level | All patient columns plus … |
| ICU-stay level (icu_stay) | All hospital-stay columns plus … |
Variables
Static variables (e.g. age_on_admission, inhospital_death) are
available immediately after cohort creation. Dynamic variables reuse the same
variable names as CUB-HDP and are populated by DummyDynamic objects that
sample from configurable distributions.
# Default variables work out of the box
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
)
cohort.add_variable("blood_sodium")  # dynamic: synthetic time series
cohort.add_variable("age_on_admission")  # static: already in cohort.obs
Custom Synthetic Variables
DummyDynamic lets you define custom synthetic time-series with full control
over the value distribution, measurement frequency, and missingness.
Numeric Variables
from corr_vars.sources.cub_hdp_dummy import DummyDynamic

# Simulate hourly heart rate measurements (mean 80 bpm, IQR 68-92, limited to 40-200 bpm)
heart_rate_sim = DummyDynamic(
    var_name="heart_rate_sim",
    data_type="numeric",
    mean=80.0,
    q1=68.0,
    q3=92.0,
    value_min=40.0,
    value_max=200.0,
    decimals=0,
    density=24.0,  # ~24 measurements per day for patients who have heart rate data
    missingness=0.05,  # 5 % of patients have no measurements
)
cohort.add_variable(heart_rate_sim)
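One plausible reading of mean, q1, q3, value_min, value_max, and decimals is a normal distribution whose spread is derived from the interquartile range, then clipped and rounded. For a normal distribution the IQR is about 1.349 standard deviations, so the heart rate example corresponds to a standard deviation of roughly (92 - 68) / 1.349 ≈ 17.8 bpm. The sketch below works under that assumption only; how DummyDynamic actually samples is not documented here, and `sample_numeric` is a hypothetical helper.

```python
import random
import statistics


def sample_numeric(n, mean, q1, q3, value_min, value_max, decimals, seed=0):
    """Sample n values from a normal matched to mean and IQR, then clip and round."""
    rng = random.Random(seed)
    # Assumption: normal distribution, for which IQR is about 1.349 * sigma.
    sigma = (q3 - q1) / 1.349
    values = []
    for _ in range(n):
        x = rng.gauss(mean, sigma)
        x = min(value_max, max(value_min, x))  # clip to the allowed range
        values.append(round(x, decimals))
    return values


heart_rates = sample_numeric(
    n=10_000, mean=80.0, q1=68.0, q3=92.0,
    value_min=40.0, value_max=200.0, decimals=0,
)
q1_hat, median_hat, q3_hat = statistics.quantiles(heart_rates, n=4)
print(round(statistics.mean(heart_rates), 1), q1_hat, median_hat, q3_hat)
```

Comparing the empirical quartiles with the requested q1 and q3 is a quick sanity check that a chosen parameter set describes the distribution you intended.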
Boolean Variables
# Mechanical ventilation flag, present in 40 % of stays
ventilation_sim = DummyDynamic(
    var_name="ventilation_sim",
    data_type="boolean",
    true_freq=0.4,
    density=2.0,
    missingness=0.1,
)
cohort.add_variable(ventilation_sim)
Categorical Variables
# Sedation agent with weighted categories
sedation_sim = DummyDynamic(
    var_name="sedation_agent_sim",
    data_type="string",
    categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
    density=1.0,
    missingness=0.6,
)
cohort.add_variable(sedation_sim)
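To build intuition for how category weights and missingness interact, the sketch below simulates them with the standard library. It assumes that missingness removes whole stays and that each remaining stay receives exactly one record, which simplifies density=1.0. It does not call DummyDynamic and may differ from its internals.

```python
import random
from collections import Counter

categories = {"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20}
missingness = 0.6  # fraction of stays without any record (assumed per stay)
n_stays = 10_000

rng = random.Random(0)
labels = list(categories)
weights = list(categories.values())

records = {}
for stay_id in range(n_stays):
    if rng.random() < missingness:
        continue  # this stay gets no measurement at all
    # Weighted draw of a single category for this stay.
    records[stay_id] = rng.choices(labels, weights=weights, k=1)[0]

counts = Counter(records.values())
covered = len(records) / n_stays
shares = {label: counts[label] / len(records) for label in labels}
print(f"stays with data: {covered:.2f}")
print(shares)
```

Under these assumptions about 40 % of stays carry a value, and the category shares among those stays approach the configured weights as the cohort grows.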
Interval Variables
Use observation_type="interval" for therapy-like variables that have both a
start and end time:
# Vasopressor infusion as an interval variable
vasopressor_sim = DummyDynamic(
    var_name="vasopressor_sim",
    data_type="boolean",
    true_freq=0.8,
    density=0.5,
    missingness=0.7,
    observation_type="interval",
    interval_min_s=3600,  # minimum 1 hour
    interval_max_s=86400 * 3,  # maximum 3 days
)
cohort.add_variable(vasopressor_sim)
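The meaning of interval_min_s and interval_max_s can be sketched as a start time plus a duration drawn between the two bounds. This is an illustration under assumed uniform sampling, not the library's implementation; `sample_interval` and the admission and discharge timestamps are hypothetical.

```python
import random
from datetime import datetime, timedelta


def sample_interval(rng, stay_start, stay_end, interval_min_s, interval_max_s):
    """Return a (start, end) pair whose duration lies within the configured bounds."""
    stay_seconds = (stay_end - stay_start).total_seconds()
    # Assumption: the start is uniform over the stay, the duration uniform in [min, max].
    start = stay_start + timedelta(seconds=rng.uniform(0, stay_seconds))
    duration = timedelta(seconds=rng.uniform(interval_min_s, interval_max_s))
    return start, start + duration


rng = random.Random(0)
icu_admission = datetime(2024, 1, 1, 8, 0)
icu_discharge = icu_admission + timedelta(days=5)

start, end = sample_interval(rng, icu_admission, icu_discharge, 3600, 86400 * 3)
print(start, end, end - start)
```

Note that this sketch does not truncate intervals at discharge, so an interval that starts late in the stay can extend past it. Whether the real generator truncates is not stated in this page.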
Combining with Real Sources
Use cub_hdp_dummy as a drop-in replacement while developing locally,
then switch to cub_hdp once a database connection is available:
import os

from corr_vars import Cohort

# Fall back to synthetic data when no password file is present.
USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

sources = (
    {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    if USE_DUMMY
    else {
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},
        }
    }
)

cohort = Cohort(obs_level="icu_stay", sources=sources)
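If the switch is needed in several scripts or in tests, the decision can be isolated in a small function that only builds the sources mapping. The sketch below is one possible arrangement: `pick_sources` is a hypothetical helper, and the `CORR_VARS_USE_DUMMY` environment variable is an invented override that CORR-Vars itself does not read.

```python
import os


def pick_sources(use_dummy: bool, size: int = 1000, seed: int = 42) -> dict:
    """Build the `sources` mapping for either the dummy or the real source."""
    if use_dummy:
        return {"cub_hdp_dummy": {"size": size, "seed": seed}}
    return {
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},
        }
    }


# Hypothetical override: CORR_VARS_USE_DUMMY=1 forces synthetic data.
force_dummy = os.environ.get("CORR_VARS_USE_DUMMY") == "1"
no_password = not os.path.exists(os.path.expanduser("~/password.txt"))
sources = pick_sources(use_dummy=force_dummy or no_password)
print(sources)
```

The resulting mapping can be passed as `Cohort(obs_level="icu_stay", sources=sources)`. Keeping the choice in a pure function makes both branches testable without a database.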