CUB-HDP Dummy Data Source#
The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real CUB-HDP data source. No database connection is required, making it ideal for development, testing, and demonstration.
Overview#
cub_hdp_dummy produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against cub_hdp_dummy runs unchanged against the real source once a database connection is available.
When to use it:
Local development without VPN or database access
Unit and integration testing
Demonstrating CORR-Vars features to new users
Benchmarking and profiling pipelines
Configuration#
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp_dummy": {
            "size": 1000,  # approximate number of rows to generate
            "seed": 42,  # random seed for reproducibility
        }
    },
)
Both keys are optional:
| Key | Default | Description |
|---|---|---|
| size | | Approximate number of observations. Accepts an |
| seed | | Integer random seed. Fix it for reproducible datasets; vary it to generate independent synthetic datasets. |
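The effect of the seed can be illustrated without CORR-Vars at all. The following is a minimal sketch using only the standard library; `synthetic_ages` and its 18 to 95 range are hypothetical stand-ins for a seeded generator, not part of cub_hdp_dummy.

```python
import random


def synthetic_ages(size: int, seed: int) -> list[int]:
    """Hypothetical seeded generator: same seed, same values."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randint(18, 95) for _ in range(size)]


run_a = synthetic_ages(1000, seed=42)
run_b = synthetic_ages(1000, seed=42)  # same seed: identical dataset
run_c = synthetic_ages(1000, seed=43)  # different seed: independent dataset

print(run_a == run_b)
print(run_a == run_c)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the result independent of any other code that draws random numbers.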
Generated Cohort Data#
The columns available in cohort.obs depend on obs_level:
| obs_level | Columns in cohort.obs |
|---|---|
| | |
| | All patient columns plus |
| | All hospital-stay columns plus |
Variables#
Static variables (e.g. age_on_admission, inhospital_death) are
available immediately after cohort creation. Dynamic variables reuse the same
variable names as CUB-HDP and are populated by DummyDynamic objects that
sample from configurable distributions.
# Default variables work out of the box
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
)
cohort.add_variable("blood_sodium")  # dynamic: synthetic time-series
cohort.add_variable("age_on_admission")  # static: already in cohort.obs
Custom Synthetic Variables#
DummyDynamic lets you define custom synthetic time-series with full control
over the value distribution, measurement frequency, and missingness.
Numeric Variables#
from corr_vars.sources.cub_hdp_dummy import DummyDynamic

# Simulate hourly heart rate measurements (mean 80 bpm, IQR 68–92 bpm, bounded to 40–200 bpm)
heart_rate_sim = DummyDynamic(
    var_name="heart_rate_sim",
    data_type="numeric",
    mean=80.0,
    q1=68.0,
    q3=92.0,
    value_min=40.0,
    value_max=200.0,
    decimals=0,
    density=24.0,  # ~24 measurements per patient per day for patients with data
    missingness=0.05,  # 5 % of patients have no measurements
)
cohort.add_variable(heart_rate_sim)
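How mean, q1, q3, value_min, value_max and decimals could translate into sampled values is sketched below. This is an illustration under stated assumptions, not the sampler DummyDynamic actually uses: it assumes a normal distribution, derives the standard deviation from the interquartile range (for a normal distribution, IQR ≈ 1.349 σ), clamps values to the bounds, and rounds. The function name `sample_numeric` is hypothetical.

```python
import random

IQR_TO_SIGMA = 1.349  # IQR of a normal distribution is about 1.349 * sigma


def sample_numeric(n, mean, q1, q3, value_min, value_max, decimals=None, seed=0):
    """Hypothetical sampler: normal draw, clamped to bounds, optionally rounded."""
    rng = random.Random(seed)
    sigma = (q3 - q1) / IQR_TO_SIGMA  # assumption: quartiles describe a normal
    values = []
    for _ in range(n):
        x = rng.gauss(mean, sigma)
        x = min(max(x, value_min), value_max)  # clamp into [value_min, value_max]
        if decimals is not None:
            x = round(x, decimals)  # round() with ndigits keeps a float
        values.append(x)
    return values


# Same settings as the heart rate example: sigma = (92 - 68) / 1.349, about 17.8 bpm
heart_rates = sample_numeric(
    1000, mean=80.0, q1=68.0, q3=92.0,
    value_min=40.0, value_max=200.0, decimals=0,
)
print(min(heart_rates), max(heart_rates))
```

Clamping piles a small amount of probability mass onto the bounds; a real implementation might instead resample out-of-range draws or use a skewed distribution.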
Boolean Variables#
# Mechanical ventilation flag: 40 % of generated values are True,
# and 10 % of patients have no measurements at all
ventilation_sim = DummyDynamic(
    var_name="ventilation_sim",
    data_type="boolean",
    true_freq=0.4,
    density=2.0,
    missingness=0.1,
)
cohort.add_variable(ventilation_sim)
Categorical Variables#
# Sedation agent with weighted categories
sedation_sim = DummyDynamic(
    var_name="sedation_agent_sim",
    data_type="string",
    categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
    density=1.0,
    missingness=0.6,
)
cohort.add_variable(sedation_sim)
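The class reference states that category weights may sum to 1.0 or 100 and are normalised automatically, and that a plain list is sampled uniformly. A minimal standard-library sketch of that behaviour follows; `normalise_weights` and `sample_categories` are hypothetical names, and the actual implementation may differ.

```python
import random


def normalise_weights(categories):
    """Turn a list or {category: weight} dict into probabilities summing to 1."""
    if isinstance(categories, dict):
        total = sum(categories.values())
        return {name: weight / total for name, weight in categories.items()}
    # A plain list means every category is equally likely.
    return {name: 1.0 / len(categories) for name in categories}


def sample_categories(categories, n, seed=0):
    """Draw n category labels according to the normalised weights."""
    probs = normalise_weights(categories)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=n)


# Weights given as percentages yield the same probabilities as fractions.
as_percent = normalise_weights({"Propofol": 55, "Midazolam": 25, "Dexmedetomidine": 20})
as_fraction = normalise_weights({"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20})
print(as_percent)
print(sample_categories(["Propofol", "Midazolam"], n=5))
```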
Interval Variables#
Use observation_type="interval" for therapy-like variables that have both a
start and end time:
# Vasopressor infusion as an interval variable
vasopressor_sim = DummyDynamic(
    var_name="vasopressor_sim",
    data_type="boolean",
    true_freq=0.8,
    density=0.5,
    missingness=0.7,
    observation_type="interval",
    interval_min_s=3600,  # minimum 1 hour
    interval_max_s=86400 * 3,  # maximum 3 days
)
cohort.add_variable(vasopressor_sim)
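For interval variables, recordtime_end is derived by adding a random duration to recordtime. The sketch below shows one way this could work, assuming the duration is drawn uniformly between interval_min_s and interval_max_s; the distribution is an assumption, and `add_interval_end` is a hypothetical helper rather than part of CORR-Vars.

```python
import random
from datetime import datetime, timedelta


def add_interval_end(recordtimes, interval_min_s=60, interval_max_s=86400, seed=0):
    """Pair each start time with an end time a random duration later."""
    rng = random.Random(seed)
    rows = []
    for start in recordtimes:
        # Assumption: uniform integer duration, both bounds inclusive.
        duration_s = rng.randint(interval_min_s, interval_max_s)
        rows.append((start, start + timedelta(seconds=duration_s)))
    return rows


starts = [datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 2, 14, 30)]
intervals = add_interval_end(starts, interval_min_s=3600, interval_max_s=86400 * 3)
for recordtime, recordtime_end in intervals:
    print(recordtime, "->", recordtime_end)
```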
Combining with Real Sources#
cub_hdp_dummy can be used as a drop-in replacement while developing locally,
and replaced with cub_hdp once a database connection is available:
import os

from corr_vars import Cohort

# Fall back to synthetic data when no password file is present
USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

sources = (
    {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    if USE_DUMMY
    else {
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},
        }
    }
)
cohort = Cohort(obs_level="icu_stay", sources=sources)
Class Reference#
- class corr_vars.sources.cub_hdp_dummy.extract.DummyDynamic(var_name, tmin=None, tmax=None, py=None, data_type=None, mean=None, value_min=None, value_max=None, q1=None, q3=None, decimals=None, missingness=0.0, density=None, true_freq=None, categories=None, observation_type='snapshot', interval_min_s=60, interval_max_s=86400)[source]#
Bases: Variable
Dummy dynamic variable that generates synthetic time-series data.
Supports four value types: numeric, boolean, string/categorical, datetime.
Patient selection and measurement frequency are controlled by two independent parameters:
missingness: fraction of patients who have no measurements at all (sampled once, randomly, before density is applied).
density: expected measurements per day for patients with data. Uses a zero-truncated Poisson so selected patients always get ≥ 1 row, and the count distribution is not distorted by the zero-inflation that a plain Poisson would introduce.
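The two-stage selection described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `sample_ztp` and `measurement_counts` are hypothetical names, the zero-truncated Poisson (ZTP) is sampled by inverse CDF, and the sketch only supports 0 < lambda < 700 to avoid floating-point underflow.

```python
import math
import random


def sample_ztp(lam: float, rng: random.Random) -> int:
    """Draw k >= 1 from a zero-truncated Poisson(lam) by inverse-CDF sampling."""
    if not 0.0 < lam < 700.0:
        raise ValueError("this sketch supports 0 < lam < 700")
    u = rng.random()
    k = 1
    # P(K = 1) = lam * exp(-lam) / (1 - exp(-lam)); -expm1(-lam) is 1 - exp(-lam)
    p = lam * math.exp(-lam) / -math.expm1(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k  # P(K = k) = P(K = k - 1) * lam / k
        cdf += p
    return k


def measurement_counts(durations_days, density, missingness, seed=0):
    """Per-patient row counts: Bernoulli missingness first, then ZTP counts."""
    rng = random.Random(seed)
    counts = []
    for days in durations_days:  # assumes every duration is positive
        if rng.random() < missingness:
            counts.append(0)  # patient has no measurements at all
        else:
            counts.append(sample_ztp(density * days, rng))  # always >= 1
    return counts


# density=1.0 is roughly daily, 0.143 roughly weekly, 24.0 roughly hourly
print(measurement_counts([0.5, 2.0, 7.0], density=1.0, missingness=0.05, seed=42))
```

Because the two stages are separate, a zero count can only come from the missingness draw, which keeps the share of patients without data equal to missingness regardless of how short a stay is.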
- Parameters:
var_name (str) – Name of the variable.
tmin (str | tuple[str, str] | None) – Minimum time bound column.
tmax (str | tuple[str, str] | None) – Maximum time bound column.
py (VariableCallable | None) – Optional Python function to apply to the variable post-extraction.
data_type (Optional[Literal['numeric', 'boolean', 'string', 'datetime']]) – One of "numeric", "boolean", "string", "datetime".
mean (float | None) – Mean of the distribution (numeric / datetime in epoch seconds).
value_min (float | None) – Minimum value (numeric / datetime in epoch seconds).
value_max (float | None) – Maximum value (numeric / datetime in epoch seconds).
q1 (float | None) – First quartile (numeric / datetime in epoch seconds).
q3 (float | None) – Third quartile (numeric / datetime in epoch seconds).
decimals (int | None) – Round numeric values to this many decimal places (optional).
missingness (float) – Fraction of patients with no measurements (0.0–1.0). 0.0 → all patients have data; 1.0 → no patients have data. Applied as a Bernoulli draw per patient before density sampling. Default: 0.0.
density (float | None) – Expected measurements per patient per day for patients who have data. Each selected patient’s ZTP lambda = density × patient_duration_days, guaranteeing ≥ 1 measurement. Examples: 1.0 → daily, 0.143 → weekly, 24.0 → hourly.
true_freq (float | None) – Frequency of True values for data_type="boolean" (0–1). Defaults to 0.5.
categories (list[str] | dict[str, float] | None) – For data_type="string":
list[str]: categories sampled with uniform probability.
dict[str, float]: {category: weight}; weights may sum to 1.0 or 100 and are normalised automatically.
observation_type (Literal['snapshot', 'interval']) –
"snapshot": each row has a single recordtime.
"interval": each row also has a recordtime_end, derived by adding a random duration to recordtime.
interval_min_s (int) – Minimum interval duration in seconds (default: 60).
interval_max_s (int) – Maximum interval duration in seconds (default: 86 400).
- extract(cohort)[source]#
Extracts data from the datasource. Usually follows this pattern:
    # Load & Extract required variables
    self._get_required_vars(cohort)

    # This should change self.data either by returning or side effect
    self.data = self._custom_extraction(cohort)

    # Calls variable function to transform extracted data
    self._call_var_function(cohort)

    if self.dynamic:
        self._add_time_window(cohort)
        # Expects case_tmin, case_tmax for each primary key
        self._timefilter()

    self._apply_cleaning()
- Parameters:
cohort (Cohort)
- Return type:
DataFrame