CUB-HDP Dummy Data Source

The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real CUB-HDP data source. No database connection is required, which makes it well suited to development, testing, and demonstration.

Overview#

cub_hdp_dummy produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against cub_hdp_dummy runs unchanged against the real source once a database connection is available.

When to use it:

Local development without VPN or database access
Unit and integration testing
Demonstrating CORR-Vars features to new users
Benchmarking and profiling pipelines

Configuration#

from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp_dummy": {
            "size": 1000,   # approximate number of rows to generate
            "seed": 42,     # random seed for reproducibility
        }
    }
)

Both keys are optional:

Key	Default	Description
`size`	`100_000`	Target number of observations; the generated count is approximate. Accepts an `int` for a fixed target or a `tuple[int, int]` to draw the target uniformly at random from the given range, e.g. `(500, 2000)`.
`seed`	`42`	Integer random seed. Fix it for reproducible datasets; vary it to generate independent synthetic datasets.
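
As a rough illustration of how the two keys interact, the sketch below resolves a `size` setting into a concrete row target with a seeded generator. The helper `resolve_size` is hypothetical and not part of CORR-Vars, and treating the tuple bounds as inclusive is an assumption.

```python
import random


def resolve_size(size, seed=42):
    """Hypothetical helper: turn a `size` setting into a concrete row target."""
    rng = random.Random(seed)  # seeded, so the same settings give the same target
    if isinstance(size, tuple):
        low, high = size
        # Assumption: both bounds are inclusive
        return rng.randint(low, high)
    return int(size)


fixed_target = resolve_size(1000)           # an int is used as-is
random_target = resolve_size((500, 2000))   # drawn once from the range
```

Calling `resolve_size((500, 2000))` twice with the same seed returns the same target, which is the behaviour the `seed` key is meant to guarantee.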

Generated Cohort Data#

The columns available in cohort.obs depend on obs_level:

Columns by observation level#
`obs_level`	Columns in `cohort.obs`
`"patient"`	`patient_id`, `age_on_admission`, `sex`, `birthdate`, `death_timestamp`, `inhospital_death`
`"hospital_stay"`	All patient columns plus `case_id`, `hospital_admission`, `hospital_discharge`, `hospital_status`
`"icu_stay"`	All hospital-stay columns plus `icu_stay_id`, `icu_admission`, `icu_discharge`, `icu_status`, `icu_id`, `icu_admission_diagnoses`, `bw_fach_oe`, `bw_patient_origin`

Variables#

Static variables (e.g. age_on_admission, inhospital_death) are available immediately after cohort creation. Dynamic variables reuse the same variable names as CUB-HDP and are populated by DummyDynamic objects that sample from configurable distributions.

# Default variables work out of the box
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
)
cohort.add_variable("blood_sodium")      # dynamic — synthetic time-series
cohort.add_variable("age_on_admission")  # static — already in cohort.obs

Custom Synthetic Variables#

DummyDynamic lets you define custom synthetic time-series with full control over the value distribution, measurement frequency, and missingness.

Numeric Variables#

from corr_vars.sources.cub_hdp_dummy import DummyDynamic

# Simulate hourly heart rate measurements (mean 80 bpm, IQR 68–92, bounded to 40–200)
heart_rate_sim = DummyDynamic(
    var_name="heart_rate_sim",
    data_type="numeric",
    mean=80.0,
    q1=68.0,
    q3=92.0,
    value_min=40.0,
    value_max=200.0,
    decimals=0,
    density=24.0,       # ~24 measurements per patient per day for patients with heart rate data
    missingness=0.05,   # 5 % of patients have no measurements
)
cohort.add_variable(heart_rate_sim)
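
To build intuition for how `mean`, `q1`, `q3`, `value_min`, `value_max`, and `decimals` could shape the generated values, here is a minimal standalone sketch. The distribution DummyDynamic actually uses is not stated here, so the normal distribution, the clamping to the bounds, and the helper name `sample_numeric` are all assumptions made for illustration.

```python
import random


def sample_numeric(n, mean, q1, q3, value_min, value_max, decimals=None, seed=42):
    """Hypothetical sampler: normal draws sized by the IQR, clamped to the bounds."""
    rng = random.Random(seed)
    # Assumption: normal distribution, whose IQR is about 1.349 * sigma
    sigma = (q3 - q1) / 1.349
    values = []
    for _ in range(n):
        value = rng.gauss(mean, sigma)
        # Clamp into [value_min, value_max]
        value = min(max(value, value_min), value_max)
        if decimals is not None:
            value = round(value, decimals)
        values.append(value)
    return values


heart_rates = sample_numeric(
    1000, mean=80.0, q1=68.0, q3=92.0,
    value_min=40.0, value_max=200.0, decimals=0,
)
```

With these settings every value is a whole number between 40 and 200, centred near 80.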

Boolean Variables#

# Mechanical ventilation flag: about 40 % of generated values are True
ventilation_sim = DummyDynamic(
    var_name="ventilation_sim",
    data_type="boolean",
    true_freq=0.4,
    density=2.0,
    missingness=0.1,
)
cohort.add_variable(ventilation_sim)

Categorical Variables#

# Sedation agent with weighted categories
sedation_sim = DummyDynamic(
    var_name="sedation_agent_sim",
    data_type="string",
    categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
    density=1.0,
    missingness=0.6,
)
cohort.add_variable(sedation_sim)
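
The class reference notes that category weights are normalised automatically and that a plain list is sampled uniformly. The sketch below shows one way that could work; `normalise_categories` is a hypothetical helper, not a CORR-Vars function.

```python
import random


def normalise_categories(categories):
    """Hypothetical helper: map a list or weight dict to probabilities summing to 1."""
    if isinstance(categories, dict):
        total = sum(categories.values())
        return {name: weight / total for name, weight in categories.items()}
    # A plain list gets uniform probabilities
    return {name: 1.0 / len(categories) for name in categories}


# Weights given as percentages end up as the same probabilities as fractions
weights = normalise_categories({"Propofol": 55, "Midazolam": 25, "Dexmedetomidine": 20})

rng = random.Random(0)
draws = rng.choices(list(weights), weights=list(weights.values()), k=5)
```

Because only the ratios matter, `{"Propofol": 55, ...}` and `{"Propofol": 0.55, ...}` describe the same distribution.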

Interval Variables#

Use observation_type="interval" for therapy-like variables that have both a start and end time:

# Vasopressor infusion as an interval variable
vasopressor_sim = DummyDynamic(
    var_name="vasopressor_sim",
    data_type="boolean",
    true_freq=0.8,
    density=0.5,
    missingness=0.7,
    observation_type="interval",
    interval_min_s=3600,       # minimum 1 hour
    interval_max_s=86400 * 3,  # maximum 3 days
)
cohort.add_variable(vasopressor_sim)
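
For interval variables, each row's end time is derived by adding a random duration to its start time. A minimal sketch of that step follows; the uniform draw between the two bounds and the helper name `add_interval_end` are assumptions, since the duration distribution is not documented here.

```python
import random
from datetime import datetime, timedelta


def add_interval_end(recordtime, interval_min_s=60, interval_max_s=86400, rng=None):
    """Hypothetical helper: derive an end time from a start time."""
    rng = rng or random.Random()
    # Assumption: duration is uniform over the inclusive range, in whole seconds
    duration_s = rng.randint(interval_min_s, interval_max_s)
    return recordtime + timedelta(seconds=duration_s)


start = datetime(2024, 1, 1, 8, 0)
end = add_interval_end(start, 3600, 86400 * 3, random.Random(42))
```

With the bounds from the vasopressor example, the resulting interval always lasts between one hour and three days.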

Combining with Real Sources#

cub_hdp_dummy can serve as a drop-in replacement during local development and be swapped for cub_hdp once a database connection is available:

import os
from corr_vars import Cohort

USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

sources = (
    {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    if USE_DUMMY
    else {"cub_hdp": {"database": "db_hypercapnia_prepared", "conn_args": {"password_file": True}}}
)

cohort = Cohort(obs_level="icu_stay", sources=sources)

Class Reference#

class corr_vars.sources.cub_hdp_dummy.extract.DummyDynamic(var_name, tmin=None, tmax=None, py=None, data_type=None, mean=None, value_min=None, value_max=None, q1=None, q3=None, decimals=None, missingness=0.0, density=None, true_freq=None, categories=None, observation_type='snapshot', interval_min_s=60, interval_max_s=86400)[source]#

Bases: Variable

Dummy dynamic variable that generates synthetic time-series data.

Supports four value types: numeric, boolean, string/categorical, datetime.

Patient selection and measurement frequency are controlled by two independent parameters:

missingness: fraction of patients who have no measurements at all (sampled once, randomly, before density is applied).
density: expected measurements per day for patients with data. Uses a zero-truncated Poisson so selected patients always get ≥ 1 row, and the count distribution is not distorted by the zero-inflation that a plain Poisson would introduce.
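
The two-stage scheme described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: `sample_ztp` and `measurement_counts` are hypothetical names, and the inverse-transform sampler is only suitable for small to moderate rates.

```python
import math
import random


def sample_ztp(lam, rng):
    """Draw from a zero-truncated Poisson(lam) by inverse-transform sampling."""
    if not 0 < lam < 700:
        raise ValueError("this sketch only supports 0 < lam < 700")
    u = rng.random()
    k = 1
    # P(K = 1) = lam * exp(-lam) / (1 - exp(-lam))
    p = lam * math.exp(-lam) / -math.expm1(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k  # P(K = k) = P(K = k - 1) * lam / k
        cdf += p
    return k


def measurement_counts(durations_days, density, missingness, seed=42):
    """Stage 1: Bernoulli draw per patient. Stage 2: ZTP count for selected patients."""
    rng = random.Random(seed)
    counts = {}
    for patient_id, days in durations_days.items():
        if rng.random() < missingness:
            counts[patient_id] = 0  # patient has no measurements at all
        else:
            counts[patient_id] = sample_ztp(density * days, rng)
    return counts


stays = {"p1": 0.5, "p2": 3.0, "p3": 10.0}  # stay length in days (made-up values)
counts = measurement_counts(stays, density=1.0, missingness=0.0)
```

Every patient who passes the missingness draw receives at least one row, even for very short stays, while `missingness` alone controls how many patients have no data.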

Parameters:

var_name (str) – Name of the variable.
tmin (str | tuple[str, str] | None) – Minimum time bound column.
tmax (str | tuple[str, str] | None) – Maximum time bound column.
py (VariableCallable | None) – Optional Python function to apply to the variable post-extraction.
data_type (Optional[Literal['numeric', 'boolean', 'string', 'datetime']]) – One of "numeric", "boolean", "string", "datetime".
mean (float | None) – Mean of the distribution (numeric / datetime in epoch seconds).
value_min (float | None) – Minimum value (numeric / datetime in epoch seconds).
value_max (float | None) – Maximum value (numeric / datetime in epoch seconds).
q1 (float | None) – First quartile (numeric / datetime in epoch seconds).
q3 (float | None) – Third quartile (numeric / datetime in epoch seconds).
decimals (int | None) – Round numeric values to this many decimal places (optional).
missingness (float) – Fraction of patients with no measurements (0.0–1.0). 0.0 → all patients have data; 1.0 → no patients have data. Applied as a Bernoulli draw per patient before density sampling. Default: 0.0.
density (float | None) – Expected measurements per patient per day for patients who have data. Each selected patient’s ZTP lambda = density × patient_duration_days, guaranteeing ≥ 1 measurement. Examples: 1.0 → daily, 0.143 → weekly, 24.0 → hourly.
true_freq (float | None) – Frequency of True values for data_type="boolean" (0–1). Defaults to 0.5.
categories (list[str] | dict[str, float] | None) –
For data_type="string":
- list[str]: categories sampled with uniform probability.
- dict[str, float]: {category: weight} — weights may sum to 1.0 or 100; they are normalised automatically.
observation_type (Literal['snapshot', 'interval']) – "snapshot" — each row has a single recordtime. "interval" — each row also has a recordtime_end, derived by adding a random duration to recordtime.
interval_min_s (int) – Minimum interval duration in seconds (default: 60).
interval_max_s (int) – Maximum interval duration in seconds (default: 86 400).

extract(cohort)[source]#

Extracts data from the data source. The implementation usually follows this pattern:

# Load & Extract required variables
self._get_required_vars(cohort)

# This should change self.data either by returning or side effect
self.data = self._custom_extraction(cohort)

# Calls variable function to transform extracted data
self._call_var_function(cohort)

if self.dynamic:
    self._add_time_window(cohort)
    # Expects case_tmin, case_tmax for each primary key
    self._timefilter()
    self._apply_cleaning()

Parameters: cohort (Cohort)
Return type: DataFrame