CUB-HDP Dummy Data Source#

The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real CUB-HDP data source. No database connection is required, which makes it well suited to development, testing, and demonstration.

Overview#

cub_hdp_dummy produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against cub_hdp_dummy runs unchanged against the real source once a database connection is available.

When to use it:

  • Local development without VPN or database access

  • Unit and integration testing

  • Demonstrating CORR-Vars features to new users

  • Benchmarking and profiling pipelines

Configuration#

from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp_dummy": {
            "size": 1000,   # approximate number of rows to generate
            "seed": 42,     # random seed for reproducibility
        }
    }
)

Both keys are optional:

| Key | Default | Description |
| --- | --- | --- |
| size | 100_000 | Approximate number of observations. Accepts an int for a fixed target or a tuple[int, int] for a target drawn uniformly at random from the given range, e.g. (500, 2000). |
| seed | 42 | Integer random seed. Fix it for reproducible datasets; vary it to generate independent synthetic datasets. |
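To make the two accepted forms of size concrete, here is a minimal standalone sketch of how an int or a (low, high) tuple could be turned into a row target. The helper resolve_size is hypothetical and is not part of corr_vars; it assumes the range is inclusive at both ends and that the draw is driven by seed, neither of which is confirmed above.

```python
import random
from typing import Tuple, Union

SizeSpec = Union[int, Tuple[int, int]]


def resolve_size(size: SizeSpec = 100_000, seed: int = 42) -> int:
    """Hypothetical helper: turn a `size` setting into a concrete row target."""
    if isinstance(size, tuple):
        low, high = size
        if low > high:
            raise ValueError("size range must be (low, high) with low <= high")
        # A seeded generator makes the drawn size repeatable for a given seed.
        # Assumption: both ends of the range are inclusive.
        return random.Random(seed).randint(low, high)
    return size


print(resolve_size(1000))             # an int is used as-is
print(resolve_size((500, 2000), 42))  # identical on every run with seed 42
```

Because the draw depends only on the seed, two runs with the same seed and range would target the same size, which is the property that makes generated datasets comparable across machines.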

Generated Cohort Data#

The columns available in cohort.obs depend on obs_level:

Columns by observation level#

| obs_level | Columns in cohort.obs |
| --- | --- |
| "patient" | patient_id, age_on_admission, sex, birthdate, death_timestamp, inhospital_death |
| "hospital_stay" | All patient columns plus case_id, hospital_admission, hospital_discharge, hospital_status |
| "icu_stay" | All hospital-stay columns plus icu_stay_id, icu_admission, icu_discharge, icu_status, icu_id, icu_admission_diagnoses, bw_fach_oe, bw_patient_origin |
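Since each level extends the previous one, the column sets can be written down cumulatively. The sketch below does this in plain Python and adds a small check you could use in tests. The names EXPECTED_COLUMNS and missing_columns are illustrative and not part of corr_vars.

```python
PATIENT_COLUMNS = [
    "patient_id", "age_on_admission", "sex",
    "birthdate", "death_timestamp", "inhospital_death",
]
HOSPITAL_STAY_COLUMNS = PATIENT_COLUMNS + [
    "case_id", "hospital_admission", "hospital_discharge", "hospital_status",
]
ICU_STAY_COLUMNS = HOSPITAL_STAY_COLUMNS + [
    "icu_stay_id", "icu_admission", "icu_discharge", "icu_status",
    "icu_id", "icu_admission_diagnoses", "bw_fach_oe", "bw_patient_origin",
]

# Documented columns per observation level (illustrative constant).
EXPECTED_COLUMNS = {
    "patient": PATIENT_COLUMNS,
    "hospital_stay": HOSPITAL_STAY_COLUMNS,
    "icu_stay": ICU_STAY_COLUMNS,
}


def missing_columns(obs_level, actual_columns):
    """Return the documented columns that are absent from actual_columns."""
    present = set(actual_columns)
    return [name for name in EXPECTED_COLUMNS[obs_level] if name not in present]


# Example: hospital-stay columns are missing from a patient-level column list.
print(missing_columns("hospital_stay", PATIENT_COLUMNS))
```

In a test you might call missing_columns("icu_stay", cohort.obs.columns) and assert that the result is empty; this assumes cohort.obs exposes a DataFrame-like columns attribute.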

Variables#

Static variables (e.g. age_on_admission, inhospital_death) are available immediately after cohort creation. Dynamic variables reuse the same variable names as CUB-HDP and are populated by DummyDynamic objects that sample from configurable distributions.

# Default variables work out of the box
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
)
cohort.add_variable("blood_sodium")      # dynamic — synthetic time-series
cohort.add_variable("age_on_admission")  # static — already in cohort.obs

Custom Synthetic Variables#

DummyDynamic lets you define custom synthetic time-series with full control over the value distribution, measurement frequency, and missingness.

Numeric Variables#

from corr_vars.sources.cub_hdp_dummy import DummyDynamic

# Simulate hourly heart rate measurements (mean 80 bpm, IQR 68–92 bpm, bounded to 40–200 bpm)
heart_rate_sim = DummyDynamic(
    var_name="heart_rate_sim",
    data_type="numeric",
    mean=80.0,
    q1=68.0,
    q3=92.0,
    value_min=40.0,
    value_max=200.0,
    decimals=0,
    density=24.0,       # ~24 measurements per patient per day for patients with heart rate data
    missingness=0.05,   # 5 % of patients have no measurements
)
cohort.add_variable(heart_rate_sim)
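The page does not say which distribution DummyDynamic samples from, so the following is only one plausible reading of the numeric parameters, not the library's implementation. It assumes a normal distribution whose spread is derived from the interquartile range, clips values to [value_min, value_max], and rounds to decimals places. The function sample_numeric is hypothetical.

```python
import random
from statistics import NormalDist


def sample_numeric(n, mean, q1, q3, value_min, value_max, decimals, seed=42):
    """Illustrative sampler under a normal-distribution assumption."""
    # For a normal distribution, q3 - q1 = 2 * z(0.75) * sigma (about 1.349 * sigma).
    sigma = (q3 - q1) / (2 * NormalDist().inv_cdf(0.75))
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        raw = rng.gauss(mean, sigma)
        clipped = min(max(raw, value_min), value_max)  # enforce the hard bounds
        values.append(round(clipped, decimals))
    return values


# One simulated day of hourly heart rates with the parameters used above.
heart_rates = sample_numeric(
    24, mean=80.0, q1=68.0, q3=92.0,
    value_min=40.0, value_max=200.0, decimals=0,
)
print(heart_rates)
```

With q1=68 and q3=92 this assumption gives a standard deviation of roughly 17.8 bpm, so the bounds of 40 and 200 would only rarely clip a value.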

Boolean Variables#

# Mechanical ventilation flag: True with frequency 0.4; 10 % of patients have no values
ventilation_sim = DummyDynamic(
    var_name="ventilation_sim",
    data_type="boolean",
    true_freq=0.4,
    density=2.0,
    missingness=0.1,
)
cohort.add_variable(ventilation_sim)
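To get a feel for how density, missingness, and true_freq interact, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not library behaviour: it assumes density means values per patient per day, missingness is the share of stays with no values, true_freq is the share of generated values that are True, and the mean stay length of 5 days is a made-up figure for illustration.

```python
def expected_counts(n_stays, density, missingness, true_freq, mean_stay_days):
    """Rough expected row counts for a boolean dummy variable (illustrative)."""
    stays_with_data = n_stays * (1 - missingness)
    n_values = stays_with_data * density * mean_stay_days
    return {
        "stays_with_data": stays_with_data,
        "values": n_values,
        "true_values": n_values * true_freq,
    }


# 500 stays with the ventilation_sim settings and a hypothetical 5-day mean stay.
estimate = expected_counts(
    n_stays=500, density=2.0, missingness=0.1, true_freq=0.4, mean_stay_days=5.0,
)
print(estimate)
```

Under these assumptions about 450 stays carry data, producing around 4,500 values, of which roughly 1,800 are True.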

Categorical Variables#

# Sedation agent with weighted categories
sedation_sim = DummyDynamic(
    var_name="sedation_agent_sim",
    data_type="string",
    categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
    density=1.0,
    missingness=0.6,
)
cohort.add_variable(sedation_sim)
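The categories mapping reads naturally as category-to-probability weights. The standalone sketch below shows weighted sampling with the standard library and a sanity check that the weights sum to 1. Whether DummyDynamic requires normalised weights is not stated here, so treat the check as an optional precaution; sample_categories is a hypothetical helper.

```python
import random
from collections import Counter

categories = {"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20}


def sample_categories(categories, n, seed=42):
    """Draw n category labels according to their weights (illustrative)."""
    total = sum(categories.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"category weights sum to {total}, expected 1.0")
    rng = random.Random(seed)
    return rng.choices(list(categories), weights=list(categories.values()), k=n)


draws = sample_categories(categories, 10_000)
# Observed counts should be close to 55 % / 25 % / 20 % of 10,000.
print(Counter(draws).most_common())
```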

Interval Variables#

Use observation_type="interval" for therapy-like variables that have both a start and end time:

# Vasopressor infusion as an interval variable
vasopressor_sim = DummyDynamic(
    var_name="vasopressor_sim",
    data_type="boolean",
    true_freq=0.8,
    density=0.5,
    missingness=0.7,
    observation_type="interval",
    interval_min_s=3600,       # minimum 1 hour
    interval_max_s=86400 * 3,  # maximum 3 days
)
cohort.add_variable(vasopressor_sim)
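As a rough illustration of what interval_min_s and interval_max_s control, the sketch below draws a start time inside a stay and a duration between the two bounds. It assumes durations are uniformly distributed and does not truncate intervals at the end of the stay; how DummyDynamic actually handles either point is not documented here. The function sample_interval is hypothetical.

```python
import random
from datetime import datetime, timedelta


def sample_interval(stay_start, stay_end, interval_min_s, interval_max_s, rng):
    """Draw one (start, end) pair for an interval observation (illustrative)."""
    stay_seconds = (stay_end - stay_start).total_seconds()
    start = stay_start + timedelta(seconds=rng.uniform(0, stay_seconds))
    # Assumption: duration is uniform between the minimum and maximum.
    duration = timedelta(seconds=rng.uniform(interval_min_s, interval_max_s))
    return start, start + duration


rng = random.Random(42)
icu_admission = datetime(2024, 1, 1, 8, 0)
icu_discharge = icu_admission + timedelta(days=7)

start, end = sample_interval(
    icu_admission, icu_discharge,
    interval_min_s=3600, interval_max_s=86400 * 3, rng=rng,
)
print(start, end, end - start)
```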

Combining with Real Sources#

cub_hdp_dummy works as a drop-in stand-in for cub_hdp during local development; switch to cub_hdp once a database connection is available. The example below picks the source depending on whether a password file exists:

import os
from corr_vars import Cohort

USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

sources = (
    {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    if USE_DUMMY
    else {"cub_hdp": {"database": "db_hypercapnia_prepared", "conn_args": {"password_file": True}}}
)

cohort = Cohort(obs_level="icu_stay", sources=sources)
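If the switch is needed in more than one script, it may be convenient to wrap it in a function so that both branches can be unit-tested without a database. The sketch below is one way to do that; select_sources is a hypothetical helper, and the configuration values are copied from the example above.

```python
import os
from pathlib import Path


def select_sources(password_file: Path) -> dict:
    """Return the real source config if the password file exists, else the dummy."""
    if password_file.exists():
        return {
            "cub_hdp": {
                "database": "db_hypercapnia_prepared",
                "conn_args": {"password_file": True},
            }
        }
    return {"cub_hdp_dummy": {"size": 1000, "seed": 42}}


sources = select_sources(Path(os.path.expanduser("~/password.txt")))
print(list(sources))
```

The returned dictionary can then be passed unchanged as Cohort(obs_level="icu_stay", sources=sources).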

Class Reference#