CUB-HDP Dummy Data Source#

The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real CUB-HDP data source. No database connection is required, making it ideal for development, testing, and demonstration.

Overview#

cub_hdp_dummy produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against cub_hdp_dummy runs unchanged against the real source once a database connection is available.

When to use it:

  • Local development without VPN or database access

  • Unit and integration testing

  • Demonstrating CORR-Vars features to new users

  • Benchmarking and profiling pipelines

Configuration#

from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    load_default_vars=False,
    sources={
        "cub_hdp_dummy": {
            "size": 1000,   # approximate number of rows to generate
            "seed": 42,     # random seed for reproducibility
        }
    }
)

Both keys are optional:

  • size (default: 100_000) – Approximate number of observations. Accepts an int for an exact target or a tuple[int, int] for a random size drawn uniformly from the given range, e.g. (500, 2000).

  • seed (default: 42) – Integer random seed. Fix it for reproducible datasets; vary it to generate independent synthetic datasets.
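
As a rough illustration of how the two keys interact, the sketch below resolves a size setting into a concrete row count with a seeded generator. The helper name resolve_size is hypothetical and not part of corr_vars; the package's own sampling code may work differently.

```python
import random
from typing import Tuple, Union

SizeSpec = Union[int, Tuple[int, int]]


def resolve_size(size: SizeSpec = 100_000, seed: int = 42) -> int:
    """Turn a `size` setting into a target row count (hypothetical helper)."""
    if isinstance(size, tuple):
        low, high = size
        if low > high:
            raise ValueError("size range must be given as (low, high)")
        # Draw uniformly from the inclusive range using a seeded generator,
        # so the same seed always yields the same target.
        return random.Random(seed).randint(low, high)
    return size


fixed = resolve_size(1000, seed=42)           # an int is used as-is
drawn_a = resolve_size((500, 2000), seed=42)  # a tuple is sampled
drawn_b = resolve_size((500, 2000), seed=42)  # same seed, same result
print(fixed, drawn_a == drawn_b)
```

Repeating the call with the same seed returns the same count, which is the property that makes a fixed seed useful in tests.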

Generated Cohort Data#

The columns available in cohort.obs depend on obs_level:

Columns by observation level#

  • "patient" – patient_id, age_on_admission, sex, birthdate, death_timestamp, inhospital_death

  • "hospital_stay" – All patient columns plus case_id, hospital_admission, hospital_discharge, hospital_status

  • "icu_stay" – All hospital-stay columns plus icu_stay_id, icu_admission, icu_discharge, icu_status, icu_id, icu_admission_diagnoses, bw_fach_oe, bw_patient_origin
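
Because each level extends the previous one, the expected columns can be built cumulatively. The sketch below is a possible test helper under that reading; the names EXPECTED_OBS_COLUMNS and missing_columns are hypothetical and not provided by corr_vars.

```python
from typing import Dict, Iterable, List

PATIENT_COLUMNS: List[str] = [
    "patient_id", "age_on_admission", "sex",
    "birthdate", "death_timestamp", "inhospital_death",
]
# Each level adds its own columns on top of the previous level.
HOSPITAL_STAY_COLUMNS: List[str] = PATIENT_COLUMNS + [
    "case_id", "hospital_admission", "hospital_discharge", "hospital_status",
]
ICU_STAY_COLUMNS: List[str] = HOSPITAL_STAY_COLUMNS + [
    "icu_stay_id", "icu_admission", "icu_discharge", "icu_status", "icu_id",
    "icu_admission_diagnoses", "bw_fach_oe", "bw_patient_origin",
]

EXPECTED_OBS_COLUMNS: Dict[str, List[str]] = {
    "patient": PATIENT_COLUMNS,
    "hospital_stay": HOSPITAL_STAY_COLUMNS,
    "icu_stay": ICU_STAY_COLUMNS,
}


def missing_columns(obs_level: str, actual_columns: Iterable[str]) -> List[str]:
    """Return the expected columns that are absent from `actual_columns`."""
    present = set(actual_columns)
    return [col for col in EXPECTED_OBS_COLUMNS[obs_level] if col not in present]


# Example with a hand-written column list; in a real test one might pass
# cohort.obs.columns instead (assuming cohort.obs is DataFrame-like).
print(missing_columns("patient", ["patient_id", "sex"]))
```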

Variables#

Static variables (e.g. age_on_admission, inhospital_death) are available immediately after cohort creation. Dynamic variables reuse the same variable names as CUB-HDP and are populated by DummyDynamic objects that sample from configurable distributions.

# Default variables work out of the box
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
)
cohort.add_variable("blood_sodium")      # dynamic — synthetic time-series
cohort.add_variable("age_on_admission")  # static — already in cohort.obs

Custom Synthetic Variables#

DummyDynamic lets you define custom synthetic time-series with full control over the value distribution, measurement frequency, and missingness.

Numeric Variables#

from corr_vars.sources.cub_hdp_dummy import DummyDynamic

# Simulate hourly heart rate measurements (mean 80 bpm, IQR 68–92, bounded to 40–200)
heart_rate_sim = DummyDynamic(
    var_name="heart_rate_sim",
    data_type="numeric",
    mean=80.0,
    q1=68.0,
    q3=92.0,
    value_min=40.0,
    value_max=200.0,
    decimals=0,
    density=24.0,       # ~24 measurements per patient per day for patients with heart rate data
    missingness=0.05,   # 5 % of patients have no measurements
)
cohort.add_variable(heart_rate_sim)
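
How DummyDynamic turns mean, q1, q3, value_min and value_max into values is not spelled out here. One plausible reading, shown purely as an assumption, is a normal distribution whose standard deviation is derived from the interquartile range, with out-of-range draws rejected. The function sample_numeric below is a hypothetical stand-in, not the library's implementation.

```python
import random
from statistics import NormalDist
from typing import List, Optional


def sample_numeric(
    n: int,
    mean: float,
    q1: float,
    q3: float,
    value_min: float,
    value_max: float,
    decimals: Optional[int] = None,
    seed: int = 0,
) -> List[float]:
    """Sample n bounded values from an assumed normal distribution."""
    rng = random.Random(seed)
    # For a normal distribution, IQR = 2 * z_0.75 * sd (z_0.75 is about 0.6745).
    sd = (q3 - q1) / (2 * NormalDist().inv_cdf(0.75))
    values: List[float] = []
    while len(values) < n:
        x = rng.gauss(mean, sd)
        # Rejection sampling keeps every value inside [value_min, value_max].
        if value_min <= x <= value_max:
            values.append(round(x, decimals) if decimals is not None else x)
    return values


heart_rates = sample_numeric(
    n=5000, mean=80.0, q1=68.0, q3=92.0,
    value_min=40.0, value_max=200.0, decimals=0, seed=1,
)
print(min(heart_rates), max(heart_rates))
```

Rejecting out-of-range draws shifts the sample mean slightly away from the requested mean when a bound sits close to it; clipping would instead pile values up at the bounds.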

Boolean Variables#

# Mechanical ventilation flag: True in ~40 % of measurements; 10 % of patients have no data
ventilation_sim = DummyDynamic(
    var_name="ventilation_sim",
    data_type="boolean",
    true_freq=0.4,
    density=2.0,
    missingness=0.1,
)
cohort.add_variable(ventilation_sim)

Categorical Variables#

# Sedation agent with weighted categories
sedation_sim = DummyDynamic(
    var_name="sedation_agent_sim",
    data_type="string",
    categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
    density=1.0,
    missingness=0.6,
)
cohort.add_variable(sedation_sim)
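
The class reference notes that category weights are normalised automatically, so weights summing to 1.0 or to 100 behave the same. A minimal sketch of that normalisation, using hypothetical helper names and Python's standard random.choices rather than the library's own code, could look like this:

```python
import random
from typing import Dict, List, Union

Categories = Union[List[str], Dict[str, float]]


def normalise_categories(categories: Categories) -> Dict[str, float]:
    """Map categories to probabilities that sum to 1.0."""
    if isinstance(categories, dict):
        total = sum(categories.values())
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        return {name: weight / total for name, weight in categories.items()}
    if not categories:
        raise ValueError("at least one category is required")
    # A plain list means every category is equally likely.
    return {name: 1.0 / len(categories) for name in categories}


def sample_categories(categories: Categories, n: int, seed: int = 0) -> List[str]:
    """Draw n category labels according to the normalised weights."""
    probs = normalise_categories(categories)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=n)


as_percent = normalise_categories(
    {"Propofol": 55, "Midazolam": 25, "Dexmedetomidine": 20}
)
draws = sample_categories(["Propofol", "Midazolam"], n=10, seed=3)
print(as_percent, draws)
```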

Interval Variables#

Use observation_type="interval" for therapy-like variables that have both a start and end time:

# Vasopressor infusion as an interval variable
vasopressor_sim = DummyDynamic(
    var_name="vasopressor_sim",
    data_type="boolean",
    true_freq=0.8,
    density=0.5,
    missingness=0.7,
    observation_type="interval",
    interval_min_s=3600,       # minimum 1 hour
    interval_max_s=86400 * 3,  # maximum 3 days
)
cohort.add_variable(vasopressor_sim)
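
To make the interval mechanics concrete, the sketch below attaches an end time to each start time by adding a random duration between interval_min_s and interval_max_s. The uniform draw and the function name add_interval_ends are assumptions for illustration; the source only states that a random duration is added to recordtime.

```python
import random
from datetime import datetime, timedelta
from typing import List, Tuple


def add_interval_ends(
    recordtimes: List[datetime],
    interval_min_s: int = 60,
    interval_max_s: int = 86400,
    seed: int = 0,
) -> List[Tuple[datetime, datetime]]:
    """Pair each recordtime with a recordtime_end (hypothetical helper)."""
    if interval_min_s > interval_max_s:
        raise ValueError("interval_min_s must not exceed interval_max_s")
    rng = random.Random(seed)
    rows = []
    for start in recordtimes:
        # Assumption: durations are uniform over whole seconds in the range.
        duration_s = rng.randint(interval_min_s, interval_max_s)
        rows.append((start, start + timedelta(seconds=duration_s)))
    return rows


starts = [datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 2, 14, 30)]
intervals = add_interval_ends(
    starts, interval_min_s=3600, interval_max_s=86400 * 3, seed=42
)
for start, end in intervals:
    print(start, end, end - start)
```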

Combining with Real Sources#

cub_hdp_dummy can serve as a drop-in stand-in during local development and be swapped for cub_hdp once a database connection is available. The example below falls back to the dummy source when no password file is found:

import os
from corr_vars import Cohort

USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

sources = (
    {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    if USE_DUMMY
    else {"cub_hdp": {"database": "db_hypercapnia_prepared", "conn_args": {"password_file": True}}}
)

cohort = Cohort(obs_level="icu_stay", sources=sources)

Class Reference#

class corr_vars.sources.cub_hdp_dummy.extract.DummyDynamic(var_name, tmin=None, tmax=None, py=None, data_type=None, mean=None, value_min=None, value_max=None, q1=None, q3=None, decimals=None, missingness=0.0, density=None, true_freq=None, categories=None, observation_type='snapshot', interval_min_s=60, interval_max_s=86400)

Bases: Variable

Dummy dynamic variable that generates synthetic time-series data.

Supports four value types: numeric, boolean, string/categorical, datetime.

Patient selection and measurement frequency are controlled by two independent parameters:

  • missingness: fraction of patients who have no measurements at all (sampled once, randomly, before density is applied).

  • density: expected measurements per day for patients with data. Uses a zero-truncated Poisson so selected patients always get ≥ 1 row, and the count distribution is not distorted by the zero-inflation that a plain Poisson would introduce.
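
The two-step scheme described above can be sketched with the standard library alone. This is an illustrative reimplementation under stated assumptions, not the package's code: sample_ztp and measurement_counts are hypothetical names, and the inverse-CDF search shown here is only suitable for moderate rates (math.expm1 overflows for lambda above roughly 700).

```python
import math
import random
from typing import List


def sample_ztp(lam: float, rng: random.Random) -> int:
    """Draw from a zero-truncated Poisson(lam) by inverse-CDF search."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    k = 1
    p = lam / math.expm1(lam)  # P(K = 1) = lam / (e^lam - 1)
    cdf = p
    u = rng.random()
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k           # P(K = k) = P(K = k - 1) * lam / k
        cdf += p
    return k


def measurement_counts(
    durations_days: List[float],
    density: float,
    missingness: float = 0.0,
    seed: int = 0,
) -> List[int]:
    """Number of rows per patient: 0 if missing, otherwise a ZTP draw."""
    rng = random.Random(seed)
    counts = []
    for days in durations_days:
        if rng.random() < missingness:
            counts.append(0)  # Bernoulli draw: patient has no data at all
        else:
            counts.append(sample_ztp(density * days, rng))
    return counts


counts = measurement_counts([0.5, 2.0, 7.0], density=1.0, missingness=0.0, seed=1)
print(counts)
```

The truncation matters most for short stays: with a plain Poisson and lambda = 0.5, about 61 % of selected patients would end up with zero rows, which would blur the distinction between missingness and density.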

Parameters:
  • var_name (str) – Name of the variable.

  • tmin (str | tuple[str, str] | None) – Minimum time bound column.

  • tmax (str | tuple[str, str] | None) – Maximum time bound column.

  • py (VariableCallable | None) – Optional Python function to apply to the variable post-extraction.

  • data_type (Optional[Literal['numeric', 'boolean', 'string', 'datetime']]) – One of "numeric", "boolean", "string", "datetime".

  • mean (float | None) – Mean of the distribution (numeric / datetime in epoch seconds).

  • value_min (float | None) – Minimum value (numeric / datetime in epoch seconds).

  • value_max (float | None) – Maximum value (numeric / datetime in epoch seconds).

  • q1 (float | None) – First quartile (numeric / datetime in epoch seconds).

  • q3 (float | None) – Third quartile (numeric / datetime in epoch seconds).

  • decimals (int | None) – Round numeric values to this many decimal places (optional).

  • missingness (float) – Fraction of patients with no measurements (0.0–1.0). 0.0 → all patients have data; 1.0 → no patients have data. Applied as a Bernoulli draw per patient before density sampling. Default: 0.0.

  • density (float | None) – Expected measurements per patient per day for patients who have data. Each selected patient’s count is drawn from a zero-truncated Poisson (ZTP) with lambda = density × patient_duration_days, guaranteeing ≥ 1 measurement. Examples: 1.0 → daily, 0.143 → weekly, 24.0 → hourly.

  • true_freq (float | None) – Frequency of True values for data_type="boolean" (0–1). Defaults to 0.5.

  • categories (list[str] | dict[str, float] | None) –

    For data_type="string":

    • list[str]: categories sampled with uniform probability.

    • dict[str, float]: {category: weight} — weights may sum to 1.0 or 100; they are normalised automatically.

  • observation_type (Literal['snapshot', 'interval']) – "snapshot" — each row has a single recordtime. "interval" — each row also has a recordtime_end, derived by adding a random duration to recordtime.

  • interval_min_s (int) – Minimum interval duration in seconds (default: 60).

  • interval_max_s (int) – Maximum interval duration in seconds (default: 86 400).

extract(cohort)

Extracts data from the data source. Implementations usually follow this pattern:

# Load & Extract required variables
self._get_required_vars(cohort)

# This should change self.data either by returning or side effect
self.data = self._custom_extraction(cohort)

# Calls variable function to transform extracted data
self._call_var_function(cohort)

if self.dynamic:
    self._add_time_window(cohort)
    # Expects case_tmin, case_tmax for each primary key
    self._timefilter()
    self._apply_cleaning()

Parameters:
  • cohort (Cohort)

Return type: DataFrame