CUB-HDP Dummy Data Source
=========================

The CUB-HDP Dummy source generates fully synthetic, reproducible clinical data that mirrors the structure and variable names of the real :doc:`cub_hdp` source. No database connection is required, making it ideal for development, testing, and demonstration.

Overview
--------

``cub_hdp_dummy`` produces plausible but entirely fake data by sampling from statistical distributions that approximate real ICU cohort characteristics. It shares the same variable catalogue as CUB-HDP, so code written against ``cub_hdp_dummy`` runs unchanged against the real source once a database connection is available.

**When to use it:**

- Local development without VPN or database access
- Unit and integration testing
- Demonstrating CORR-Vars features to new users
- Benchmarking and profiling pipelines

Configuration
-------------

.. code-block:: python

    from corr_vars import Cohort

    cohort = Cohort(
        obs_level="icu_stay",
        load_default_vars=False,
        sources={
            "cub_hdp_dummy": {
                "size": 1000,   # approximate number of rows to generate
                "seed": 42,     # random seed for reproducibility
            }
        }
    )

Both keys are optional:

.. list-table::
   :header-rows: 1
   :widths: 15 15 70

   * - Key
     - Default
     - Description
   * - ``size``
     - ``100_000``
     - Target number of observations; the generated count is approximate.
       Accepts an ``int`` for a fixed target or a ``tuple[int, int]``, e.g.
       ``(500, 2000)``, for a target drawn uniformly from that range.
   * - ``seed``
     - ``42``
     - Integer random seed. Fix it for reproducible datasets; vary it to
       generate independent synthetic datasets.
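
To make the two accepted forms of ``size`` concrete, the sketch below shows one
way an ``int`` or a ``(low, high)`` tuple could be resolved into a target row
count with a seeded generator. ``resolve_size`` is a hypothetical helper written
for this illustration and is not part of CORR-Vars; treating the range bounds
as inclusive is likewise an assumption.

.. code-block:: python

    import random


    def resolve_size(size, seed=42):
        """Hypothetical helper: turn a ``size`` setting into a target row count."""
        if isinstance(size, int):
            return size  # a plain int is used as the target directly
        low, high = size  # a tuple is read as an inclusive (low, high) range
        # A dedicated, seeded generator keeps the draw reproducible.
        return random.Random(seed).randint(low, high)


    fixed = resolve_size(1000)
    drawn = resolve_size((500, 2000), seed=42)
    drawn_again = resolve_size((500, 2000), seed=42)  # same seed, same draw

Because the generator is seeded, repeating the call with the same seed yields
the same target, which is the property that makes a dummy cohort reproducible.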

Generated Cohort Data
---------------------

The columns available in ``cohort.obs`` depend on ``obs_level``:

.. list-table:: Columns by observation level
   :header-rows: 1
   :widths: 20 80

   * - ``obs_level``
     - Columns in ``cohort.obs``
   * - ``"patient"``
     - ``patient_id``, ``age_on_admission``, ``sex``, ``birthdate``,
       ``death_timestamp``, ``inhospital_death``
   * - ``"hospital_stay"``
     - All patient columns plus ``case_id``, ``hospital_admission``,
       ``hospital_discharge``, ``hospital_status``
   * - ``"icu_stay"``
     - All hospital-stay columns plus ``icu_stay_id``, ``icu_admission``,
       ``icu_discharge``, ``icu_status``, ``icu_id``,
       ``icu_admission_diagnoses``, ``bw_fach_oe``, ``bw_patient_origin``

Variables
---------

Static variables (e.g. ``age_on_admission``, ``inhospital_death``) are
available immediately after cohort creation. Dynamic variables reuse the same
variable names as CUB-HDP and are populated by ``DummyDynamic`` objects that
sample from configurable distributions.

.. code-block:: python

    # Default variables work out of the box
    cohort = Cohort(
        obs_level="icu_stay",
        sources={"cub_hdp_dummy": {"size": 500, "seed": 0}},
    )
    cohort.add_variable("blood_sodium")      # dynamic: synthetic time-series
    cohort.add_variable("age_on_admission")  # static: already in cohort.obs

Custom Synthetic Variables
--------------------------

``DummyDynamic`` lets you define custom synthetic time-series with full control
over the value distribution, measurement frequency, and missingness.
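
As a rough guide to data volume, the arithmetic below estimates how many rows
one synthetic variable might produce. It reads ``density`` as measurements per
day for each stay that has data, and ``missingness`` as the share of stays with
no measurements at all. ``expected_rows`` and the 3-day mean stay length are
assumptions made for this illustration, not part of CORR-Vars.

.. code-block:: python

    def expected_rows(n_stays, density, missingness, mean_stay_days):
        """Back-of-the-envelope row count for one synthetic dynamic variable."""
        stays_with_data = n_stays * (1.0 - missingness)
        return stays_with_data * density * mean_stay_days


    # 500 stays, hourly measurements, 5 % of stays without data,
    # and an assumed mean stay length of 3 days
    rows = expected_rows(
        n_stays=500, density=24.0, missingness=0.05, mean_stay_days=3.0
    )

With these inputs the estimate is about 34,200 rows, which helps when choosing
``size`` and ``density`` for benchmarking runs.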

Numeric Variables
^^^^^^^^^^^^^^^^^

.. code-block:: python

    from corr_vars.sources.cub_hdp_dummy import DummyDynamic

    # Simulate hourly heart rate measurements (mean 80 bpm, limited to 40-200 bpm)
    heart_rate_sim = DummyDynamic(
        var_name="heart_rate_sim",
        data_type="numeric",
        mean=80.0,
        q1=68.0,
        q3=92.0,
        value_min=40.0,
        value_max=200.0,
        decimals=0,
        density=24.0,       # ~24 measurements per day for each patient with data
        missingness=0.05,   # 5 % of patients have no measurements
    )
    cohort.add_variable(heart_rate_sim)
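
This page does not state which distribution ``DummyDynamic`` samples from, so
the following is only a sketch of how the numeric parameters could interact. It
assumes a normal distribution whose spread is derived from ``q1`` and ``q3``,
followed by clipping and rounding. ``sample_numeric`` is a hypothetical function
and does not reflect the library's implementation.

.. code-block:: python

    import random


    def sample_numeric(n, mean, q1, q3, value_min, value_max, decimals, seed=0):
        """Hypothetical sampler: normal draws clipped to [value_min, value_max]."""
        rng = random.Random(seed)
        # For a normal distribution the interquartile range is about 1.349 SD.
        sd = (q3 - q1) / 1.349
        values = []
        for _ in range(n):
            raw = rng.gauss(mean, sd)
            clipped = min(max(raw, value_min), value_max)
            values.append(round(clipped, decimals))
        return values


    heart_rates = sample_numeric(
        n=1000, mean=80.0, q1=68.0, q3=92.0,
        value_min=40.0, value_max=200.0, decimals=0,
    )

Under this reading, ``value_min`` and ``value_max`` act as hard limits, while
``mean``, ``q1``, and ``q3`` shape where most values fall.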

Boolean Variables
^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Mechanical ventilation flag; true_freq=0.4 sets the share of True values
    ventilation_sim = DummyDynamic(
        var_name="ventilation_sim",
        data_type="boolean",
        true_freq=0.4,
        density=2.0,
        missingness=0.1,
    )
    cohort.add_variable(ventilation_sim)

Categorical Variables
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Sedation agent with weighted categories
    sedation_sim = DummyDynamic(
        var_name="sedation_agent_sim",
        data_type="string",
        categories={"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20},
        density=1.0,
        missingness=0.6,
    )
    cohort.add_variable(sedation_sim)
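
To illustrate what weighted categories mean in practice, the sketch below draws
from the same mapping with the standard library and compares the observed
shares with the requested weights. It assumes the weights are interpreted as
probabilities; how ``DummyDynamic`` handles weights that do not sum to 1 is not
covered here.

.. code-block:: python

    import random
    from collections import Counter

    categories = {"Propofol": 0.55, "Midazolam": 0.25, "Dexmedetomidine": 0.20}

    # Weights are treated as probabilities here, so they should sum to 1.
    total_weight = sum(categories.values())

    rng = random.Random(0)
    draws = rng.choices(
        population=list(categories),
        weights=list(categories.values()),
        k=10_000,
    )
    # Observed share of each category across all draws
    observed = {name: count / len(draws) for name, count in Counter(draws).items()}

With enough draws the observed shares approach the configured weights, which
gives a quick sanity check for a custom category mapping.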

Interval Variables
^^^^^^^^^^^^^^^^^^

Use ``observation_type="interval"`` for therapy-like variables that have both a
start and an end time:

.. code-block:: python

    # Vasopressor infusion as an interval variable
    vasopressor_sim = DummyDynamic(
        var_name="vasopressor_sim",
        data_type="boolean",
        true_freq=0.8,
        density=0.5,
        missingness=0.7,
        observation_type="interval",
        interval_min_s=3600,       # minimum 1 hour
        interval_max_s=86400 * 3,  # maximum 3 days
    )
    cohort.add_variable(vasopressor_sim)
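
The role of ``interval_min_s`` and ``interval_max_s`` can be pictured with the
sketch below, which draws a start time inside a stay and a duration between the
two limits. ``sample_interval`` is a hypothetical function. Uniform sampling is
an assumption, and whether the library truncates intervals at the end of the
stay is not stated on this page.

.. code-block:: python

    import random
    from datetime import datetime, timedelta


    def sample_interval(stay_start, stay_end, interval_min_s, interval_max_s, rng):
        """Hypothetical sampler: one (start, end) interval beginning inside a stay."""
        stay_seconds = (stay_end - stay_start).total_seconds()
        start = stay_start + timedelta(seconds=rng.uniform(0, stay_seconds))
        duration = timedelta(seconds=rng.uniform(interval_min_s, interval_max_s))
        # The end is not clipped to stay_end in this sketch.
        return start, start + duration


    rng = random.Random(0)
    stay_start = datetime(2024, 1, 1, 8, 0)
    stay_end = stay_start + timedelta(days=5)
    start, end = sample_interval(stay_start, stay_end, 3600, 86400 * 3, rng)

Each interval therefore lasts between one hour and three days, matching the
limits passed to ``vasopressor_sim`` above.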

Combining with Real Sources
---------------------------

``cub_hdp_dummy`` is a drop-in stand-in for ``cub_hdp`` during local
development. Switch to ``cub_hdp`` once a database connection is available;
the example below falls back to dummy data when no password file is found:

.. code-block:: python

    import os
    from corr_vars import Cohort

    # Fall back to the dummy source when no password file is present
    USE_DUMMY = not os.path.exists(os.path.expanduser("~/password.txt"))

    if USE_DUMMY:
        sources = {"cub_hdp_dummy": {"size": 1000, "seed": 42}}
    else:
        sources = {
            "cub_hdp": {
                "database": "db_hypercapnia_prepared",
                "conn_args": {"password_file": True},
            }
        }

    cohort = Cohort(obs_level="icu_stay", sources=sources)

Class Reference
---------------

.. currentmodule:: corr_vars.sources.cub_hdp_dummy.extract

.. autoclass:: DummyDynamic
   :members:
   :undoc-members:
   :show-inheritance:
