Cohort#

Cohort Workflow#

Initialization#

The following snippet initializes a Cohort object from the CUB-HDP data source.

from corr_vars.core.cohort import Cohort  # module path as listed in the class reference

cohort = Cohort(
    obs_level="icu_stay",     # One of: "patient", "hospital_stay", "icu_stay", "procedure"
    load_default_vars=False,  # Optional, defaults to True
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},  # True assumes ~/password.txt
            "icu_stay": {"merge_consecutive": False},
        }
    },
)

Specify multiple data sources to combine cohorts.

cohort = Cohort(
    obs_level="icu_stay",     # One of: "patient", "hospital_stay", "icu_stay", "procedure"
    load_default_vars=False,  # Optional, defaults to True
    sources={
        "cub_hdp": {
            "database": "db_hypercapnia_prepared",
            "conn_args": {"password_file": True},  # True assumes ~/password.txt
            "icu_stay": {"merge_consecutive": False},
        },
        "reprodicu": {
            "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
        }
    },
)

Accessing Data#

After initialization, cohort data is available through two attributes:

  • cohort.obs — a Polars DataFrame with one row per observation (static variables, demographics, outcomes).

  • cohort.obsm — a dictionary of Polars DataFrames, one per dynamic (time-series) variable.

import polars as pl  # needed for the pl.col(...) expressions below

# Inspect static data
print(cohort.obs)
cohort.obs.select("age_on_admission")
cohort.obs.filter(pl.col("sex") == "M")

# Access a dynamic variable
print(cohort.obsm["blood_sodium"])

# Filter time-series data for a specific observation
cohort.obsm["blood_sodium"].filter(
    pl.col(cohort.primary_key) == "12345"
)

# Check which dynamic variables have been extracted
print(list(cohort.obsm.keys()))

You can assign a new DataFrame back to cohort.obs to add computed columns:

cohort.obs = cohort.obs.with_columns([
    pl.col('first_sodium_recordtime').eq(pl.col('first_severe_hypernatremia_recordtime'))
    .alias('idx_hypernatremia_was_on_admission')
])

cohort.obs = cohort.obs.with_columns([
    pl.when(pl.col('idx_hypernatremia_was_on_admission'))
    .then(pl.lit('community_acquired'))
    .otherwise(pl.lit('hospital_acquired'))
    .alias('hn_origin')
])

Adding Variables#

# Use pre-defined variables
cohort.add_variable("pf_ratio")

# Load a variable with custom time bounds
cohort.add_variable(
    variable="any_dx_covid_19",
    tmin=("hospital_admission", "-1d"),
    tmax=cohort.t_eligible
)

# Create a custom variable on the fly
cohort.add_variable(
    NativeStatic(
        var_name="median_sodium_before_hn",
        select="!median value",
        base_var="blood_sodium",
        tmin="hospital_admission",
        tmax=cohort.t_eligible
    )
)

# Save a variable under a different name
cohort.add_variable(
    variable="any_med_glu",
    save_as="glucose_prior_eligible",
    tmin=(cohort.t_eligible, "-48h"),
    tmax=cohort.t_eligible,
)
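
To make the tmin, tmax and select arguments more concrete, the following plain-Python sketch shows roughly what a windowed aggregate such as "!median value" computes for a single observation. It does not call corr_vars: median_in_window and the sample rows are hypothetical, and treating both bounds as inclusive is an assumption, not documented behaviour.

```python
from datetime import datetime
from statistics import median


def median_in_window(rows, tmin, tmax):
    """Median of 'value' over rows with tmin <= recordtime <= tmax (inclusive bounds assumed)."""
    values = [r["value"] for r in rows if tmin <= r["recordtime"] <= tmax]
    return median(values) if values else None


# Toy sodium measurements for one observation (made-up values)
sodium = [
    {"recordtime": datetime(2023, 1, 1, 9, 30), "value": 138},
    {"recordtime": datetime(2023, 1, 2, 10, 15), "value": 141},
    {"recordtime": datetime(2023, 1, 3, 15, 0), "value": 150},
]

hospital_admission = datetime(2023, 1, 1, 8, 0)
t_eligible = datetime(2023, 1, 2, 12, 0)

# Only the first two measurements fall inside the window
result = median_in_window(sodium, hospital_admission, t_eligible)
print(result)
```

The third measurement lies after the upper bound and is ignored, which is the point of passing tmax=cohort.t_eligible.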

For large cohorts it is faster to apply filters before loading default variables:

cohort = Cohort(obs_level="icu_stay", load_default_vars=False, ...)

# Filter first ...
cohort.include_list([
    {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])

# ... then load default variables on the reduced cohort
cohort.load_default_vars()

Pass project_vars at construction time to register definitions before any variables are loaded:

cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared",
                         "conn_args": {"password_file": True}}},
    project_vars={
        "my_new_var": {
            "type": "native_dynamic",
            "table": "it_ishmed_labor",
            "where": "c_katalog_leistungtext LIKE '%new%'",
            "value_dtype": "DOUBLE",
            "cleaning": {"value": {"low": 100, "high": 150}},
        },
        "blood_sodium": {
            "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
        },
    },
)

To add or override a variable definition at runtime without editing vars.json, use add_variable_definition(). Use get_variable_definition() to inspect the active definition (merged local override + global defaults) per source:

# Register a brand-new variable
cohort.add_variable_definition("my_new_var", {
    "type": "native_dynamic",
    "table": "it_ishmed_labor",
    "where": "c_katalog_leistungtext LIKE '%new%'",
    "value_dtype": "DOUBLE",
    "cleaning": {"value": {"low": 100, "high": 150}},
})

# Partially override an existing variable (merges with global definition)
cohort.add_variable_definition("blood_sodium", {
    "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
})

# Inspect the resolved definition per data source
defn = cohort.get_variable_definition("blood_sodium")
# {"cub_hdp": {"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"}}
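
The idea of a partial override being layered on top of a global definition can be sketched in plain Python. This is only an illustration: merge_definition is a hypothetical helper, the global definition shown here is made up, and whether corr_vars merges nested keys recursively or only at the top level is an assumption.

```python
from copy import deepcopy


def merge_definition(global_def, local_def):
    """Return global_def with local_def layered on top, merging nested dicts key by key."""
    merged = deepcopy(global_def)
    for key, value in local_def.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_definition(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical global definition, invented for this illustration
global_sodium = {
    "type": "native_dynamic",
    "table": "it_ishmed_labor",
    "where": "c_katalog_leistungtext LIKE '%sodium%'",
    "cleaning": {"value": {"low": 100, "high": 150}},
}

# Partial local override: only the "where" clause changes
local_override = {"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"}

resolved = merge_definition(global_sodium, local_override)
print(resolved["where"])
print(resolved["table"])
```

Fields that the override does not mention (type, table, cleaning) are inherited unchanged, and the global definition itself is not modified.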

Time Anchors#

t_eligible marks the earliest timepoint a patient is eligible for the study; t_outcome marks the primary outcome timepoint. Both default to observation-level column names (e.g. icu_admission / icu_discharge) but should be overridden for most study designs.

# Extract the first SpO2 < 90 % event as the eligibility anchor
cohort.add_variable(NativeStatic(
    var_name="spo2_lt_90",
    base_var="spo2",
    select="!first recordtime",
    where="value < 90",
))
cohort.set_t_eligible("spo2_lt_90")    # also drops rows where the anchor is null

# Set the outcome anchor (no rows are dropped)
cohort.set_t_outcome("hospital_discharge")

# t_eligible / t_outcome can now be used as tmax/tmin elsewhere
cohort.add_variable("blood_sodium", tmax=cohort.t_eligible)
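
The three steps above (derive an anchor, drop observations without one, bound later extractions by it) can be mimicked with plain Python lists. This is a conceptual sketch only; the data and the helper first_recordtime are hypothetical and no corr_vars code is involved.

```python
from datetime import datetime

# Toy stand-ins for cohort.obs and cohort.obsm["spo2"]
obs = [{"icu_stay_id": "C001_1"}, {"icu_stay_id": "C002_1"}]
spo2 = [
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 9, 0), "value": 96},
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 11, 0), "value": 88},
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 13, 0), "value": 85},
    {"icu_stay_id": "C002_1", "recordtime": datetime(2023, 1, 2, 10, 0), "value": 97},
]


def first_recordtime(rows, stay_id, predicate):
    """Earliest recordtime for one stay among rows whose value matches predicate, else None."""
    times = [
        r["recordtime"]
        for r in rows
        if r["icu_stay_id"] == stay_id and predicate(r["value"])
    ]
    return min(times) if times else None


# Step 1: derive the anchor (analogous to select="!first recordtime", where="value < 90")
for row in obs:
    row["spo2_lt_90"] = first_recordtime(spo2, row["icu_stay_id"], lambda v: v < 90)

# Step 2: drop observations without an anchor (analogous to drop_ineligible=True)
eligible = [row for row in obs if row["spo2_lt_90"] is not None]

# Step 3: keep only measurements up to the anchor (analogous to tmax=cohort.t_eligible)
anchor = {row["icu_stay_id"]: row["spo2_lt_90"] for row in eligible}
before_anchor = [
    r
    for r in spo2
    if r["icu_stay_id"] in anchor and r["recordtime"] <= anchor[r["icu_stay_id"]]
]
print(len(eligible), len(before_anchor))
```

Stay C002_1 never drops below 90 %, so it has no anchor and is removed in step 2.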

Inclusion/Exclusion#

# Add multiple inclusion criteria at once
cohort.include_list([
    {
        "variable": "age_on_admission",
        "operation": ">= 18",
        "label": "Adult patients"
    },
    {
        "variable": "icu_length_of_stay",
        "operation": "> 2",
        "label": "ICU stay > 2 days"
    }
])

# Add multiple exclusion criteria at once
cohort.exclude_list([
    {
        "variable": "any_dx_covid_19",
        "operation": "== True",
        "label": "Exclude COVID-19 patients"
    }
])

Use include() / exclude() to apply a single criterion at a later stage:

cohort.include(
    variable="age_on_admission",
    operation=">= 18",
    label="Adult",
    operations_done="Include only adult patients",
)

cohort.exclude(
    variable="elix_total",
    operation="> 20",
    operations_done="Exclude patients with high Elixhauser score",
)

For grouped criteria that should appear as a single step in the flowchart, use the change_tracker() context manager:

with cohort.change_tracker("Adults", mode="include") as track:
    track.filter(pl.col("age_on_admission") >= 18)
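
As a rough mental model of how a list of criteria turns into flowchart steps, the sketch below applies criteria one after another and records the row count before and after each step. It is a simplified stand-in, not the library's implementation: apply_criteria and parse_operation are hypothetical, only numeric comparisons are handled, and missing values are not considered.

```python
import operator

_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def parse_operation(operation):
    """Split a string such as '>= 18' into a comparison function and a numeric threshold."""
    symbol, threshold = operation.split()
    return _OPS[symbol], float(threshold)


def apply_criteria(rows, criteria, mode):
    """Apply criteria in order; return the remaining rows and one count record per step."""
    steps = []
    for crit in criteria:
        compare, threshold = parse_operation(crit["operation"])
        matches = [compare(r[crit["variable"]], threshold) for r in rows]
        # "include" keeps matching rows, "exclude" keeps the non-matching ones
        keep = matches if mode == "include" else [not m for m in matches]
        remaining = [r for r, k in zip(rows, keep) if k]
        steps.append({"label": crit["label"], "before": len(rows), "after": len(remaining)})
        rows = remaining
    return rows, steps


patients = [
    {"age_on_admission": 17, "icu_length_of_stay": 5},
    {"age_on_admission": 45, "icu_length_of_stay": 1},
    {"age_on_admission": 70, "icu_length_of_stay": 4},
]
included, steps = apply_criteria(
    patients,
    [
        {"variable": "age_on_admission", "operation": ">= 18", "label": "Adult patients"},
        {"variable": "icu_length_of_stay", "operation": "> 2", "label": "ICU stay > 2 days"},
    ],
    mode="include",
)
for step in steps:
    print(f'{step["label"]}: {step["before"]} -> {step["after"]}')
```

Because each step operates on the output of the previous one, the order of the criteria determines the counts shown at each box of the flowchart, even though the final cohort is the same.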

Exploration#

# Create a TableOne summary
tableone = cohort.to_tableone(ignore_cols=["icu_id"])
print(tableone)

# Grouped TableOne (e.g. by sex)
tableone = cohort.to_tableone(groupby="sex", pval=True)
tableone.to_csv("tableone_sex.csv")

# Interactive Jupyter widget for obs
cohort.widget                          # renders the full obs DataFrame
cohort.to_widget("age_on_admission", "sex")  # select specific columns

# Interactive widget for dynamic variables
cohort.obsm.widget                     # renders all obsm variables
cohort.obsm.to_widget("blood_sodium")  # renders a single variable

# Inclusion / exclusion flowchart
cohort.figureone                       # returns a graphviz.Digraph object
cohort.to_figureone()                  # returns a graphviz.Digraph object

# Print debug information (helpful when filing a GitHub issue)
cohort.debug_print()

Data Export#

# Save to CORR archive (recommended)
cohort.save("my_cohort.corr3")

# Load from file (.corr2 and .corr3 supported)
cohort = Cohort.load("my_cohort.corr3")

# Export obs + all obsm variables as individual CSV files
cohort.to_csv("path/to/output_folder")

# Export obs + all obsm variables as individual Parquet files
cohort.to_parquet("path/to/output_folder")

# Convert obs to a Stata-compatible pandas DataFrame
stata_df = cohort.to_stata()

# Save obs directly as a .dta Stata file
cohort.to_stata(to_file="path/to/output_folder/my_cohort.dta")
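
The folder layout that the class reference describes for to_csv() (one _obs file plus one file per dynamic variable) can be reproduced with the standard library. The sketch below is an illustration of that layout only; export_csv is a hypothetical helper working on plain lists of dicts, not the corr_vars implementation, and it assumes every table has at least one row.

```python
import csv
import tempfile
from pathlib import Path


def export_csv(obs, obsm, folder):
    """Write obs to _obs.csv and each obsm table to <variable>.csv; return the file names."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    tables = {"_obs": obs, **obsm}
    for name, rows in tables.items():
        with open(folder / f"{name}.csv", "w", newline="") as handle:
            # Column names are taken from the first row (tables must be non-empty)
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return sorted(p.name for p in folder.iterdir())


obs = [{"icu_stay_id": "C001_1", "sex": "M"}]
obsm = {
    "blood_sodium": [
        {"icu_stay_id": "C001_1", "recordtime": "2023-01-01 09:30:00", "value": 138}
    ],
    "heart_rate": [
        {"icu_stay_id": "C001_1", "recordtime": "2023-01-01 09:35:00", "value": 82}
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(obs, obsm, Path(tmp) / "output_data")
print(written)
```

The leading underscore in _obs keeps the static table from colliding with a dynamic variable of the same name.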

Class Reference#

class corr_vars.core.cohort.Cohort(obs_level='icu_stay', sources={'cub_hdp': {'conn_args': {'password_file': True}, 'database': 'db_hypercapnia_prepared', 'icu_stay': {'merge_consecutive': True}}}, project_vars={}, load_default_vars=True, logger_args={})[source]#

Bases: object

Class to build a cohort in the CORR database.

Parameters:
  • obs_level (Literal["patient", "hospital_stay", "icu_stay", "procedure"]) –

    Observation level (default: “icu_stay”).

    • "patient" gives one row per patient (primary key: patient_id)

    • "hospital_stay" per hospitalisation (case_id)

    • "icu_stay" per ICU admission (icu_stay_id)

    • "procedure" per surgical procedure (procedure_id).

  • sources (dict[str, dict]) –

    Dictionary of data sources to use for data extraction. Available options are “cub_hdp”, “cub_hdp_dummy”, “reprodicu”. Source configurations:

    • cub_hdp: database, conn_args (password_file, remote_hostname, username), icu_stay (merge_consecutive, extra_columns), filter (extraction_start_date, extraction_end_date, additional_filters, include_adults_only, exclude_dhzb, exclude_brain_death)

    • cub_hdp_dummy: size, seed

    • reprodicu: path, exclude_datasets, include_datasets

    Note: reprodicu does not yet implement variable extraction, only cohort data.

  • project_vars (dict) – Dictionary with local variable definitions (default: {}).

  • load_default_vars (bool) – Whether to load the default variables (default: True).

  • logger_args (dict) – Logging configuration. Supported keys: level (int), file_path (str), file_mode (str), verbose_fmt (bool), colored_output (bool), formatted_numbers (bool) (default: {}).

obs#

Static data for each observation. Contains one row per observation (e.g., ICU stay) with columns for static variables like demographics and outcomes.

Example

>>> cohort.obs
patient_id  case_id icu_stay_id            icu_admission        icu_discharge sex   ... inhospital_death
0  P001         C001    C001_1       2023-01-01 08:30:00  2023-01-03 12:00:00   M   ...  False
1  P001         C001    C001_2       2023-01-03 14:20:00  2023-01-05 16:30:00   M   ...  False
2  P002         C002    C002_1       2023-01-02 09:15:00  2023-01-04 10:30:00   F   ...  False
3  P003         C003    C003_1       2023-01-04 11:45:00  2023-01-07 13:20:00   F   ...  True
...
Type:

pl.DataFrame

obsm#

Dynamic data stored as dictionary of DataFrames. Each DataFrame contains time-series data for a variable with columns:

  • recordtime: Timestamp of the measurement

  • value: Value of the measurement

  • recordtime_end: End time (only for duration-based variables like therapies)

  • description: Additional information (e.g., medication names)

Example

>>> cohort.obsm["blood_sodium"]
   icu_stay_id          recordtime  value
0  C001_1      2023-01-01 09:30:00   138
1  C001_1      2023-01-02 10:15:00   141
2  C001_2      2023-01-03 15:00:00   137
3  C002_1      2023-01-02 10:00:00   142
4  C003_1      2023-01-04 12:30:00   139
...
Type:

dict of pl.DataFrame

Notes

  • For large cohorts, set load_default_vars=False to speed up the extraction. You can use pre-extracted cohorts as starting points and load them using Cohort.load().

  • Variables can be added using cohort.add_variable(). Static variables will be added to obs, dynamic variables to obsm.

  • For quick prototyping, use sources["cub_hdp"]["filter"]["additional_filters"] with a "_dx" shorthand (e.g. "_d2" for the last 2 months), or use "cub_hdp_dummy" for fully synthetic data.

Examples

Create a new cohort:

>>> cohort = Cohort(
...                 obs_level="icu_stay",
...                 load_default_vars=False,
...                 sources={
...                     "cub_hdp": {
...                         "database": "db_hypercapnia_prepared",
...                         "conn_args": {"password_file": True},
...                         "icu_stay": {"merge_consecutive": False}},
...                     "reprodicu": {
...                         "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"}
...                 })

Access static data:

>>> cohort.obs.select("age_on_admission")  # Get age for all patients
>>> cohort.obs.filter(pl.col("sex") == "M")  # Filter for male patients

Access time-series data:

>>> cohort.obsm["blood_sodium"]  # Get all blood sodium measurements
>>> # Get blood sodium measurements for a specific observation
>>> cohort.obsm["blood_sodium"].filter(pl.col(cohort.primary_key) == "12345")
constant_vars: list[str]#
obs_level: ObsLevel#
primary_key: str#
t_min: str#
t_max: str#
t_eligible: str#
t_outcome: str#
logger_args: dict[str, Any]#
sources: SourceDict#
project_vars: dict[str, dict[str, Any]]#
tmpdir_manager: Final[TemporaryDirectoryManager]#
load_default_vars(tmin=None, tmax=None)[source]#

Load the default variables defined in vars.json. It is recommended to use this after filtering your cohort for eligibility to speed up the process.

Returns:

Variables are loaded into the cohort.

Return type:

None

Examples

>>> # Load default variables for an ICU cohort
>>> cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
>>> # Apply filters first (faster)
>>> cohort.include_list([
...     {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
... ])
>>> # Then load default variables
>>> cohort.load_default_vars()
Parameters:
  • tmin (str | tuple[str, str] | None)

  • tmax (str | tuple[str, str] | None)

add_variable(variable, save_as=None, tmin=None, tmax=None)[source]#

Add a variable to the cohort.

tmin and tmax each accept either a column name or a tuple of an anchor column and an offset. For example, ("hospital_admission", "+1d") resolves to one day after the patient's hospital admission time.
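
A minimal sketch of how such a relative bound could be resolved for one row is given below. It is not the library's parser: parse_offset and resolve_bound are hypothetical, and only the "d" and "h" units that appear in the examples on this page are handled.

```python
import re
from datetime import datetime, timedelta

# Units assumed for this sketch; the library may accept others
_UNITS = {"d": "days", "h": "hours"}


def parse_offset(offset):
    """Turn a string such as '+1d' or '-48h' into a timedelta."""
    match = re.fullmatch(r"([+-]?)(\d+)([dh])", offset)
    if match is None:
        raise ValueError(f"unsupported offset: {offset!r}")
    sign, amount, unit = match.groups()
    delta = timedelta(**{_UNITS[unit]: int(amount)})
    return -delta if sign == "-" else delta


def resolve_bound(row, bound):
    """Resolve a bound given as a column name or as an (anchor column, offset) tuple."""
    if isinstance(bound, tuple):
        anchor, offset = bound
        return row[anchor] + parse_offset(offset)
    return row[bound]


row = {"hospital_admission": datetime(2023, 1, 1, 8, 30)}
tmin = resolve_bound(row, ("hospital_admission", "-1d"))
tmax = resolve_bound(row, ("hospital_admission", "+1d"))
print(tmin, tmax)
```

A plain column name is returned unchanged, so the same function covers both forms that tmin and tmax accept.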

Parameters:
  • variable (str | VariableProtocol | MultiSourceVariable) – Variable to add. Either a string with the variable name (from vars.json) or a Variable object.

  • save_as (str | None) – Name of the column to save the variable as. Defaults to variable name.

  • tmin (str | tuple[str, str] | None) – Name of the column to use as tmin or tuple (see description).

  • tmax (str | tuple[str, str] | None) – Name of the column to use as tmax or tuple (see description).

Returns:

The variable object.

Return type:

Variable

Examples

>>> cohort.add_variable("blood_sodium")
>>> cohort.add_variable(
...    variable="any_dx_covid_19",
...    tmin=("hospital_admission", "-1d"),
...    tmax=cohort.t_eligible
... )
>>> cohort.add_variable(
...    NativeStatic(
...        var_name="highest_hct_before_eligible",
...        select="!max value",
...        base_var='blood_hematokrit',
...        tmax=cohort.t_eligible
...    )
... )
>>> cohort.add_variable(
...    variable='any_med_glu',
...    save_as="glucose_prior_eligible",
...    tmin=(cohort.t_eligible, "-48h"),
...    tmax=cohort.t_eligible
... )
load_variable(variable, tmin=None, tmax=None, include_sources=None)[source]#
Parameters:
  • variable (str | tuple[str, TimeWindow] | VariableProtocol | MultiSourceVariable)

  • tmin (str | tuple[str, str] | None)

  • tmax (str | tuple[str, str] | None)

  • include_sources (Iterable[str] | None)

Return type:

MultiSourceVariable

set_t_eligible(t_eligible, drop_ineligible=True)[source]#

Set the time anchor for eligibility. This can be referenced as cohort.t_eligible throughout the process and is required to add inclusion or exclusion criteria.

Parameters:
  • t_eligible (str) – Name of the column to use as t_eligible.

  • drop_ineligible (bool) – Whether to drop ineligible patients. Defaults to True.

Returns:

t_eligible is set.

Return type:

None

Examples

>>> # Add a suitable time-anchor variable
>>> cohort.add_variable(NativeStatic(
...    var_name="spo2_lt_90",
...    base_var="spo2",
...    select="!first recordtime",
...    where="value < 90",
... ))
>>> # Set the time anchor for eligibility
>>> cohort.set_t_eligible("spo2_lt_90")
set_t_outcome(t_outcome)[source]#

Set the time anchor for the outcome. It can be referenced as cohort.t_outcome throughout the process, and setting it is recommended for your study.

Parameters:

t_outcome (str) – Name of the column to use as t_outcome.

Returns:

t_outcome is set.

Return type:

None

Examples

>>> cohort.set_t_outcome("hospital_discharge")
change_tracker(description, group=None, mode='include')[source]#

Return a context manager to group cohort edits and record a single ChangeTracker state on exit.

Example

>>> with cohort.change_tracker("Adults", mode="include") as track:
...     track.filter(pl.col("age_on_admission") >= 18)

Parameters:
  • description (str)

  • group (str | None)

  • mode (Literal['include', 'exclude'])

Return type:

ChangeTrackerContext

include(*args, **kwargs)[source]#

Add an inclusion criterion to the cohort. It is recommended to use Cohort.include_list() and add all of your inclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.include_list() before calling Cohort.include() to ensure that the inclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. If you specify "true"/"True" or "false"/"False" as the value for operation, it is converted to "== True" or "== False", respectively.
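
The boolean shorthand described in this note can be pictured as a small normalization step. The function below is a hypothetical illustration of that conversion, not the library's code.

```python
def normalize_operation(operation):
    """Expand a bare boolean such as 'true' into an explicit comparison string."""
    stripped = operation.strip()
    if stripped in ("true", "True"):
        return "== True"
    if stripped in ("false", "False"):
        return "== False"
    # Anything else is assumed to already be a comparison such as '>= 18'
    return stripped


print(normalize_operation("true"))
print(normalize_operation(">= 18"))
```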

Examples

>>> cohort.include(
...    variable="age_on_admission",
...    operation=">= 18",
...    label="Adult",
...    operations_done="Include only adult patients"
... )
exclude(*args, **kwargs)[source]#

Add an exclusion criterion to the cohort. It is recommended to use Cohort.exclude_list() and add all of your exclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.exclude_list() before calling Cohort.exclude() to ensure that the exclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. If you specify "true"/"True" or "false"/"False" as the value for operation, it is converted to "== True" or "== False", respectively.

Examples

>>> cohort.exclude(
...    variable="elix_total",
...    operation="> 20",
...    operations_done="Exclude patients with high Elixhauser score"
... )
include_list(inclusion_list=[])[source]#

Add inclusion criteria to the cohort.

Parameters:

inclusion_list (list) –

List of inclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for inclusion

  • operation (str): Operation to apply (e.g., “> 5”, “== True”)

  • label (str): Short label for the inclusion step

  • operations_done (str): Detailed description of what this inclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, can be used to plot inclusion chart

Return type:

ct (CohortTracker)

Note

By default, all inclusion criteria are applied from tmin=cohort.t_min to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.include_list([
...    {
...        "variable": "age_on_admission",
...        "operation": ">= 18",
...        "label": "Adult patients",
...        "operations_done": "Excluded patients under 18 years old"
...    }
...  ])
>>> ct.create_flowchart()
exclude_list(exclusion_list=[])[source]#

Add exclusion criteria to the cohort.

Parameters:

exclusion_list (list) –

List of exclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for exclusion

  • operation (str): Operation to apply (e.g., “> 5”, “== True”)

  • label (str): Short label for the exclusion step

  • operations_done (str): Detailed description of what this exclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, can be used to plot exclusion chart

Return type:

ct (CohortTracker)

Note

By default, all exclusion criteria are applied from tmin=cohort.t_min to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.exclude_list([
...    {
...        "variable": "any_rrt_icu",
...        "operation": "true",
...        "label": "No RRT",
...        "operations_done": "Excluded RRT before hypernatremia"
...    },
...    {
...        "variable": "any_dx_tbi",
...        "operation": "true",
...        "label": "No TBI",
...        "operations_done": "Excluded TBI before hypernatremia"
...    },
...    {
...        "variable": NativeStatic(
...            var_name="sodium_count",
...            select="!count value",
...            base_var="blood_sodium"),
...        "operation": "< 1",
...        "label": "Final cohort",
...        "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...        "tmin": cohort.t_eligible,
...        "tmax": "hospital_discharge"
...    }
...  ])
>>> ct.create_flowchart() # Plot the exclusion flowchart
add_variable_definition(var_name, var_dict)[source]#

Add or update a local variable definition.

Parameters:
  • var_name (str) – Name of the variable.

  • var_dict (dict) – Dictionary containing variable definition. Can be partial - missing fields will be inherited from global definition.

Return type:

None

Examples

Add a completely new variable:

>>> cohort.add_variable_definition("my_new_var", {
...     "type": "native_dynamic",
...     "table": "it_ishmed_labor",
...     "where": "c_katalog_leistungtext LIKE '%new%'",
...     "value_dtype": "DOUBLE",
...     "cleaning": {"value": {"low": 100, "high": 150}}
... })

Partially override existing variable:

>>> cohort.add_variable_definition("blood_sodium", {
...     "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
... })
get_variable_definition(var_name)[source]#

Get the variable definition for a given variable name.

Parameters:

var_name (str) – Name of the variable to get variable definitions for.

Returns:

Dictionary of variable definitions per source.

Return type:

definition (dict[str, dict[str, Any]])

add_inclusion()[source]#
Return type:

None

add_exclusion()[source]#
Return type:

None

to_files(folder, ext)[source]#

Convenience method to save the cohort to various file formats

Parameters:
  • folder (str | PathLike[str]) – Path to the folder where the files will be saved.

  • ext (CohortExportFormats) – File extension to use for saving the files.

Return type:

None

to_csv(folder)[source]#

Save the cohort to CSV files.

Parameters:

folder (str | PathLike[str]) – Path to the folder where CSV files will be saved.

Return type:

None

Examples

>>> cohort.to_csv("output_data")
>>> # Creates:
>>> # output_data/_obs.csv
>>> # output_data/blood_sodium.csv
>>> # output_data/heart_rate.csv
>>> # ... (one file per variable)
to_parquet(folder)[source]#

Save the cohort to parquet files.

Parameters:

folder (str | PathLike[str]) – Path to the folder where parquet files will be saved.

Return type:

None

Examples

>>> cohort.to_parquet("output_data")
>>> # Creates:
>>> # output_data/_obs.parquet
>>> # output_data/blood_sodium.parquet
>>> # output_data/heart_rate.parquet
>>> # ... (one file per variable)
save(filename)[source]#

Save the cohort to a single compressed .corr3 archive (.tar.zst equivalent). Saves cohort.__dict__ (excluding obs, obsm, variables, conn) to state.pkl, obs to obs.parquet, and each obsm DataFrame to obsm_<var_name>.parquet in a temp dir.

Parameters:

filename (str | PathLike[str]) – Path to the .corr3 archive

Return type:

None

Returns:

None

classmethod load(filename, password=None)[source]#
Parameters:
  • filename (str | PathLike[str])

  • password (str | None)

Return type:

Cohort

property tmpdir_path#
to_widget(*exprs)[source]#
Parameters:

exprs (IntoExpr | Iterable[IntoExpr])

Return type:

ObsWidget

property widget: ObsWidget#
to_search_widget(include_sources=None)[source]#
Parameters:

include_sources (Iterable[str] | None)

Return type:

JsonWidget | JsonmWidget

property search_widget: JsonWidget | JsonmWidget#
unnest(column, prefix='', suffix='', renamer=None)[source]#

Unnest a struct column in the obs DataFrame.

Parameters:
  • column (str) – The column to unnest.

  • prefix (str) – The prefix to add to the unnested column names.

  • suffix (str) – The suffix to add to the unnested column names.

  • renamer (Sequence[str] | Callable[[str], str] | None) – The renamer function or list of new names.

Returns:

The obs DataFrame with the unnested column.

Return type:

obs (pl.DataFrame)
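
The effect of unnesting a struct column with a prefix or suffix can be illustrated on plain dictionaries. The helper below is a hypothetical stand-in that mirrors only the column, prefix and suffix parameters; it does not reproduce the renamer option or the column ordering of the real method.

```python
def unnest(rows, column, prefix="", suffix=""):
    """Replace a dict-valued column by one column per field, named prefix + field + suffix."""
    result = []
    for row in rows:
        # Copy every column except the nested one
        flat = {k: v for k, v in row.items() if k != column}
        for field, value in row[column].items():
            flat[f"{prefix}{field}{suffix}"] = value
        result.append(flat)
    return result


rows = [
    {
        "icu_stay_id": "C001_1",
        "first_sodium": {"value": 138, "recordtime": "2023-01-01 09:30:00"},
    }
]
flat_rows = unnest(rows, "first_sodium", prefix="first_sodium_")
print(flat_rows[0])
```

A prefix is useful when several struct columns share field names such as value and recordtime, since the unnested columns would otherwise clash.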

debug_print()[source]#

Print debug information about the cohort. Please use this if you are creating a GitHub issue.

Return type:

None

Returns:

None

to_stata(df=None, convert_dates=None, write_index=True, to_file=None)[source]#

Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.

Parameters:
  • df (pd.DataFrame | None) – The DataFrame to be converted to Stata format. Will default to the obs DataFrame if unspecified (default: None)

  • convert_dates (dict[Hashable, StataDateFormat]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | PathLike[str] | None) – Path to save as .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A Pandas Dataframe compatible with Stata if to_file is None.

Return type:

pd.DataFrame

property stata: DataFrame | None#
to_tableone(df=None, ignore_cols=None, filter_query=None, replace_booleans=('Yes', 'No'), display_all=True, groupby=None, normal_cols=None, overall=None, order=None, pval=False, **kwargs)[source]#

Create a TableOne object for the cohort.

Parameters:
  • df (pl.DataFrame | None) – The DataFrame to summarize. Will default to the obs DataFrame if unspecified (default: None)

  • ignore_cols (list | str | None) – Column(s) to ignore.

  • filter_query (str | None) – Filter to apply to the data.

  • replace_booleans (tuple[str, str] | None) – Replace booleans with the given strings.

  • display_all (bool) – Whether to display all columns.

  • groupby (str | None) – Column to group by.

  • normal_cols (list[str] | None) – Columns to treat as normally distributed.

  • overall (bool) – Whether to add an “overall” column to the table. If left unspecified the overall column will be dropped if groupby is specified.

  • order (dict[str, list[str]] | None) – Order of categorical columns.

  • pval (bool) – Whether to calculate p-values.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Examples

>>> tableone = cohort.to_tableone()
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")
property tableone: TableOne#
to_figureone()[source]#
Return type:

Digraph

property figureone: Digraph#