Legacy Interface (Pandas-Based)#

🔄 Migration Recommended

For new projects, use the Polars-native interface (from corr_vars import Cohort), which is 2-10x faster and more memory-efficient.

This legacy interface is provided for backward compatibility only.

Overview#

The legacy interface provides a Pandas-based wrapper around the new Polars-native CORR-Vars interface. It was created to maintain backward compatibility with existing code that uses Pandas DataFrame methods, allowing researchers to continue using their existing analysis pipelines without immediate rewrites.

Interface Comparison at a Glance#

| Aspect         | Legacy Interface 🐼                     | Polars-Native Interface ⚡     |
|----------------|-----------------------------------------|--------------------------------|
| Import         | from corr_vars.legacy_v1 import Cohort  | from corr_vars import Cohort   |
| Data Access    | cohort.obs (Pandas DataFrame)           | cohort.obs (Polars DataFrame)  |
| Performance    | Slower (conversion overhead)            | 2-10x faster                   |
| Memory Usage   | Higher (dual storage)                   | Lower (single storage)         |
| Syntax         | Familiar Pandas syntax                  | Modern Polars expressions      |
| Recommendation | Existing code migration                 | New projects                   |

What is the Legacy Interface?#

Note

Architecture Overview

The legacy interface is a compatibility layer that bridges old and new:

🐼 Your Pandas Code (.loc[], .groupby(), .head())
    → 🔄 Legacy Wrapper (automatic conversion)
        → ⚡ Polars Backend (fast & efficient)

Key Features:

🔄 Automatic Conversion

Seamlessly converts between Polars (internal) and Pandas (user-facing) representations

🐼 Familiar Syntax

Preserves familiar Pandas methods like .loc[], .iloc[], .groupby()

⚡ Modern Backend

Uses the new Polars backend internally for improved performance and stability

🔧 Backward Compatible

Maintains compatibility with existing analysis scripts and workflows

Using the Legacy Interface#

# Legacy interface (Pandas access)
from corr_vars.legacy_v1 import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    database="db_hypercapnia_prepared",
    password_file=True
)

# Pandas DataFrame access
print(type(cohort.obs))    # pandas.DataFrame wrapper
print(type(cohort.obsm))   # dict of pandas.DataFrame wrappers

# Polars-native interface (Recommended)
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "password_file": True}}
)

# Polars DataFrame access
print(type(cohort.obs))    # polars.DataFrame
print(type(cohort.obsm))   # dict of polars.DataFrame

💡 Quick Start Guide

Step 1: Import the legacy interface

Step 2: Create your cohort with familiar parameters

Step 3: Use standard Pandas syntax for analysis

Pandas-Style Data Access:

# Static data access (exactly like Pandas)
print(cohort.obs.head())                    # First 5 rows
print(cohort.obs.shape)                     # (n_rows, n_cols)
print(cohort.obs.columns.tolist())          # Column names

# Familiar Pandas indexing and filtering
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
males = cohort.obs.loc[cohort.obs["sex"] == "M"]
specific_patient = cohort.obs.iloc[0]

# Pandas aggregation methods
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})

Time-Series Data Access:

# Add dynamic variable
cohort.add_variable("blood_sodium")

# Access time-series data (Pandas DataFrame)
sodium_data = cohort.obsm["blood_sodium"]
print(type(sodium_data))  # <class 'LegacyObsmDataframe'> (behaves like pd.DataFrame)

# Familiar Pandas time-series operations
patient_data = sodium_data[sodium_data["icu_stay_id"] == "12345"]
daily_avg = sodium_data.groupby(sodium_data["recordtime"].dt.date)["value"].mean()

# Standard Pandas methods work
print(sodium_data.describe())
print(sodium_data.value_counts())

Column Assignment (Limited):

# Direct column assignment works for obs
cohort.obs["bmi_category"] = cohort.obs["weight"] / (cohort.obs["height"] / 100) ** 2
cohort.obs["is_elderly"] = cohort.obs["age_on_admission"] > 65

# Note: obsm DataFrames are read-only to prevent data corruption
# sodium_data["new_col"] = 1  # This will raise NotImplementedError

Pandas Method Compatibility#

✅ Full Pandas Compatibility

The legacy interface supports most common Pandas DataFrame methods out of the box! A short demonstration follows the lists below.

πŸ” Data Inspection
  • .head(), .tail(), .info()

  • .describe(), .shape, .columns

  • .nunique(), .value_counts()

  • .isnull(), .dtypes

🎯 Indexing & Selection
  • .loc[], .iloc[], .at[], .iat[]

  • .query(), boolean indexing

  • .filter(), .select_dtypes()

🔧 Data Manipulation
  • .groupby(), .pivot_table()

  • .merge(), .join()

  • .sort_values(), .drop()

  • .drop_duplicates()

📊 Statistical Methods
  • .mean(), .median(), .std()

  • .corr(), .agg(), .apply()

  • .transform()

⏰ Time-Series Methods
  • .resample(), .rolling()

  • .expanding()

  • DateTime indexing

📈 Visualization Ready
  • Direct plotting with matplotlib

  • Seaborn compatibility

  • Works with existing viz code
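
Since the wrapper behaves like a regular pandas.DataFrame, here are a few of these methods in action (a minimal sketch; the cohort and column names follow the earlier examples):

import matplotlib.pyplot as plt

obs = cohort.obs                          # pandas-style wrapper
print(obs.describe())                     # summary statistics
print(obs["sex"].value_counts())          # category counts

adults = obs.query("age_on_admission >= 18")
by_sex = adults.groupby("sex")["age_on_admission"].mean()

by_sex.plot(kind="bar")                   # plots directly via matplotlib
plt.ylabel("Mean age on admission")
plt.show()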

Limitations of the Legacy Interface#

Warning

Important Limitations to Consider

While the legacy interface maintains compatibility, it has several important limitations that may affect performance and functionality.

Detailed Limitation Analysis

Understanding these limitations will help you decide when to migrate to the Polars-native interface.

Performance Limitations#

# Legacy interface: data conversion overhead
from corr_vars.legacy_v1 import Cohort
large_cohort = Cohort(obs_level="icu_stay", load_default_vars=True)  # Slower

# Polars-native: direct access, no conversion
from corr_vars import Cohort as PolarsCohort
fast_cohort = PolarsCohort(obs_level="icu_stay", load_default_vars=True)  # Faster

  1. Memory Overhead: Data is stored in Polars but converted to Pandas for access, requiring additional memory

  2. Conversion Costs: Each access to .obs or .obsm triggers a Polars → Pandas conversion (see the caching sketch after this list)

  3. Large Dataset Issues: Very large cohorts may hit memory limits during conversion

  4. Slower Operations: Pandas operations are generally slower than equivalent Polars operations
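
Because each attribute access re-converts the data (item 2 above), a cheap mitigation is to bind the converted frame once and reuse it; a sketch, assuming the per-access conversion behaves as described:

# Each access to cohort.obs converts Polars -> Pandas again.
obs = cohort.obs                                   # one conversion
adults = obs[obs["age_on_admission"] >= 18]        # reuses the cached frame
summary = obs.groupby("sex")["age_on_admission"].mean()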

Functional Limitations#

# 1. Limited obsm modification
cohort.obsm["blood_sodium"]["new_column"] = 1  # NotImplementedError

# 2. No direct polars access
# cohort._obs.filter(pl.col("age") > 18)  # Not recommended, internal API

# 3. Some advanced Polars features unavailable
# No lazy evaluation, no expression API
  1. Read-Only obsm: Time-series DataFrames (obsm) are read-only to prevent data corruption (a copy-based workaround is sketched after this list)

  2. No Polars Expression API: Cannot use Polars' powerful expression syntax

  3. No Lazy Evaluation: Cannot benefit from Polars' lazy evaluation optimizations

  4. Limited Parallel Processing: Pandas operations are less optimized for parallel execution
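
To derive new columns from time-series data despite the read-only wrapper, work on a detached copy; a minimal sketch, assuming .copy() is among the supported pandas methods:

# obsm frames are read-only; derive new columns on a local copy instead.
sodium = cohort.obsm["blood_sodium"].copy()   # detached pandas DataFrame
sodium["is_high"] = sodium["value"] > 145     # modifies only the copy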

Data Type Limitations#

# Some Polars data types don't translate perfectly to Pandas
# May lose precision or type information in edge cases
print(cohort.obs.dtypes)  # May show different types than native Polars
  1. Type Conversion Issues: Some Polars types may not translate perfectly to Pandas (a concrete example follows this list)

  2. Precision Loss: Potential precision loss in numeric conversions

  3. Missing Value Handling: Different null/missing value semantics between libraries
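
A concrete instance of items 1 and 3: a Polars integer column containing nulls becomes a float column with NaN after a default conversion to Pandas. A standalone sketch:

import polars as pl

df = pl.DataFrame({"x": [1, 2, None]})   # x is Int64 with a null
print(df.schema)                         # x: Int64

pdf = df.to_pandas()                     # default conversion
print(pdf.dtypes)                        # x: float64 -- the null became NaN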

Migration Guide: Legacy β†’ Polars-Native#

Step 1: Update Imports

# Before (Legacy)
from corr_vars.legacy_v1 import Cohort

# After (Polars-native)
from corr_vars import Cohort

Step 2: Update Data Access Patterns

# Legacy Pandas syntax
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
male_patients = cohort.obs.loc[cohort.obs["sex"] == "M"]

# Polars-native equivalent
import polars as pl

adults = cohort.obs.filter(pl.col("age_on_admission") >= 18)
male_patients = cohort.obs.filter(pl.col("sex") == "M")

Step 3: Update Aggregations

# Legacy Pandas groupby
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})

# Polars-native equivalent
summary = cohort.obs.group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("age_mean"),
    pl.col("age_on_admission").std().alias("age_std"),
    pl.col("inhospital_death").sum().alias("deaths")
])

Step 4: Update Time-Series Operations

# Legacy Pandas time-series
patient_data = cohort.obsm["blood_sodium"][
    cohort.obsm["blood_sodium"]["icu_stay_id"] == "12345"
]

# Polars-native equivalent
patient_data = cohort.obsm["blood_sodium"].filter(
    pl.col("icu_stay_id") == "12345"
)

Benefits of Migration:

  1. 2-10x Performance Improvement for most operations

  2. Lower Memory Usage (no conversion overhead)

  3. Better Type Safety and error handling

  4. Access to Modern Features like lazy evaluation and the expression API (see the sketch after this list)

  5. Future-Proof Code, as the legacy interface may be deprecated
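
Lazy evaluation (item 4) is available directly on the Polars obs frame; a minimal sketch, assuming cohort.obs is a polars.DataFrame as shown earlier:

import polars as pl

# Build the whole query lazily; Polars optimizes the plan before running it.
summary = (
    cohort.obs.lazy()
    .filter(pl.col("age_on_admission") >= 18)
    .group_by("sex")
    .agg(pl.col("inhospital_death").sum().alias("deaths"))
    .collect()  # executes the optimized plan
)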

When to Use Legacy vs. Polars-Native#

🐼 Use Legacy Interface When:
✅ Migrating Existing Code

You have extensive Pandas-based analysis pipelines

✅ Team Training Time

Your team needs time to learn Polars syntax

✅ External Dependencies

Your code integrates with Pandas-only libraries

✅ Proof of Concepts

Quick prototyping with familiar syntax

⚠️ Temporary Migration Step

Use as a stepping stone to Polars-native

⚡ Use Polars-Native Interface When:
✅ New Projects

Starting fresh analysis projects

✅ Performance Critical

Working with large datasets or complex operations

✅ Memory Constrained

Limited memory environments

✅ Production Code

Building robust, long-term analysis pipelines

✅ Modern Features

Want to leverage advanced Polars capabilities

🚀 Future-Proof Choice

Recommended for all new development

🎯 Decision Matrix

| Your Situation           | Legacy Interface 🐼        | Polars-Native ⚡           |
|--------------------------|----------------------------|----------------------------|
| New research project     | ❌ Not recommended         | ✅ Recommended             |
| Existing Pandas codebase | ✅ Good transition option  | 🔄 Migrate gradually       |
| Large datasets (>1GB)    | ⚠️ Performance issues      | ✅ Optimal performance     |
| Team learning curve      | ✅ Familiar syntax         | 📚 Investment in learning  |
| Production deployment    | ⚠️ May be deprecated       | ✅ Future-proof            |

Example: Side-by-Side Comparison#

πŸ” Real-World Performance Example

Both examples below produce identical results, but with very different performance characteristics; a rough timing harness follows them.

# --- Legacy version ---
from corr_vars.legacy_v1 import Cohort

# Create cohort (slower initialization)
cohort = Cohort(obs_level="icu_stay", database="db_hypercapnia_prepared")

# Pandas-style analysis (familiar syntax)
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
summary = adults.groupby("sex").agg({
    "age_on_admission": "mean",
    "inhospital_death": "sum"
})

# Time-series analysis
sodium = cohort.obsm["blood_sodium"]
patient_trends = sodium.groupby("icu_stay_id")["value"].agg(["first", "last", "mean"])

# --- Polars-native version ---
from corr_vars import Cohort
import polars as pl

# Create cohort (faster initialization)
cohort = Cohort(obs_level="icu_stay", sources={"cub_hdp": {"database": "db_hypercapnia_prepared"}})

# Polars-style analysis (faster execution)
summary = cohort.obs.filter(pl.col("age_on_admission") >= 18).group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("mean_age"),
    pl.col("inhospital_death").sum().alias("deaths")
])

# Time-series analysis (more efficient)
patient_trends = cohort.obsm["blood_sodium"].group_by("icu_stay_id").agg([
    pl.col("value").first().alias("first_sodium"),
    pl.col("value").last().alias("last_sodium"),
    pl.col("value").mean().alias("mean_sodium")
])
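
To see the gap on your own data, a rough timing harness (a sketch; legacy_cohort and polars_cohort stand for the two cohorts built above, which both snippets named cohort):

import time
import polars as pl

t0 = time.perf_counter()
_ = legacy_cohort.obs.groupby("sex")["age_on_admission"].mean()
t_legacy = time.perf_counter() - t0

t0 = time.perf_counter()
_ = polars_cohort.obs.group_by("sex").agg(pl.col("age_on_admission").mean())
t_polars = time.perf_counter() - t0

print(f"legacy: {t_legacy:.3f}s  polars: {t_polars:.3f}s")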

🚀 Ready to Migrate?

Start your migration with the Tutorials and Getting Started pages, and explore the Custom Variables Guide to learn modern Polars patterns!


API Reference#

class corr_vars.legacy_v1.Cohort(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#

Bases: Cohort

Legacy class to build a cohort in the CORR database. This version uses Pandas to access .obs and .obsm attributes. Please migrate to the new version of the cohort class at your earliest convenience.

Parameters:
  • conn_args (dict) – Dictionary of database credentials [remote_hostname (str), username (str)] (default: {}).

  • logger_args (dict) – Dictionary of logging configurations [level (int), file_path (str), file_mode (str), verbose_fmt (bool), colored_output (bool), formatted_numbers (bool)] (default: {}).

  • password_file (str | bool) – Path to the password file, or True if your file is in ~/password.txt (default: None).

  • database (Literal["db_hypercapnia_prepared", "db_corror_prepared"]) – Database to use (default: "db_hypercapnia_prepared").

  • extraction_end_date (str) – Deprecated as of Feb 6, 2025. This was used to set the end date for the extraction. Use filters instead (default: None).

  • obs_level (Literal["icu_stay", "hospital_stay", "procedure"]) – Observation level (default: "icu_stay").

  • project_vars (dict) – Dictionary with local variable definitions (default: {}).

  • merge_consecutive (bool) – Whether to merge consecutive ICU stays (default: True). Does not apply to any other obs_level.

  • load_default_vars (bool) – Whether to load the default variables (default: True).

  • filters (str) – Initial filters (must be a valid SQL WHERE clause for the it_ishmed_fall table) (default: "").

obs#

Static data for each observation. Contains one row per observation (e.g., ICU stay) with columns for static variables like demographics and outcomes.

Example

>>> cohort.obs
patient_id  case_id icu_stay_id            icu_admission        icu_discharge sex   ... inhospital_death
0  P001         C001    C001_1       2023-01-01 08:30:00  2023-01-03 12:00:00   M   ...  False
1  P001         C001    C001_2       2023-01-03 14:20:00  2023-01-05 16:30:00   M   ...  False
2  P002         C002    C002_1       2023-01-02 09:15:00  2023-01-04 10:30:00   F   ...  False
3  P003         C003    C003_1       2023-01-04 11:45:00  2023-01-07 13:20:00   F   ...  True
...
Type:

pd.DataFrame

obsm#

Dynamic data stored as a dictionary of DataFrames. Each DataFrame contains time-series data for a variable with columns:

  • recordtime: Timestamp of the measurement

  • value: Value of the measurement

  • recordtime_end: End time (only for duration-based variables like therapies)

  • description: Additional information (e.g., medication names)

Example

>>> cohort.obsm["blood_sodium"]
   icu_stay_id          recordtime  value
0  C001_1      2023-01-01 09:30:00   138
1  C001_1      2023-01-02 10:15:00   141
2  C001_2      2023-01-03 15:00:00   137
3  C002_1      2023-01-02 10:00:00   142
4  C003_1      2023-01-04 12:30:00   139
...
Type:

dict of pd.DataFrame

variables#

Dictionary of all variable objects in the cohort. This is used to keep track of variable metadata.

Type:

dict of Variable

Notes

  • For large cohorts, set load_default_vars=False to speed up the extraction. You can use pre-extracted cohorts as starting points and load them using Cohort.load().

  • Variables can be added using cohort.add_variable(). Static variables will be added to obs, dynamic variables to obsm.

  • filters also allows a special shorthand "_dx" to extract the hospital admissions of the last x months, useful for debugging/prototyping. For example, use "_d2" to extract every admission of the last 2 months.
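
For example, a quick prototyping cohort restricted to the last 2 months (a sketch; constructor arguments as documented above):

>>> cohort = Cohort(obs_level="icu_stay", password_file=True, filters="_d2")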

Examples

Create a new cohort:

>>> cohort = Cohort(obs_level="icu_stay",
...                 database="db_hypercapnia_prepared",
...                 load_default_vars=False,
...                 password_file=True)

Access static data:

>>> cohort.obs["age_on_admission"]  # Get age for all patients
>>> cohort.obs.loc[cohort.obs["sex"] == "M"]  # Filter for male patients

Access time-series data:

>>> cohort.obsm["blood_sodium"]  # Get all blood sodium measurements
>>> # Get blood sodium measurements for a specific observation
>>> cohort.obsm["blood_sodium"].loc[
...     cohort.obsm["blood_sodium"][cohort.primary_key] == "12345"
... ]
__init__(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#
Parameters:
  • conn_args (dict)

  • logger_args (dict)

  • password_file (Union[str, bool, None])

  • database (Literal['db_hypercapnia_prepared', 'db_corror_prepared'])

  • extraction_end_date (Optional[str])

  • obs_level (Literal['icu_stay', 'hospital_stay', 'procedure'])

  • project_vars (dict)

  • merge_consecutive (bool)

  • load_default_vars (bool)

  • filters (str)

add_inclusion(inclusion_list=[])[source]#

Add inclusion criteria to the cohort.

Parameters:

inclusion_list (list) –

List of inclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for inclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the inclusion step

  • operations_done (str): Detailed description of what this inclusion step does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the inclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.add_inclusion([
...    {
...        "variable": "age_on_admission",
...        "operation": ">= 18",
...        "label": "Adult patients",
...        "operations_done": "Excluded patients under 18 years old"
...    }
...  ])
>>> ct.create_flowchart()
add_exclusion(exclusion_list=[])[source]#

Add exclusion criteria to the cohort.

Parameters:

exclusion_list (list) –

List of exclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for exclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the exclusion step

  • operations_done (str): Detailed description of what this exclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the exclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.add_exclusion([
...    {
...        "variable": "any_rrt_icu",
...        "operation": "true",
...        "label": "No RRT",
...        "operations_done": "Excluded RRT before hypernatremia"
...    },
...    {
...        "variable": "any_dx_tbi",
...        "operation": "true",
...        "label": "No TBI",
...        "operations_done": "Excluded TBI before hypernatremia"
...    },
...    {
...        "variable": NativeStatic(
...            var_name="sodium_count",
...            select="!count value",
...            base_var="blood_sodium"),
...        "operation": "< 1",
...        "label": "Final cohort",
...        "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...        "tmin": cohort.t_eligible,
...        "tmax": "hospital_discharge"
...    }
...  ])
>>> ct.create_flowchart() # Plot the exclusion flowchart
save(filename, legacy=False)[source]#

Save the cohort to a .corr2 archive.

Parameters:
  • filename (str) – Path to the .corr2 archive.

  • legacy (bool)

classmethod load(filename, password_file=None)[source]#

Load a cohort from a pickle file. If this file was saved by a different user, you need to pass your database credentials to the function.

Parameters:
  • filename (str) – Path to the pickle file.

  • password_file (Optional[str]) – Path to the password file.

Returns:

A new Cohort object.

Return type:

Cohort

add_variable(variable, save_as=None, tmin=None, tmax=None)[source]#

Add a variable to the cohort.

You may specify tmin and tmax as a tuple (e.g. ("hospital_admission", "+1d")), in which case it will be relative to the hospital admission time of the patient.

Parameters:
  • variable (str | Variable | MultiSourceVariable) – Variable to add. Either a string with the variable name (from vars.json) or a Variable object.

  • save_as (Optional[str]) – Name of the column to save the variable as. Defaults to the variable name.

  • tmin (Union[str, tuple[str, str], None]) – Name of the column to use as tmin, or a tuple (see description).

  • tmax (Union[str, tuple[str, str], None]) – Name of the column to use as tmax, or a tuple (see description).

Returns:

The variable object.

Return type:

Variable

Examples

>>> cohort.add_variable("blood_sodium")
>>> cohort.add_variable(
...    variable="any_dx_covid_19",
...    tmin=("hospital_admission", "-1d"),
...    tmax=cohort.t_eligible
... )
>>> cohort.add_variable(
...    NativeStatic(
...        var_name="highest_hct_before_eligible",
...        select="!max value",
...        base_var='blood_hematokrit',
...        tmax=cohort.t_eligible
...    )
... )
>>> cohort.add_variable(
...    variable='any_med_glu',
...    save_as="glucose_prior_eligible",
...    tmin=(cohort.t_eligible, "-48h"),
...    tmax=cohort.t_eligible
... )
add_variable_definition(var_name, var_dict)[source]#

Add or update a local variable definition.

Parameters:
  • var_name (str) – Name of the variable.

  • var_dict (dict) – Dictionary containing the variable definition. Can be partial; missing fields will be inherited from the global definition.

Return type:

None

Examples

Add a completely new variable:

>>> cohort.add_variable_definition("my_new_var", {
...     "type": "native_dynamic",
...     "table": "it_ishmed_labor",
...     "where": "c_katalog_leistungtext LIKE '%new%'",
...     "value_dtype": "DOUBLE",
...     "cleaning": {"value": {"low": 100, "high": 150}}
... })

Partially override existing variable:

>>> cohort.add_variable_definition("blood_sodium", {
...     "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
... })
change_tracker(description, mode='include')[source]#

Return a context manager to group cohort edits and record a single ChangeTracker state on exit.

Example

>>> with cohort.change_tracker("Adults", mode="include") as track:
...     track.filter(pl.col("age_on_admission") >= 18)

Parameters:
  • description (str)

  • mode (Literal['include', 'exclude'])

debug_print()[source]#

Print debug information about the cohort. Please use this if you are creating a GitHub issue.

Return type:

None

Returns:

None

exclude(*args, **kwargs)[source]#

Add an exclusion criterion to the cohort. It is recommended to use Cohort.exclude_list() and add all of your exclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.exclude_list() before calling Cohort.exclude() to ensure that the exclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as the value for operation, it will be converted to "== True" or "== False", respectively.

Examples

>>> cohort.exclude(
...    variable="elix_total",
...    operation="> 20",
...    operations_done="Exclude patients with high Elixhauser score"
... )
exclude_list(exclusion_list=[])[source]#

Add exclusion criteria to the cohort.

Parameters:

exclusion_list (list) –

List of exclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for exclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the exclusion step

  • operations_done (str): Detailed description of what this exclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the exclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.exclude_list([
...    {
...        "variable": "any_rrt_icu",
...        "operation": "true",
...        "label": "No RRT",
...        "operations_done": "Excluded RRT before hypernatremia"
...    },
...    {
...        "variable": "any_dx_tbi",
...        "operation": "true",
...        "label": "No TBI",
...        "operations_done": "Excluded TBI before hypernatremia"
...    },
...    {
...        "variable": NativeStatic(
...            var_name="sodium_count",
...            select="!count value",
...            base_var="blood_sodium"),
...        "operation": "< 1",
...        "label": "Final cohort",
...        "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...        "tmin": cohort.t_eligible,
...        "tmax": "hospital_discharge"
...    }
...  ])
>>> ct.create_flowchart() # Plot the exclusion flowchart
include(*args, **kwargs)[source]#

Add an inclusion criterion to the cohort. It is recommended to use Cohort.include_list() and add all of your inclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.include_list() before calling Cohort.include() to ensure that the inclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as the value for operation, it will be converted to "== True" or "== False", respectively.

Examples

>>> cohort.include(
...    variable="age_on_admission",
...    operation=">= 18",
...    label="Adult",
...    operations_done="Include only adult patients"
... )
include_list(inclusion_list=[])[source]#

Add inclusion criteria to the cohort.

Parameters:

inclusion_list (list) –

List of inclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for inclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the inclusion step

  • operations_done (str): Detailed description of what this inclusion step does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the inclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.include_list([
...    {
...        "variable": "age_on_admission",
...        "operation": ">= 18",
...        "label": "Adult patients",
...        "operations_done": "Excluded patients under 18 years old"
...    }
...  ])
>>> ct.create_flowchart()
load_default_vars()[source]#

Load the default variables defined in vars.json. It is recommended to use this after filtering your cohort for eligibility to speed up the process.

Returns:

Variables are loaded into the cohort.

Return type:

None

Examples

>>> # Load default variables for an ICU cohort
>>> cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
>>> # Apply filters first (faster)
>>> cohort.include_list([
...     {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
... ])
>>> # Then load default variables
>>> cohort.load_default_vars()
set_t_eligible(t_eligible, drop_ineligible=True)[source]#

Set the time anchor for eligibility. This can be referenced as cohort.t_eligible throughout the process and is required to add inclusion or exclusion criteria.

Parameters:
  • t_eligible (str) – Name of the column to use as t_eligible.

  • drop_ineligible (bool) – Whether to drop ineligible patients. Defaults to True.

Returns:

t_eligible is set.

Return type:

None

Examples

>>> # Add a suitable time-anchor variable
>>> cohort.add_variable(NativeStatic(
...    var_name="spo2_lt_90",
...    base_var="spo2",
...    select="!first recordtime",
...    where="value < 90",
... ))
>>> # Set the time anchor for eligibility
>>> cohort.set_t_eligible("spo2_lt_90")
set_t_outcome(t_outcome)[source]#

Set the time anchor for outcome. This can be referenced as cohort.t_outcome throughout the process and is recommended to specify for your study.

Parameters:

t_outcome (str) – Name of the column to use as t_outcome.

Returns:

t_outcome is set.

Return type:

None

Examples

>>> cohort.set_t_outcome("hospital_discharge")
property stata: DataFrame | None#

Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).

  • convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A Pandas DataFrame compatible with Stata if to_file is None.

Return type:

pd.DataFrame
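
A notebook sketch of the %%stata handoff described above (assumes pystata's Stata magic is configured; the cell layout and the -d flag follow pystata's documented interface):

# Cell 1: bind the converted frame to a top-level name
df = cohort.stata

# Cell 2: hand the frame to Stata via the pystata magic
%%stata -d df
summarize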

property tableone: TableOne#

Create a TableOne object for the cohort.

Parameters:
  • ignore_cols (list | str) – Column(s) to ignore.

  • groupby (str) – Column to group by.

  • filter (str) – Filter to apply to the data.

  • pval (bool) – Whether to calculate p-values.

  • normal_cols (list[str]) – Columns to treat as normally distributed.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Examples

>>> tableone = cohort.tableone
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")
to_csv(folder)[source]#

Save the cohort to CSV files.

Parameters:

folder (str) – Path to the folder where CSV files will be saved.

Return type:

None

Examples

>>> cohort.to_csv("output_data")
>>> # Creates:
>>> # output_data/_obs.csv
>>> # output_data/blood_sodium.csv
>>> # output_data/heart_rate.csv
>>> # ... (one file per variable)
to_stata(df=None, convert_dates=None, write_index=True, to_file=None)[source]#

Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).

  • convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A Pandas DataFrame compatible with Stata if to_file is None.

Return type:

pd.DataFrame

to_tableone(ignore_cols=[], groupby=None, filter=None, pval=False, normal_cols=[], **kwargs)[source]#

Create a TableOne object for the cohort.

Parameters:
  • ignore_cols (list | str) – Column(s) to ignore.

  • groupby (str) – Column to group by.

  • filter (str) – Filter to apply to the data.

  • pval (bool) – Whether to calculate p-values.

  • normal_cols (list[str]) – Columns to treat as normally distributed.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Examples

>>> tableone = cohort.to_tableone()
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")