Legacy Interface (Pandas-Based)#

🔄 Migration Recommended

For new projects, use the Polars-native interface (from corr_vars import Cohort), which is 2-10x faster and more memory-efficient.

This legacy interface is provided for backward compatibility only.

Overview#

The legacy interface provides a Pandas-based wrapper around the new Polars-native CORR-Vars interface. It was created to maintain backward compatibility with existing code that uses Pandas DataFrame methods, allowing researchers to continue using their existing analysis pipelines without immediate rewrites.

Interface Comparison at a Glance#

| Aspect         | Legacy Interface 🐼                     | Polars-Native Interface ⚡     |
|----------------|-----------------------------------------|--------------------------------|
| Import         | from corr_vars.legacy_v1 import Cohort  | from corr_vars import Cohort   |
| Data Access    | cohort.obs (Pandas DataFrame)           | cohort.obs (Polars DataFrame)  |
| Performance    | Slower (conversion overhead)            | 2-10x faster                   |
| Memory Usage   | Higher (dual storage)                   | Lower (single storage)         |
| Syntax         | Familiar Pandas syntax                  | Modern Polars expressions      |
| Recommendation | Existing code migration                 | New projects                   |

What is the Legacy Interface?#

Note

Architecture Overview

The legacy interface is a compatibility layer that bridges old and new:

🐼 Your Pandas Code (.loc[], .groupby(), .head())
    → 🔄 Legacy Wrapper (automatic conversion)
        → ⚡ Polars Backend (fast & efficient)

Key Features:

🔄 Automatic Conversion

Seamlessly converts between Polars (internal) and Pandas (user-facing) representations

🐼 Familiar Syntax

Preserves familiar Pandas methods like .loc[], .iloc[], .groupby()

⚡ Modern Backend

Uses the new Polars backend internally for improved performance and stability

🔧 Backward Compatible

Maintains compatibility with existing analysis scripts and workflows

Using the Legacy Interface#

# Legacy interface (Pandas access)
from corr_vars.legacy_v1 import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    database="db_hypercapnia_prepared",
    password_file=True
)

# Pandas DataFrame access
print(type(cohort.obs))    # pandas.DataFrame wrapper
print(type(cohort.obsm))   # dict of pandas.DataFrame wrappers

# Polars-native interface (Recommended)
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "password_file": True}}
)

# Polars DataFrame access
print(type(cohort.obs))    # polars.DataFrame
print(type(cohort.obsm))   # dict of polars.DataFrame

💡 Quick Start Guide

Step 1: Import the legacy interface

Step 2: Create your cohort with familiar parameters

Step 3: Use standard Pandas syntax for analysis

Pandas-Style Data Access:

# Static data access (exactly like Pandas)
print(cohort.obs.head())                    # First 5 rows
print(cohort.obs.shape)                     # (n_rows, n_cols)
print(cohort.obs.columns.tolist())          # Column names

# Familiar Pandas indexing and filtering
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
males = cohort.obs.loc[cohort.obs["sex"] == "M"]
specific_patient = cohort.obs.iloc[0]

# Pandas aggregation methods
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})

Time-Series Data Access:

# Add dynamic variable
cohort.add_variable("blood_sodium")

# Access time-series data (Pandas DataFrame)
sodium_data = cohort.obsm["blood_sodium"]
print(type(sodium_data))  # <class 'LegacyObsmDataframe'> (behaves like pd.DataFrame)

# Familiar Pandas time-series operations
patient_data = sodium_data[sodium_data["icu_stay_id"] == "12345"]
daily_avg = sodium_data.groupby(sodium_data["recordtime"].dt.date)["value"].mean()

# Standard Pandas methods work
print(sodium_data.describe())
print(sodium_data.value_counts())

Column Assignment (Limited):

# Direct column assignment works for obs
cohort.obs["bmi_category"] = cohort.obs["weight"] / (cohort.obs["height"] / 100) ** 2
cohort.obs["is_elderly"] = cohort.obs["age_on_admission"] > 65

# Note: obsm DataFrames are read-only to prevent data corruption
# sodium_data["new_col"] = 1  # This will raise NotImplementedError

Pandas Method Compatibility#

✅ Full Pandas Compatibility

The legacy interface supports most common Pandas DataFrame methods out of the box! A short demonstration follows the lists below.

πŸ” Data Inspection
  • .head(), .tail(), .info()

  • .describe(), .shape, .columns

  • .nunique(), .value_counts()

  • .isnull(), .dtypes

🎯 Indexing & Selection
  • .loc[], .iloc[], .at[], .iat[]

  • .query(), boolean indexing

  • .filter(), .select_dtypes()

🔧 Data Manipulation
  • .groupby(), .pivot_table()

  • .merge(), .join()

  • .sort_values(), .drop()

  • .drop_duplicates()

📊 Statistical Methods
  • .mean(), .median(), .std()

  • .corr(), .agg(), .apply()

  • .transform()

⏰ Time-Series Methods
  • .resample(), .rolling()

  • .expanding()

  • DateTime indexing

📈 Visualization Ready
  • Direct plotting with matplotlib

  • Seaborn compatibility

  • Works with existing viz code
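
Since the wrapper behaves like a regular pandas.DataFrame, here are a few of these methods in action (a minimal sketch; the cohort and column names follow the earlier examples):

import matplotlib.pyplot as plt

obs = cohort.obs                          # pandas-style wrapper
print(obs.describe())                     # summary statistics
print(obs["sex"].value_counts())          # category counts

adults = obs.query("age_on_admission >= 18")
by_sex = adults.groupby("sex")["age_on_admission"].mean()

by_sex.plot(kind="bar")                   # plots directly via matplotlib
plt.ylabel("Mean age on admission")
plt.show()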

Limitations of the Legacy Interface#

Warning

Important Limitations to Consider

While the legacy interface maintains compatibility, it has several important limitations that may affect performance and functionality.

Detailed Limitation Analysis

Understanding these limitations will help you decide when to migrate to the Polars-native interface.

Performance Limitations#

# Legacy interface: data conversion overhead
from corr_vars.legacy_v1 import Cohort
large_cohort = Cohort(obs_level="icu_stay", load_default_vars=True)  # Slower

# Polars-native: direct access, no conversion
from corr_vars import Cohort as PolarsCohort
fast_cohort = PolarsCohort(obs_level="icu_stay", load_default_vars=True)  # Faster

  1. Memory Overhead: Data is stored in Polars but converted to Pandas for access, requiring additional memory

  2. Conversion Costs: Each access to .obs or .obsm triggers a Polars → Pandas conversion (see the caching sketch after this list)

  3. Large Dataset Issues: Very large cohorts may hit memory limits during conversion

  4. Slower Operations: Pandas operations are generally slower than equivalent Polars operations
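
Because each attribute access re-converts the data (item 2 above), a cheap mitigation is to bind the converted frame once and reuse it; a sketch, assuming the per-access conversion behaves as described:

# Each access to cohort.obs converts Polars -> Pandas again.
obs = cohort.obs                                   # one conversion
adults = obs[obs["age_on_admission"] >= 18]        # reuses the cached frame
summary = obs.groupby("sex")["age_on_admission"].mean()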

Functional Limitations#

# 1. Limited obsm modification
cohort.obsm["blood_sodium"]["new_column"] = 1  # NotImplementedError

# 2. No direct polars access
# cohort._obs.filter(pl.col("age") > 18)  # Not recommended, internal API

# 3. Some advanced Polars features unavailable
# No lazy evaluation, no expression API
  1. Read-Only obsm: Time-series DataFrames (obsm) are read-only to prevent data corruption (a copy-based workaround is sketched after this list)

  2. No Polars Expression API: Cannot use Polars' powerful expression syntax

  3. No Lazy Evaluation: Cannot benefit from Polars' lazy evaluation optimizations

  4. Limited Parallel Processing: Pandas operations are less optimized for parallel execution
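
To derive new columns from time-series data despite the read-only wrapper, work on a detached copy; a minimal sketch, assuming .copy() is among the supported pandas methods:

# obsm frames are read-only; derive new columns on a local copy instead.
sodium = cohort.obsm["blood_sodium"].copy()   # detached pandas DataFrame
sodium["is_high"] = sodium["value"] > 145     # modifies only the copy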

Data Type Limitations#

# Some Polars data types don't translate perfectly to Pandas
# May lose precision or type information in edge cases
print(cohort.obs.dtypes)  # May show different types than native Polars
  1. Type Conversion Issues: Some Polars types may not translate perfectly to Pandas (a concrete example follows this list)

  2. Precision Loss: Potential precision loss in numeric conversions

  3. Missing Value Handling: Different null/missing value semantics between libraries
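
A concrete instance of items 1 and 3: a Polars integer column containing nulls becomes a float column with NaN after a default conversion to Pandas. A standalone sketch:

import polars as pl

df = pl.DataFrame({"x": [1, 2, None]})   # x is Int64 with a null
print(df.schema)                         # x: Int64

pdf = df.to_pandas()                     # default conversion
print(pdf.dtypes)                        # x: float64 -- the null became NaN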

Migration Guide: Legacy β†’ Polars-Native#

Step 1: Update Imports

# Before (Legacy)
from corr_vars.legacy_v1 import Cohort

# After (Polars-native)
from corr_vars import Cohort

Step 2: Update Data Access Patterns

# Legacy Pandas syntax
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
male_patients = cohort.obs.loc[cohort.obs["sex"] == "M"]

# Polars-native equivalent
import polars as pl

adults = cohort.obs.filter(pl.col("age_on_admission") >= 18)
male_patients = cohort.obs.filter(pl.col("sex") == "M")

Step 3: Update Aggregations

# Legacy Pandas groupby
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})

# Polars-native equivalent
summary = cohort.obs.group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("age_mean"),
    pl.col("age_on_admission").std().alias("age_std"),
    pl.col("inhospital_death").sum().alias("deaths")
])

Step 4: Update Time-Series Operations

# Legacy Pandas time-series
patient_data = cohort.obsm["blood_sodium"][
    cohort.obsm["blood_sodium"]["icu_stay_id"] == "12345"
]

# Polars-native equivalent
patient_data = cohort.obsm["blood_sodium"].filter(
    pl.col("icu_stay_id") == "12345"
)

Benefits of Migration:

  1. 2-10x Performance Improvement for most operations

  2. Lower Memory Usage (no conversion overhead)

  3. Better Type Safety and error handling

  4. Access to Modern Features like lazy evaluation and the expression API (see the sketch after this list)

  5. Future-Proof Code, as the legacy interface may be deprecated
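
Lazy evaluation (item 4) is available directly on the Polars obs frame; a minimal sketch, assuming cohort.obs is a polars.DataFrame as shown earlier:

import polars as pl

# Build the whole query lazily; Polars optimizes the plan before running it.
summary = (
    cohort.obs.lazy()
    .filter(pl.col("age_on_admission") >= 18)
    .group_by("sex")
    .agg(pl.col("inhospital_death").sum().alias("deaths"))
    .collect()  # executes the optimized plan
)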

When to Use Legacy vs. Polars-Native#

🐼 Use Legacy Interface When:
✅ Migrating Existing Code

You have extensive Pandas-based analysis pipelines

✅ Team Training Time

Your team needs time to learn Polars syntax

✅ External Dependencies

Your code integrates with Pandas-only libraries

✅ Proof of Concepts

Quick prototyping with familiar syntax

⚠️ Temporary Migration Step

Use as a stepping stone to Polars-native

⚡ Use Polars-Native Interface When:
✅ New Projects

Starting fresh analysis projects

✅ Performance Critical

Working with large datasets or complex operations

✅ Memory Constrained

Limited memory environments

✅ Production Code

Building robust, long-term analysis pipelines

✅ Modern Features

Want to leverage advanced Polars capabilities

🚀 Future-Proof Choice

Recommended for all new development

🎯 Decision Matrix

| Your Situation           | Legacy Interface 🐼        | Polars-Native ⚡           |
|--------------------------|----------------------------|----------------------------|
| New research project     | ❌ Not recommended         | ✅ Recommended             |
| Existing Pandas codebase | ✅ Good transition option  | 🔄 Migrate gradually       |
| Large datasets (>1GB)    | ⚠️ Performance issues      | ✅ Optimal performance     |
| Team learning curve      | ✅ Familiar syntax         | 📚 Investment in learning  |
| Production deployment    | ⚠️ May be deprecated       | ✅ Future-proof            |

Example: Side-by-Side Comparison#

πŸ” Real-World Performance Example

Both examples below produce identical results, but with very different performance characteristics; a rough timing harness follows them.

# --- Legacy version ---
from corr_vars.legacy_v1 import Cohort

# Create cohort (slower initialization)
cohort = Cohort(obs_level="icu_stay", database="db_hypercapnia_prepared")

# Pandas-style analysis (familiar syntax)
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
summary = adults.groupby("sex").agg({
    "age_on_admission": "mean",
    "inhospital_death": "sum"
})

# Time-series analysis
sodium = cohort.obsm["blood_sodium"]
patient_trends = sodium.groupby("icu_stay_id")["value"].agg(["first", "last", "mean"])

# --- Polars-native version ---
from corr_vars import Cohort
import polars as pl

# Create cohort (faster initialization)
cohort = Cohort(obs_level="icu_stay", sources={"cub_hdp": {"database": "db_hypercapnia_prepared"}})

# Polars-style analysis (faster execution)
summary = cohort.obs.filter(pl.col("age_on_admission") >= 18).group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("mean_age"),
    pl.col("inhospital_death").sum().alias("deaths")
])

# Time-series analysis (more efficient)
patient_trends = cohort.obsm["blood_sodium"].group_by("icu_stay_id").agg([
    pl.col("value").first().alias("first_sodium"),
    pl.col("value").last().alias("last_sodium"),
    pl.col("value").mean().alias("mean_sodium")
])
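
To see the gap on your own data, a rough timing harness (a sketch; legacy_cohort and polars_cohort stand for the two cohorts built above, which both snippets named cohort):

import time
import polars as pl

t0 = time.perf_counter()
_ = legacy_cohort.obs.groupby("sex")["age_on_admission"].mean()
t_legacy = time.perf_counter() - t0

t0 = time.perf_counter()
_ = polars_cohort.obs.group_by("sex").agg(pl.col("age_on_admission").mean())
t_polars = time.perf_counter() - t0

print(f"legacy: {t_legacy:.3f}s  polars: {t_polars:.3f}s")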

🚀 Ready to Migrate?

Start your migration with the Tutorials and Getting Started pages, and explore the Custom Variables Guide to learn modern Polars patterns!


API Reference#

class corr_vars.legacy_v1.Cohort(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#

Bases: Cohort

Legacy class to build a cohort in the CORR database. This version uses Pandas to access .obs and .obsm attributes. Please migrate to the new version of the cohort class at your earliest convenience.

Parameters:
  • conn_args (dict) – Dictionary of database credentials [remote_hostname (str), username (str)] (default: {}).

  • logger_args (dict) – Dictionary of logging configurations [level (int), file_path (str), file_mode (str), verbose_fmt (bool), colored_output (bool), formatted_numbers (bool)] (default: {}).

  • password_file (str | bool) – Path to the password file, or True if your file is in ~/password.txt (default: None).

  • database (Literal["db_hypercapnia_prepared", "db_corror_prepared"]) – Database to use (default: "db_hypercapnia_prepared").

  • extraction_end_date (str) – Deprecated as of Feb 6, 2025. This was used to set the end date for the extraction. Use filters instead (default: None).

  • obs_level (Literal["icu_stay", "hospital_stay", "procedure"]) – Observation level (default: "icu_stay").

  • project_vars (dict) – Dictionary with local variable definitions (default: {}).

  • merge_consecutive (bool) – Whether to merge consecutive ICU stays (default: True). Does not apply to any other obs_level.

  • load_default_vars (bool) – Whether to load the default variables (default: True).

  • filters (str) – Initial filters (must be a valid SQL WHERE clause for the it_ishmed_fall table) (default: "").

obs#

Static data for each observation. Contains one row per observation (e.g., ICU stay) with columns for static variables like demographics and outcomes.

Example

>>> cohort.obs
patient_id  case_id icu_stay_id            icu_admission        icu_discharge sex   ... inhospital_death
0  P001         C001    C001_1       2023-01-01 08:30:00  2023-01-03 12:00:00   M   ...  False
1  P001         C001    C001_2       2023-01-03 14:20:00  2023-01-05 16:30:00   M   ...  False
2  P002         C002    C002_1       2023-01-02 09:15:00  2023-01-04 10:30:00   F   ...  False
3  P003         C003    C003_1       2023-01-04 11:45:00  2023-01-07 13:20:00   F   ...  True
...
Type:

pd.DataFrame

obsm#

Dynamic data stored as a dictionary of DataFrames. Each DataFrame contains time-series data for a variable with columns:

  • recordtime: Timestamp of the measurement

  • value: Value of the measurement

  • recordtime_end: End time (only for duration-based variables like therapies)

  • description: Additional information (e.g., medication names)

Example

>>> cohort.obsm["blood_sodium"]
   icu_stay_id          recordtime  value
0  C001_1      2023-01-01 09:30:00   138
1  C001_1      2023-01-02 10:15:00   141
2  C001_2      2023-01-03 15:00:00   137
3  C002_1      2023-01-02 10:00:00   142
4  C003_1      2023-01-04 12:30:00   139
...
Type:

dict of pd.DataFrame

variables#

Dictionary of all variable objects in the cohort. This is used to keep track of variable metadata.

Type:

dict of Variable

Notes

  • For large cohorts, set load_default_vars=False to speed up the extraction. You can use pre-extracted cohorts as starting points and load them using Cohort.load().

  • Variables can be added using cohort.add_variable(). Static variables will be added to obs, dynamic variables to obsm.

  • filters also allows a special shorthand "_dx" to extract the hospital admissions of the last x months, useful for debugging/prototyping. For example, use "_d2" to extract every admission of the last 2 months.
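
For example, a quick prototyping cohort restricted to the last 2 months (a sketch; constructor arguments as documented above):

>>> cohort = Cohort(obs_level="icu_stay", password_file=True, filters="_d2")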

Examples

Create a new cohort:

>>> cohort = Cohort(obs_level="icu_stay",
...                 database="db_hypercapnia_prepared",
...                 load_default_vars=False,
...                 password_file=True)

Access static data:

>>> cohort.obs["age_on_admission"]  # Get age for all patients
>>> cohort.obs.loc[cohort.obs["sex"] == "M"]  # Filter for male patients

Access time-series data:

>>> cohort.obsm["blood_sodium"]  # Get all blood sodium measurements
>>> # Get blood sodium measurements for a specific observation
>>> cohort.obsm["blood_sodium"].loc[
...     cohort.obsm["blood_sodium"][cohort.primary_key] == "12345"
... ]
__init__(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#
Parameters:
  • conn_args (dict)

  • logger_args (dict)

  • password_file (Union[str, bool, None])

  • database (Literal['db_hypercapnia_prepared', 'db_corror_prepared'])

  • extraction_end_date (Optional[str])

  • obs_level (Literal['icu_stay', 'hospital_stay', 'procedure'])

  • project_vars (dict)

  • merge_consecutive (bool)

  • load_default_vars (bool)

  • filters (str)

add_inclusion(inclusion_list=[])[source]#

Add inclusion criteria to the cohort.

Parameters:

inclusion_list (list) –

List of inclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for inclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the inclusion step

  • operations_done (str): Detailed description of what this inclusion step does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the inclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.add_inclusion([
...    {
...        "variable": "age_on_admission",
...        "operation": ">= 18",
...        "label": "Adult patients",
...        "operations_done": "Excluded patients under 18 years old"
...    }
...  ])
>>> ct.create_flowchart()
add_exclusion(exclusion_list=[])[source]#

Add exclusion criteria to the cohort.

Parameters:

exclusion_list (list) –

List of exclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for exclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the exclusion step

  • operations_done (str): Detailed description of what this exclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the exclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.add_exclusion([
...    {
...        "variable": "any_rrt_icu",
...        "operation": "true",
...        "label": "No RRT",
...        "operations_done": "Excluded RRT before hypernatremia"
...    },
...    {
...        "variable": "any_dx_tbi",
...        "operation": "true",
...        "label": "No TBI",
...        "operations_done": "Excluded TBI before hypernatremia"
...    },
...    {
...        "variable": NativeStatic(
...            var_name="sodium_count",
...            select="!count value",
...            base_var="blood_sodium"),
...        "operation": "< 1",
...        "label": "Final cohort",
...        "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...        "tmin": cohort.t_eligible,
...        "tmax": "hospital_discharge"
...    }
...  ])
>>> ct.create_flowchart() # Plot the exclusion flowchart
save(filename, legacy=False)[source]#

Save the cohort to a .corr2 archive.

Parameters:
  • filename (str) – Path to the .corr2 archive.

  • legacy (bool)

classmethod load(filename, password_file=None)[source]#

Load a cohort from a pickle file. If this file was saved by a different user, you need to pass your database credentials to the function.

Parameters:
  • filename (str) – Path to the pickle file.

  • password_file (Optional[str]) – Path to the password file.

Returns:

A new Cohort object.

Return type:

Cohort

add_variable(variable, save_as=None, tmin=None, tmax=None)[source]#

Add a variable to the cohort.

You may specify tmin and tmax as a tuple (e.g. ("hospital_admission", "+1d")), in which case it will be relative to the hospital admission time of the patient.

Parameters:
  • variable (str | Variable | MultiSourceVariable) – Variable to add. Either a string with the variable name (from vars.json) or a Variable object.

  • save_as (Optional[str]) – Name of the column to save the variable as. Defaults to the variable name.

  • tmin (Union[str, tuple[str, str], None]) – Name of the column to use as tmin, or a tuple (see description).

  • tmax (Union[str, tuple[str, str], None]) – Name of the column to use as tmax, or a tuple (see description).

Returns:

The variable object.

Return type:

Variable

Examples

>>> cohort.add_variable("blood_sodium")
>>> cohort.add_variable(
...    variable="any_dx_covid_19",
...    tmin=("hospital_admission", "-1d"),
...    tmax=cohort.t_eligible
... )
>>> cohort.add_variable(
...    NativeStatic(
...        var_name="highest_hct_before_eligible",
...        select="!max value",
...        base_var='blood_hematokrit',
...        tmax=cohort.t_eligible
...    )
... )
>>> cohort.add_variable(
...    variable='any_med_glu',
...    save_as="glucose_prior_eligible",
...    tmin=(cohort.t_eligible, "-48h"),
...    tmax=cohort.t_eligible
... )
add_variable_definition(var_name, var_dict)[source]#

Add or update a local variable definition.

Parameters:
  • var_name (str) – Name of the variable.

  • var_dict (dict) – Dictionary containing the variable definition. Can be partial; missing fields will be inherited from the global definition.

Return type:

None

Examples

Add a completely new variable:

>>> cohort.add_variable_definition("my_new_var", {
...     "type": "native_dynamic",
...     "table": "it_ishmed_labor",
...     "where": "c_katalog_leistungtext LIKE '%new%'",
...     "value_dtype": "DOUBLE",
...     "cleaning": {"value": {"low": 100, "high": 150}}
... })

Partially override existing variable:

>>> cohort.add_variable_definition("blood_sodium", {
...     "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
... })
change_tracker(description, mode='include')[source]#

Return a context manager to group cohort edits and record a single ChangeTracker state on exit.

Example

>>> with cohort.change_tracker("Adults", mode="include") as track:
...     track.filter(pl.col("age_on_admission") >= 18)

Parameters:
  • description (str)

  • mode (Literal['include', 'exclude'])

debug_print()[source]#

Print debug information about the cohort. Please use this if you are creating a GitHub issue.

Return type:

None

Returns:

None

exclude(*args, **kwargs)[source]#

Add an exclusion criterion to the cohort. It is recommended to use Cohort.exclude_list() and add all of your exclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.exclude_list() before calling Cohort.exclude() to ensure that the exclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as the value for operation, it will be converted to "== True" or "== False", respectively.

Examples

>>> cohort.exclude(
...    variable="elix_total",
...    operation="> 20",
...    operations_done="Exclude patients with high Elixhauser score"
... )
exclude_list(exclusion_list=[])[source]#

Add exclusion criteria to the cohort.

Parameters:

exclusion_list (list) –

List of exclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for exclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the exclusion step

  • operations_done (str): Detailed description of what this exclusion does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the exclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.exclude_list([
...    {
...        "variable": "any_rrt_icu",
...        "operation": "true",
...        "label": "No RRT",
...        "operations_done": "Excluded RRT before hypernatremia"
...    },
...    {
...        "variable": "any_dx_tbi",
...        "operation": "true",
...        "label": "No TBI",
...        "operations_done": "Excluded TBI before hypernatremia"
...    },
...    {
...        "variable": NativeStatic(
...            var_name="sodium_count",
...            select="!count value",
...            base_var="blood_sodium"),
...        "operation": "< 1",
...        "label": "Final cohort",
...        "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...        "tmin": cohort.t_eligible,
...        "tmax": "hospital_discharge"
...    }
...  ])
>>> ct.create_flowchart() # Plot the exclusion flowchart
include(*args, **kwargs)[source]#

Add an inclusion criterion to the cohort. It is recommended to use Cohort.include_list() and add all of your inclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.

Warning

You should call Cohort.include_list() before calling Cohort.include() to ensure that the inclusion criteria are properly tracked.

Parameters:
  • variable (str | Variable)

  • operation (str)

  • label (str)

  • operations_done (str)

  • [Optional – tmin, tmax]

Returns:

Criterion is added to the cohort.

Return type:

None

Note

operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as the value for operation, it will be converted to "== True" or "== False", respectively.

Examples

>>> cohort.include(
...    variable="age_on_admission",
...    operation=">= 18",
...    label="Adult",
...    operations_done="Include only adult patients"
... )
include_list(inclusion_list=[])[source]#

Add inclusion criteria to the cohort.

Parameters:

inclusion_list (list) –

List of inclusion criteria. Each criterion is a dictionary containing:

  • variable (str | Variable): Variable to use for inclusion

  • operation (str): Operation to apply (e.g., "> 5", "== True")

  • label (str): Short label for the inclusion step

  • operations_done (str): Detailed description of what this inclusion step does

  • tmin (str, optional): Start time for variable extraction

  • tmax (str, optional): End time for variable extraction

Returns:

CohortTracker object, which can be used to plot the inclusion flowchart

Return type:

ct (CohortTracker)

Note

By default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.

Examples

>>> ct = cohort.include_list([
...    {
...        "variable": "age_on_admission",
...        "operation": ">= 18",
...        "label": "Adult patients",
...        "operations_done": "Excluded patients under 18 years old"
...    }
...  ])
>>> ct.create_flowchart()
load_default_vars()[source]#

Load the default variables defined in vars.json. It is recommended to use this after filtering your cohort for eligibility to speed up the process.

Returns:

Variables are loaded into the cohort.

Return type:

None

Examples

>>> # Load default variables for an ICU cohort
>>> cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
>>> # Apply filters first (faster)
>>> cohort.include_list([
...     {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
... ])
>>> # Then load default variables
>>> cohort.load_default_vars()
set_t_eligible(t_eligible, drop_ineligible=True)[source]#

Set the time anchor for eligibility. This can be referenced as cohort.t_eligible throughout the process and is required to add inclusion or exclusion criteria.

Parameters:
  • t_eligible (str) – Name of the column to use as t_eligible.

  • drop_ineligible (bool) – Whether to drop ineligible patients. Defaults to True.

Returns:

t_eligible is set.

Return type:

None

Examples

>>> # Add a suitable time-anchor variable
>>> cohort.add_variable(NativeStatic(
...    var_name="spo2_lt_90",
...    base_var="spo2",
...    select="!first recordtime",
...    where="value < 90",
... ))
>>> # Set the time anchor for eligibility
>>> cohort.set_t_eligible("spo2_lt_90")
set_t_outcome(t_outcome)[source]#

Set the time anchor for outcome. This can be referenced as cohort.t_outcome throughout the process and is recommended to specify for your study.

Parameters:

t_outcome (str) – Name of the column to use as t_outcome.

Returns:

t_outcome is set.

Return type:

None

Examples

>>> cohort.set_t_outcome("hospital_discharge")
property stata: DataFrame | None#

Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).

  • convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A Pandas DataFrame compatible with Stata if to_file is None.

Return type:

pd.DataFrame
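
A notebook sketch of the %%stata handoff described above (assumes pystata's Stata magic is configured; the cell layout and the -d flag follow pystata's documented interface):

# Cell 1: bind the converted frame to a top-level name
df = cohort.stata

# Cell 2: hand the frame to Stata via the pystata magic
%%stata -d df
summarize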

property tableone: TableOne#

Create a TableOne object for the cohort.

Parameters:
  • ignore_cols (list | str) – Column(s) to ignore.

  • groupby (str) – Column to group by.

  • filter (str) – Filter to apply to the data.

  • pval (bool) – Whether to calculate p-values.

  • normal_cols (list[str]) – Columns to treat as normally distributed.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Examples

>>> tableone = cohort.tableone
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")
to_csv(folder)[source]#

Save the cohort to CSV files.

Parameters:

folder (str) – Path to the folder where CSV files will be saved.

Return type:

None

Examples

>>> cohort.to_csv("output_data")
>>> # Creates:
>>> # output_data/_obs.csv
>>> # output_data/blood_sodium.csv
>>> # output_data/heart_rate.csv
>>> # ... (one file per variable)
to_stata(df=None, convert_dates=None, write_index=True, to_file=None)[source]#

Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).

  • convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A Pandas DataFrame compatible with Stata if to_file is None.

Return type:

pd.DataFrame

to_tableone(ignore_cols=[], groupby=None, filter=None, pval=False, normal_cols=[], **kwargs)[source]#

Create a TableOne object for the cohort.

Parameters:
  • ignore_cols (list | str) – Column(s) to ignore.

  • groupby (str) – Column to group by.

  • filter (str) – Filter to apply to the data.

  • pval (bool) – Whether to calculate p-values.

  • normal_cols (list[str]) – Columns to treat as normally distributed.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Examples

>>> tableone = cohort.to_tableone()
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")