Legacy Interface (Pandas-Based)#
Migration Recommended
For new projects, use the Polars-native interface (from corr_vars import Cohort) which is 2-10x faster and more memory-efficient.
This legacy interface is provided for backward compatibility only.
Overview#
The legacy interface provides a Pandas-based wrapper around the new Polars-native CORR-Vars interface. It was created to maintain backward compatibility with existing code that uses Pandas DataFrame methods, allowing researchers to continue using their existing analysis pipelines without immediate rewrites.
| Aspect | Legacy Interface | Polars-Native Interface |
|---|---|---|
| Import | `from corr_vars.legacy_v1 import Cohort` | `from corr_vars import Cohort` |
| Data Access | `pandas.DataFrame` wrappers | `polars.DataFrame` |
| Performance | Slower (conversion overhead) | 2-10x faster |
| Memory Usage | Higher (dual storage) | Lower (single storage) |
| Syntax | Familiar Pandas syntax | Modern Polars expressions |
| Recommendation | Existing code migration | New projects |
What is the Legacy Interface?#
Note
Architecture Overview
The legacy interface is a compatibility layer that bridges old and new: your code calls familiar Pandas methods (.loc[], .groupby(), .head()), the layer performs automatic conversion, and the Polars backend underneath stays fast and efficient.
Key Features:
Seamlessly converts between Polars (internal) and Pandas (user-facing) representations
Preserves familiar Pandas methods like .loc[], .iloc[], .groupby()
Uses the new Polars backend internally for improved performance and stability
Maintains compatibility with existing analysis scripts and workflows
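Conceptually, the layer behaves like a property that converts the internal representation to Pandas on every access. The following is a minimal standalone sketch of that pattern in plain Python and pandas; `LegacyView` and its attributes are illustrative, not the actual corr_vars implementation:

```python
import pandas as pd

class LegacyView:
    """Illustrative compatibility layer: data lives in an internal (fast)
    representation and is converted to pandas on every access."""

    def __init__(self, internal_rows):
        # Internal storage; in corr_vars this is a Polars DataFrame.
        self._internal = internal_rows
        self.conversions = 0  # count how often the conversion cost is paid

    @property
    def obs(self) -> pd.DataFrame:
        # Each attribute access triggers a fresh conversion to pandas,
        # which is where the legacy interface's overhead comes from.
        self.conversions += 1
        return pd.DataFrame(self._internal)

view = LegacyView({"icu_stay_id": ["C001_1", "C002_1"], "age_on_admission": [54, 71]})
adults = view.obs[view.obs["age_on_admission"] >= 18]  # two .obs accesses, two conversions
print(view.conversions)
```

Note how a single filter expression pays the conversion twice, once per `.obs` access; this is the "conversion overhead" discussed in the limitations below.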
Using the Legacy Interface#
# Legacy interface (Pandas access)
from corr_vars.legacy_v1 import Cohort
cohort = Cohort(
    obs_level="icu_stay",
    database="db_hypercapnia_prepared",
    password_file=True
)
# Pandas DataFrame access
print(type(cohort.obs))   # pandas.DataFrame wrapper
print(type(cohort.obsm))  # dict of pandas.DataFrame wrappers

# Polars-native interface (recommended)
from corr_vars import Cohort
cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "password_file": True}}
)
# Polars DataFrame access
print(type(cohort.obs))   # polars.DataFrame
print(type(cohort.obsm))  # dict of polars.DataFrame
Quick Start Guide
Step 1: Import the legacy interface
Step 2: Create your cohort with familiar parameters
Step 3: Use standard Pandas syntax for analysis
Pandas-Style Data Access:
# Static data access (exactly like Pandas)
print(cohort.obs.head()) # First 5 rows
print(cohort.obs.shape) # (n_rows, n_cols)
print(cohort.obs.columns.tolist()) # Column names
# Familiar Pandas indexing and filtering
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
males = cohort.obs.loc[cohort.obs["sex"] == "M"]
specific_patient = cohort.obs.iloc[0]
# Pandas aggregation methods
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})
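The same aggregation pattern can be tried on a toy frame without a database connection. This is plain pandas; the column names simply mirror the cohort examples above:

```python
import pandas as pd

# Toy stand-in for cohort.obs
obs = pd.DataFrame({
    "sex": ["M", "F", "M", "F"],
    "age_on_admission": [54, 71, 62, 49],
    "inhospital_death": [False, True, False, False],
})

# Dict-style agg produces a MultiIndex on the columns:
# ("age_on_admission", "mean"), ("age_on_admission", "std"), ("inhospital_death", "sum")
summary = obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum",
})
print(summary.loc["F", ("age_on_admission", "mean")])
```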
Time-Series Data Access:
# Add dynamic variable
cohort.add_variable("blood_sodium")
# Access time-series data (Pandas DataFrame)
sodium_data = cohort.obsm["blood_sodium"]
print(type(sodium_data)) # <class 'LegacyObsmDataframe'> (behaves like pd.DataFrame)
# Familiar Pandas time-series operations
patient_data = sodium_data[sodium_data["icu_stay_id"] == "12345"]
daily_avg = sodium_data.groupby(sodium_data["recordtime"].dt.date)["value"].mean()
# Standard Pandas methods work
print(sodium_data.describe())
print(sodium_data.value_counts())
Column Assignment (Limited):
# Direct column assignment works for obs
cohort.obs["bmi_category"] = cohort.obs["weight"] / (cohort.obs["height"] / 100) ** 2
cohort.obs["is_elderly"] = cohort.obs["age_on_admission"] > 65
# Note: obsm DataFrames are read-only to prevent data corruption
# sodium_data["new_col"] = 1 # This will raise NotImplementedError
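The read-only behaviour of obsm frames can be pictured as a pandas subclass that blocks item assignment. A minimal sketch of the idea; `ReadOnlyFrame` is illustrative, not the actual `LegacyObsmDataframe` class:

```python
import pandas as pd

class ReadOnlyFrame(pd.DataFrame):
    """Illustrative read-only pandas DataFrame: reads work normally, but
    column assignment is blocked so the pandas view cannot silently
    diverge from the internal copy of the data."""

    def __setitem__(self, key, value):
        raise NotImplementedError("obsm DataFrames are read-only")

sodium = ReadOnlyFrame({"icu_stay_id": ["C001_1"], "value": [138]})
print(sodium["value"].tolist())  # reads still work
try:
    sodium["new_col"] = 1
except NotImplementedError as err:
    print("blocked:", err)
```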
Pandas Method Compatibility#
Full Pandas Compatibility
The legacy interface supports most common Pandas DataFrame methods out of the box:
Inspection: .head(), .tail(), .info(), .describe(), .shape, .columns, .nunique(), .value_counts(), .isnull(), .dtypes
Selection: .loc[], .iloc[], .at[], .iat[], .query(), boolean indexing, .filter(), .select_dtypes()
Reshaping and joins: .groupby(), .pivot_table(), .merge(), .join(), .sort_values(), .drop(), .drop_duplicates()
Statistics: .mean(), .median(), .std(), .corr(), .agg(), .apply(), .transform()
Time series: .resample(), .rolling(), .expanding(), DateTime indexing
Plotting: direct plotting with matplotlib, Seaborn compatibility, works with existing viz code
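Because obsm frames behave like ordinary pandas objects, standard time-series tooling applies unchanged. A standalone illustration with plain pandas (the measurement values are made up and merely stand in for an obsm-style series such as blood sodium):

```python
import pandas as pd

# Hypothetical measurements at 12-hour intervals
values = pd.Series(
    [138, 141, 137, 142],
    index=pd.date_range("2023-01-01", periods=4, freq="12h"),
)

daily_mean = values.resample("D").mean()   # one value per calendar day
rolling = values.rolling(window=2).mean()  # smoothing across consecutive measurements
print(daily_mean.tolist())
```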
Limitations of the Legacy Interface#
Warning
Important Limitations to Consider
While the legacy interface maintains compatibility, it has several important limitations that may affect performance and functionality.
Detailed Limitation Analysis
Understanding these limitations will help you decide when to migrate to the Polars-native interface.
Performance Limitations#
# Legacy interface: Data conversion overhead
large_cohort = Cohort(obs_level="icu_stay", load_default_vars=True) # Slower
# Polars-native: Direct access, no conversion
from corr_vars import Cohort as PolarsCohort
fast_cohort = PolarsCohort(obs_level="icu_stay", load_default_vars=True) # Faster
Memory Overhead: Data is stored in Polars but converted to Pandas for access, requiring additional memory
Conversion Costs: Each access to .obs or .obsm triggers a Polars → Pandas conversion
Large Dataset Issues: Very large cohorts may hit memory limits during conversion
Slower Operations: Pandas operations are generally slower than equivalent Polars operations
Functional Limitations#
# 1. Limited obsm modification
cohort.obsm["blood_sodium"]["new_column"] = 1 # NotImplementedError
# 2. No direct polars access
# cohort._obs.filter(pl.col("age") > 18) # Not recommended, internal API
# 3. Some advanced Polars features unavailable
# No lazy evaluation, no expression API
Read-Only obsm: Time-series DataFrames (obsm) are read-only to prevent data corruption
No Polars Expression API: Cannot use Polars' powerful expression syntax
No Lazy Evaluation: Cannot benefit from Polars' lazy evaluation optimizations
Limited Parallel Processing: Pandas operations are less optimized for parallel execution
Data Type Limitations#
# Some Polars data types don't translate perfectly to Pandas
# May lose precision or type information in edge cases
print(cohort.obs.dtypes) # May show different types than native Polars
Type Conversion Issues: Some Polars types may not translate perfectly to Pandas
Precision Loss: Potential precision loss in numeric conversions
Missing Value Handling: Different null/missing value semantics between libraries
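One concrete example of the differing missing-value semantics, runnable with plain pandas: inserting a null into an integer column silently promotes it to float, whereas Polars (and pandas' newer nullable dtypes) keep the integer type with a proper null:

```python
import pandas as pd

# In classic pandas, a missing value in an integer column forces a dtype change:
ages = pd.Series([54, 71, None])
print(ages.dtype)  # the integers were promoted to float64, None became NaN

# Nullable dtypes keep the integer type with a true null instead:
ages_nullable = pd.Series([54, 71, None], dtype="Int64")
print(ages_nullable.dtype)
```

This kind of promotion is why a Polars → Pandas roundtrip can report different dtypes than the native Polars frame.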
Migration Guide: Legacy → Polars-Native#
Step 1: Update Imports
# Before (Legacy)
from corr_vars.legacy_v1 import Cohort
# After (Polars-native)
from corr_vars import Cohort
Step 2: Update Data Access Patterns
# Legacy Pandas syntax
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
male_patients = cohort.obs.loc[cohort.obs["sex"] == "M"]
# Polars-native equivalent
import polars as pl
adults = cohort.obs.filter(pl.col("age_on_admission") >= 18)
male_patients = cohort.obs.filter(pl.col("sex") == "M")
Step 3: Update Aggregations
# Legacy Pandas groupby
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})
# Polars-native equivalent
summary = cohort.obs.group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("age_mean"),
    pl.col("age_on_admission").std().alias("age_std"),
    pl.col("inhospital_death").sum().alias("deaths")
])
Step 4: Update Time-Series Operations
# Legacy Pandas time-series
patient_data = cohort.obsm["blood_sodium"][
    cohort.obsm["blood_sodium"]["icu_stay_id"] == "12345"
]
# Polars-native equivalent
patient_data = cohort.obsm["blood_sodium"].filter(
    pl.col("icu_stay_id") == "12345"
)
Benefits of Migration:
2-10x Performance Improvement for most operations
Lower Memory Usage (no conversion overhead)
Better Type Safety and error handling
Access to Modern Features like lazy evaluation and expression API
Future-Proof Code as legacy interface may be deprecated
When to Use Legacy vs. Polars-Native#
Use the legacy interface for:
- Migrating Existing Code
You have extensive Pandas-based analysis pipelines
- Team Training Time
Your team needs time to learn Polars syntax
- External Dependencies
Your code integrates with Pandas-only libraries
- Proof of Concepts
Quick prototyping with familiar syntax
- Temporary Migration Step
Use as a stepping stone to Polars-native

Use the Polars-native interface for:
- New Projects
Starting fresh analysis projects
- Performance-Critical Work
Working with large datasets or complex operations
- Memory-Constrained Environments
Limited memory environments
- Production Code
Building robust, long-term analysis pipelines
- Modern Features
Want to leverage advanced Polars capabilities
- Future-Proof Choice
Recommended for all new development
Decision Matrix

| Your Situation | Legacy Interface | Polars-Native |
|---|---|---|
| New research project | Not recommended | Recommended |
| Existing Pandas codebase | Good transition option | Migrate gradually |
| Large datasets (>1GB) | Performance issues | Optimal performance |
| Team learning curve | Familiar syntax | Investment in learning |
| Production deployment | Legacy, may deprecate | Future-proof |
Example: Side-by-Side Comparison#
Real-World Performance Example
Both examples below produce identical results, but with very different performance characteristics.
from corr_vars.legacy_v1 import Cohort
# Create cohort (slower initialization)
cohort = Cohort(obs_level="icu_stay", database="db_hypercapnia_prepared")
# Pandas-style analysis (familiar syntax)
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
summary = adults.groupby("sex").agg({
    "age_on_admission": "mean",
    "inhospital_death": "sum"
})
# Time-series analysis
sodium = cohort.obsm["blood_sodium"]
patient_trends = sodium.groupby("icu_stay_id")["value"].agg(["first", "last", "mean"])
from corr_vars import Cohort
import polars as pl
# Create cohort (faster initialization)
cohort = Cohort(obs_level="icu_stay", sources={"cub_hdp": {"database": "db_hypercapnia_prepared"}})
# Polars-style analysis (faster execution)
summary = cohort.obs.filter(pl.col("age_on_admission") >= 18).group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("mean_age"),
    pl.col("inhospital_death").sum().alias("deaths")
])
# Time-series analysis (more efficient)
patient_trends = cohort.obsm["blood_sodium"].group_by("icu_stay_id").agg([
    pl.col("value").first().alias("first_sodium"),
    pl.col("value").last().alias("last_sodium"),
    pl.col("value").mean().alias("mean_sodium")
])
Ready to Migrate?
Start your migration with the Tutorials and Getting Started pages, and explore the Custom Variables Guide to learn modern Polars patterns!
Related Documentation
Tutorials and Getting Started - Learn the Polars-native interface
Troubleshooting Guide - Common migration issues and solutions
Custom Variables Guide - Advanced variable creation patterns
Cohort - Full Polars-native Cohort documentation
API Reference#
- class corr_vars.legacy_v1.Cohort(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#
Bases:
Cohort
Legacy class to build a cohort in the CORR database. This version uses Pandas to access the .obs and .obsm attributes. Please migrate to the new version of the cohort class at your earliest convenience.
- Parameters:
conn_args (dict) – Dictionary of database credentials [remote_hostname (str), username (str)] (default: {}).
logger_args (dict) – Dictionary of logging configurations [level (int), file_path (str), file_mode (str), verbose_fmt (bool), colored_output (bool), formatted_numbers (bool)] (default: {}).
password_file (str | bool) – Path to the password file, or True if your file is in ~/password.txt (default: None).
database (Literal["db_hypercapnia_prepared", "db_corror_prepared"]) – Database to use (default: "db_hypercapnia_prepared").
extraction_end_date (str) – Deprecated as of Feb 6, 2025. This was used to set the end date for the extraction. Use filters instead (default: None).
obs_level (Literal["icu_stay", "hospital_stay", "procedure"]) – Observation level (default: "icu_stay").
project_vars (dict) – Dictionary with local variable definitions (default: {}).
merge_consecutive (bool) – Whether to merge consecutive ICU stays (default: True). Does not apply to any other obs_level.
load_default_vars (bool) – Whether to load the default variables (default: True).
filters (str) – Initial filters (must be a valid SQL WHERE clause for the it_ishmed_fall table) (default: "").
- obs#
Static data for each observation. Contains one row per observation (e.g., ICU stay) with columns for static variables like demographics and outcomes.
Example
>>> cohort.obs
  patient_id case_id icu_stay_id       icu_admission       icu_discharge sex ... inhospital_death
0       P001    C001      C001_1 2023-01-01 08:30:00 2023-01-03 12:00:00   M ...            False
1       P001    C001      C001_2 2023-01-03 14:20:00 2023-01-05 16:30:00   M ...            False
2       P002    C002      C002_1 2023-01-02 09:15:00 2023-01-04 10:30:00   F ...            False
3       P003    C003      C003_1 2023-01-04 11:45:00 2023-01-07 13:20:00   F ...             True
...
- Type:
pd.DataFrame
- obsm#
Dynamic data stored as dictionary of DataFrames. Each DataFrame contains time-series data for a variable with columns:
recordtime: Timestamp of the measurement
value: Value of the measurement
recordtime_end: End time (only for duration-based variables like therapies)
description: Additional information (e.g., medication names)
Example
>>> cohort.obsm["blood_sodium"]
  icu_stay_id          recordtime  value
0      C001_1 2023-01-01 09:30:00    138
1      C001_1 2023-01-02 10:15:00    141
2      C001_2 2023-01-03 15:00:00    137
3      C002_1 2023-01-02 10:00:00    142
4      C003_1 2023-01-04 12:30:00    139
...
- Type:
dict of pd.DataFrame
- variables#
Dictionary of all variable objects in the cohort. This is used to keep track of variable metadata.
- Type:
dict of Variable
Notes
For large cohorts, set load_default_vars=False to speed up the extraction. You can use pre-extracted cohorts as starting points and load them using Cohort.load().
Variables can be added using cohort.add_variable(). Static variables will be added to obs, dynamic variables to obsm.
filters also allows a special shorthand "_dx" to extract the hospital admissions of the last x months, useful for debugging/prototyping. For example, use "_d2" to extract every admission of the last 2 months.
Examples
Create a new cohort:
>>> cohort = Cohort(obs_level="icu_stay",
...                 database="db_hypercapnia_prepared",
...                 load_default_vars=False,
...                 password_file=True)
Access static data:
>>> cohort.obs["age_on_admission"]  # Get age for all patients
>>> cohort.obs.loc[cohort.obs["sex"] == "M"]  # Filter for male patients
Access time-series data:
>>> cohort.obsm["blood_sodium"]  # Get all blood sodium measurements
>>> # Get blood sodium measurements for a specific observation
>>> cohort.obsm["blood_sodium"].loc[
...     cohort.obsm["blood_sodium"][cohort.primary_key] == "12345"
... ]
- __init__(conn_args={}, logger_args={}, password_file=None, database='db_hypercapnia_prepared', extraction_end_date=None, obs_level='icu_stay', project_vars={}, merge_consecutive=True, load_default_vars=True, filters='')[source]#
- Parameters:
conn_args (dict)
logger_args (dict)
password_file (Union[str, bool, None])
database (Literal['db_hypercapnia_prepared', 'db_corror_prepared'])
extraction_end_date (Optional[str])
obs_level (Literal['icu_stay', 'hospital_stay', 'procedure'])
project_vars (dict)
merge_consecutive (bool)
load_default_vars (bool)
filters (str)
- add_inclusion(inclusion_list=[])[source]#
Add an inclusion criteria to the cohort.
- Parameters:
inclusion_list (list) – List of inclusion criteria. Each criterion is a dictionary with keys:
variable (str | Variable): Variable to use for inclusion
operation (str): Operation to apply (e.g., "> 5", "== True")
label (str): Short label for the inclusion step
operations_done (str): Detailed description of what this inclusion does
tmin (str, optional): Start time for variable extraction
tmax (str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot inclusion chart
- Return type:
ct (CohortTracker)
Note
Per default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortality biases. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.include_list([
...     {
...         "variable": "age_on_admission",
...         "operation": ">= 18",
...         "label": "Adult patients",
...         "operations_done": "Excluded patients under 18 years old"
...     }
... ])
>>> ct.create_flowchart()
- add_exclusion(exclusion_list=[])[source]#
Add an exclusion criteria to the cohort.
- Parameters:
exclusion_list (list) β
List of exclusion criteria. Each criterion is a dictionary containing:
variable(str | Variable): Variable to use for exclusionoperation(str): Operation to apply (e.g., β> 5β, β== Trueβ)label(str): Short label for the exclusion stepoperations_done(str): Detailed description of what this exclusion doestmin(str, optional): Start time for variable extractiontmax(str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot exclusion chart
- Return type:
ct (CohortTracker)
Note
Per default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortality biases. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.exclude_list([
...     {
...         "variable": "any_rrt_icu",
...         "operation": "true",
...         "label": "No RRT",
...         "operations_done": "Excluded RRT before hypernatremia"
...     },
...     {
...         "variable": "any_dx_tbi",
...         "operation": "true",
...         "label": "No TBI",
...         "operations_done": "Excluded TBI before hypernatremia"
...     },
...     {
...         "variable": NativeStatic(
...             var_name="sodium_count",
...             select="!count value",
...             base_var="blood_sodium"),
...         "operation": "< 1",
...         "label": "Final cohort",
...         "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...         "tmin": cohort.t_eligible,
...         "tmax": "hospital_discharge"
...     }
... ])
>>> ct.create_flowchart()  # Plot the exclusion flowchart
- save(filename, legacy=False)[source]#
Save the cohort to a .corr2 archive
- Parameters:
filename (str) – Path to the .corr2 archive.
legacy (bool)
- classmethod load(filename, password_file=None)[source]#
Load a cohort from a pickle file. If this file was saved by a different user, you need to pass your database credentials to the function.
- Parameters:
filename (str) – Path to the pickle file.
password_file (Optional[str]) – Path to the password file.
- Returns:
A new Cohort object.
- Return type:
Cohort
- add_variable(variable, save_as=None, tmin=None, tmax=None)[source]#
Add a variable to the cohort.
You may specify tmin and tmax as a tuple (e.g. ("hospital_admission", "+1d")), in which case it will be relative to the hospital admission time of the patient.
- Parameters:
variable (str | Variable | MultiSourceVariable) – Variable to add. Either a string with the variable name (from vars.json) or a Variable object.
save_as (Optional[str]) – Name of the column to save the variable as. Defaults to the variable name.
tmin (Union[str, tuple[str, str], None]) – Name of the column to use as tmin, or a tuple (see description).
tmax (Union[str, tuple[str, str], None]) – Name of the column to use as tmax, or a tuple (see description).
- Returns:
The variable object.
- Return type:
Variable
Examples
>>> cohort.add_variable("blood_sodium")
>>> cohort.add_variable(
...     variable="anx_dx_covid_19",
...     tmin=("hospital_admission", "-1d"),
...     tmax=cohort.t_eligible
... )
>>> cohort.add_variable(
...     NativeStatic(
...         var_name="highest_hct_before_eligible",
...         select="!max value",
...         base_var="blood_hematokrit",
...         tmax=cohort.t_eligible
...     )
... )
>>> cohort.add_variable(
...     variable="any_med_glu",
...     save_as="glucose_prior_eligible",
...     tmin=(cohort.t_eligible, "-48h"),
...     tmax=cohort.t_eligible
... )
- add_variable_definition(var_name, var_dict)[source]#
Add or update a local variable definition.
- Parameters:
var_name (str) β Name of the variable.
var_dict (dict) β Dictionary containing variable definition. Can be partial - missing fields will be inherited from global definition.
- Return type:
None
Examples
Add a completely new variable:
>>> cohort.add_variable_definition("my_new_var", {
...     "type": "native_dynamic",
...     "table": "it_ishmed_labor",
...     "where": "c_katalog_leistungtext LIKE '%new%'",
...     "value_dtype": "DOUBLE",
...     "cleaning": {"value": {"low": 100, "high": 150}}
... })
Partially override existing variable:
>>> cohort.add_variable_definition("blood_sodium", {
...     "where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
... })
- change_tracker(description, mode='include')[source]#
Return a context manager to group cohort edits and record a single ChangeTracker state on exit.
Example
>>> with cohort.change_tracker("Adults", mode="include") as track:
...     track.filter(pl.col("age_on_admission") >= 18)
- Parameters:
description (str)
mode (Literal['include', 'exclude'])
- debug_print()[source]#
Print debug information about the cohort. Please use this if you are creating a GitHub issue.
- Return type:
None
- Returns:
None
- exclude(*args, **kwargs)[source]#
Add an exclusion criterion to the cohort. It is recommended to use Cohort.exclude_list() and add all of your exclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.
Warning
You should call Cohort.exclude_list() before calling Cohort.exclude() to ensure that the exclusion criteria are properly tracked.
- Parameters:
variable (str | Variable)
operation (str)
label (str)
operations_done (str)
[Optional: tmin, tmax]
- Returns:
Criterion is added to the cohort.
- Return type:
None
Note
operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as a value for operation, it will be converted to "== True" or "== False", respectively.
Examples
>>> cohort.exclude(
...     variable="elix_total",
...     operation="> 20",
...     operations_done="Exclude patients with high Elixhauser score"
... )
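The note about operation strings can be illustrated standalone with plain pandas. Here `normalize_operation` is a hypothetical helper that mirrors the documented "true"/"false" shorthand; it is not the actual corr_vars implementation:

```python
import pandas as pd

def normalize_operation(operation: str) -> str:
    # Hypothetical helper: mimic the documented shorthand where
    # "true"/"True" and "false"/"False" become equality checks.
    if operation.lower() == "true":
        return "== True"
    if operation.lower() == "false":
        return "== False"
    return operation

obs = pd.DataFrame({"elix_total": [25, 10], "any_rrt_icu": [True, False]})

# A criterion like operation="> 20" becomes a DataFrame.query expression:
excluded = obs.query("elix_total " + normalize_operation("> 20"))
# ...and operation="true" expands to an equality check:
rrt = obs.query("any_rrt_icu " + normalize_operation("true"))
print(len(excluded), len(rrt))
```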
- exclude_list(exclusion_list=[])[source]#
Add an exclusion criteria to the cohort.
- Parameters:
exclusion_list (list) β
List of exclusion criteria. Each criterion is a dictionary containing:
variable(str | Variable): Variable to use for exclusionoperation(str): Operation to apply (e.g., β> 5β, β== Trueβ)label(str): Short label for the exclusion stepoperations_done(str): Detailed description of what this exclusion doestmin(str, optional): Start time for variable extractiontmax(str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot exclusion chart
- Return type:
ct (CohortTracker)
Note
Per default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortality biases. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.exclude_list([
...     {
...         "variable": "any_rrt_icu",
...         "operation": "true",
...         "label": "No RRT",
...         "operations_done": "Excluded RRT before hypernatremia"
...     },
...     {
...         "variable": "any_dx_tbi",
...         "operation": "true",
...         "label": "No TBI",
...         "operations_done": "Excluded TBI before hypernatremia"
...     },
...     {
...         "variable": NativeStatic(
...             var_name="sodium_count",
...             select="!count value",
...             base_var="blood_sodium"),
...         "operation": "< 1",
...         "label": "Final cohort",
...         "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...         "tmin": cohort.t_eligible,
...         "tmax": "hospital_discharge"
...     }
... ])
>>> ct.create_flowchart()  # Plot the exclusion flowchart
- include(*args, **kwargs)[source]#
Add an inclusion criterion to the cohort. It is recommended to use Cohort.include_list() and add all of your inclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.
Warning
You should call Cohort.include_list() before calling Cohort.include() to ensure that the inclusion criteria are properly tracked.
- Parameters:
variable (str | Variable)
operation (str)
label (str)
operations_done (str)
[Optional: tmin, tmax]
- Returns:
Criterion is added to the cohort.
- Return type:
None
Note
operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as a value for operation, it will be converted to "== True" or "== False", respectively.
Examples
>>> cohort.include(
...     variable="age_on_admission",
...     operation=">= 18",
...     label="Adult",
...     operations_done="Include only adult patients"
... )
- include_list(inclusion_list=[])[source]#
Add an inclusion criteria to the cohort.
- Parameters:
inclusion_list (list) – List of inclusion criteria. Each criterion is a dictionary with keys:
variable (str | Variable): Variable to use for inclusion
operation (str): Operation to apply (e.g., "> 5", "== True")
label (str): Short label for the inclusion step
operations_done (str): Detailed description of what this inclusion does
tmin (str, optional): Start time for variable extraction
tmax (str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot inclusion chart
- Return type:
ct (CohortTracker)
Note
Per default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortality biases. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.include_list([
...     {
...         "variable": "age_on_admission",
...         "operation": ">= 18",
...         "label": "Adult patients",
...         "operations_done": "Excluded patients under 18 years old"
...     }
... ])
>>> ct.create_flowchart()
- load_default_vars()[source]#
Load the default variables defined in vars.json. It is recommended to use this after filtering your cohort for eligibility to speed up the process.
- Returns:
Variables are loaded into the cohort.
- Return type:
None
Examples
>>> # Load default variables for an ICU cohort
>>> cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
>>> # Apply filters first (faster)
>>> cohort.include_list([
...     {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
... ])
>>> # Then load default variables
>>> cohort.load_default_vars()
- set_t_eligible(t_eligible, drop_ineligible=True)[source]#
Set the time anchor for eligibility. This can be referenced as cohort.t_eligible throughout the process and is required to add inclusion or exclusion criteria.
- Parameters:
t_eligible (str) – Name of the column to use as t_eligible.
drop_ineligible (bool) – Whether to drop ineligible patients. Defaults to True.
- Returns:
t_eligible is set.
- Return type:
None
Examples
>>> # Add a suitable time-anchor variable
>>> cohort.add_variable(NativeStatic(
...     var_name="spo2_lt_90",
...     base_var="spo2",
...     select="!first recordtime",
...     where="value < 90",
... ))
>>> # Set the time anchor for eligibility
>>> cohort.set_t_eligible("spo2_lt_90")
- set_t_outcome(t_outcome)[source]#
Set the time anchor for outcome. This can be referenced as cohort.t_outcome throughout the process and is recommended to specify for your study.
- Parameters:
t_outcome (str) β Name of the column to use as t_outcome.
- Returns:
t_outcome is set.
- Return type:
None
Examples
>>> cohort.set_t_outcome("hospital_discharge")
- property stata: DataFrame | None#
Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.
- Parameters:
df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).
convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.
write_index (bool) – Whether to write the index as a column.
to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.
- Returns:
A Pandas DataFrame compatible with Stata if to_file is None.
- Return type:
pd.DataFrame
- property tableone: TableOne#
Create a TableOne object for the cohort.
- Parameters:
ignore_cols (list | str) – Column(s) to ignore.
groupby (str) – Column to group by.
filter (str) – Filter to apply to the data.
pval (bool) – Whether to calculate p-values.
normal_cols (list[str]) – Columns to treat as normally distributed.
**kwargs – Additional arguments to pass to TableOne.
- Returns:
A TableOne object.
- Return type:
TableOne
Examples
>>> tableone = cohort.tableone
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")
- to_csv(folder)[source]#
Save the cohort to CSV files.
- Parameters:
folder (str) – Path to the folder where CSV files will be saved.
- Return type:
None
Examples
>>> cohort.to_csv("output_data")
>>> # Creates:
>>> # output_data/_obs.csv
>>> # output_data/blood_sodium.csv
>>> # output_data/heart_rate.csv
>>> # ... (one file per variable)
- to_stata(df=None, convert_dates=None, write_index=True, to_file=None)[source]#
Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.
- Parameters:
df (pd.DataFrame) – The DataFrame to be converted to Stata format. Defaults to the obs DataFrame if unspecified (default: None).
convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.
write_index (bool) – Whether to write the index as a column.
to_file (str | None) – Path to save as a .dta file. If left unspecified, the DataFrame will not be saved.
- Returns:
A Pandas DataFrame compatible with Stata if to_file is None.
- Return type:
pd.DataFrame
- to_tableone(ignore_cols=[], groupby=None, filter=None, pval=False, normal_cols=[], **kwargs)[source]#
Create a TableOne object for the cohort.
- Parameters:
ignore_cols (list | str) – Column(s) to ignore.
groupby (str) – Column to group by.
filter (str) – Filter to apply to the data.
pval (bool) – Whether to calculate p-values.
normal_cols (list[str]) – Columns to treat as normally distributed.
**kwargs – Additional arguments to pass to TableOne.
- Returns:
A TableOne object.
- Return type:
TableOne
Examples
>>> tableone = cohort.to_tableone()
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")