Cohort#
Cohort Workflow#
Initialization#
With this snippet, you can initialize a cohort object using the CUB-HDP data source.
cohort = Cohort(
obs_level="icu_stay", # One of: "patient", "hospital_stay", "icu_stay", "procedure"
load_default_vars=False, # Optional, defaults to True
sources={
"cub_hdp": {
"database": "db_hypercapnia_prepared",
"conn_args": {"password_file": True}, # True assumes ~/password.txt
"icu_stay": {"merge_consecutive": False},
}
},
)
Specify multiple data sources to combine cohorts.
cohort = Cohort(
obs_level="icu_stay", # One of: "patient", "hospital_stay", "icu_stay", "procedure"
load_default_vars=False, # Optional, defaults to True
sources={
"cub_hdp": {
"database": "db_hypercapnia_prepared",
"conn_args": {"password_file": True}, # True assumes ~/password.txt
"icu_stay": {"merge_consecutive": False},
},
"reprodicu": {
"path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"
}
},
)
Accessing Data#
After initialization, cohort data is available through two attributes:
- cohort.obs: a Polars DataFrame with one row per observation (static variables, demographics, outcomes).
- cohort.obsm: a dictionary of Polars DataFrames, one per dynamic (time-series) variable.
import polars as pl  # needed for the pl.col expressions below
# Inspect static data
print(cohort.obs)
cohort.obs.select("age_on_admission")
cohort.obs.filter(pl.col("sex") == "M")
# Access a dynamic variable
print(cohort.obsm["blood_sodium"])
# Filter time-series data for a specific observation
cohort.obsm["blood_sodium"].filter(
pl.col(cohort.primary_key) == "12345"
)
# Check which dynamic variables have been extracted
print(list(cohort.obsm.keys()))
You can assign a new DataFrame back to cohort.obs to add computed columns:
cohort.obs = cohort.obs.with_columns([
pl.col('first_sodium_recordtime').eq(pl.col('first_severe_hypernatremia_recordtime'))
.alias('idx_hypernatremia_was_on_admission')
])
cohort.obs = cohort.obs.with_columns([
pl.when(pl.col('idx_hypernatremia_was_on_admission'))
.then(pl.lit('community_acquired'))
.otherwise(pl.lit('hospital_acquired'))
.alias('hn_origin')
])
Adding Variables#
# Use pre-defined variables
cohort.add_variable("pf_ratio")
# Load a variable with custom time bounds
cohort.add_variable(
variable="anx_dx_covid_19",
tmin=("hospital_admission", "-1d"),
tmax=cohort.t_eligible
)
# Create a custom variable on the fly
cohort.add_variable(
NativeStatic(
var_name="median_sodium_before_hn",
select="!median value",
base_var="blood_sodium",
tmin="hospital_admission",
tmax=cohort.t_eligible
)
)
# Save a variable under a different name
cohort.add_variable(
variable="any_med_glu",
save_as="glucose_prior_eligible",
tmin=(cohort.t_eligible, "-48h"),
tmax=cohort.t_eligible,
)
For large cohorts it is faster to apply filters before loading default variables:
cohort = Cohort(obs_level="icu_stay", load_default_vars=False, ...)
# Filter first ...
cohort.include_list([
{"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
])
# ... then load default variables on the reduced cohort
cohort.load_default_vars()
Pass project_vars at construction time to register definitions before any
variables are loaded:
cohort = Cohort(
obs_level="icu_stay",
sources={"cub_hdp": {"database": "db_hypercapnia_prepared",
"conn_args": {"password_file": True}}},
project_vars={
"my_new_var": {
"type": "native_dynamic",
"table": "it_ishmed_labor",
"where": "c_katalog_leistungtext LIKE '%new%'",
"value_dtype": "DOUBLE",
"cleaning": {"value": {"low": 100, "high": 150}},
},
"blood_sodium": {
"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
},
},
)
To add or override a variable definition at runtime without editing vars.json,
use add_variable_definition(). Use get_variable_definition() to inspect the
active definition (merged local override + global defaults) per source:
# Register a brand-new variable
cohort.add_variable_definition("my_new_var", {
"type": "native_dynamic",
"table": "it_ishmed_labor",
"where": "c_katalog_leistungtext LIKE '%new%'",
"value_dtype": "DOUBLE",
"cleaning": {"value": {"low": 100, "high": 150}},
})
# Partially override an existing variable (merges with global definition)
cohort.add_variable_definition("blood_sodium", {
"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"
})
# Inspect the resolved definition per data source
defn = cohort.get_variable_definition("blood_sodium")
# {"cub_hdp": {"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"}}
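The merge of a partial local override with a global definition can be pictured as a recursive dictionary update. The sketch below is only an illustration in plain Python: `merge_definition`, the contents of `global_def`, and the recursive-merge rule are assumptions made for this example, not the library's actual implementation.

```python
from copy import deepcopy
from typing import Any


def merge_definition(global_def: dict[str, Any], local_def: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which keys from local_def override global_def.

    Nested dicts are merged key by key; all other values are replaced.
    Hypothetical helper for illustration only.
    """
    merged = deepcopy(global_def)
    for key, value in local_def.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_definition(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Made-up global definition standing in for an entry in vars.json
global_def = {
    "type": "native_dynamic",
    "table": "it_ishmed_labor",
    "where": "c_katalog_leistungtext LIKE '%sodium%'",
    "cleaning": {"value": {"low": 100, "high": 150}},
}
# Partial local override: only the "where" clause changes
local_def = {"where": "c_katalog_leistungtext LIKE '%custom_sodium%'"}

resolved = merge_definition(global_def, local_def)
print(resolved["where"])  # taken from the local override
print(resolved["table"])  # inherited from the global definition
```

Under this assumed rule, fields missing from the override (such as `table` or `cleaning`) are inherited unchanged, which matches the "partial override" behaviour described above.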
Time Anchors#
t_eligible marks the earliest timepoint a patient is eligible for the study;
t_outcome marks the primary outcome timepoint. Both default to observation-level
column names (e.g. icu_admission / icu_discharge) but should be overridden
for most study designs.
# Extract the first SpO2 < 90 % event as the eligibility anchor
cohort.add_variable(NativeStatic(
var_name="spo2_lt_90",
base_var="spo2",
select="!first recordtime",
where="value < 90",
))
cohort.set_t_eligible("spo2_lt_90") # also drops rows where the anchor is null
# Set the outcome anchor (no rows are dropped)
cohort.set_t_outcome("hospital_discharge")
# t_eligible / t_outcome can now be used as tmax/tmin elsewhere
cohort.add_variable("blood_sodium", tmax=cohort.t_eligible)
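To make the eligibility anchor more concrete, the following plain-Python sketch mimics what a "first recordtime where value < 90" anchor and the subsequent null-drop might compute. It does not use the library: `first_recordtime` and the sample data are hypothetical, and lists of dicts stand in for Polars DataFrames.

```python
from datetime import datetime


def first_recordtime(measurements, key, predicate):
    """Return {key value: earliest recordtime} over measurements matching predicate."""
    first = {}
    for m in measurements:
        if predicate(m["value"]):
            k = m[key]
            if k not in first or m["recordtime"] < first[k]:
                first[k] = m["recordtime"]
    return first


# Made-up SpO2 measurements for two ICU stays
spo2 = [
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 9, 0), "value": 95},
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 12, 0), "value": 88},
    {"icu_stay_id": "C001_1", "recordtime": datetime(2023, 1, 1, 10, 0), "value": 89},
    {"icu_stay_id": "C002_1", "recordtime": datetime(2023, 1, 2, 10, 0), "value": 97},
]
obs = [{"icu_stay_id": "C001_1"}, {"icu_stay_id": "C002_1"}]

anchors = first_recordtime(spo2, "icu_stay_id", lambda v: v < 90)

# Attach the anchor to each observation; stays without a matching event get None
obs = [{**row, "spo2_lt_90": anchors.get(row["icu_stay_id"])} for row in obs]

# Drop observations whose anchor is null, as described for set_t_eligible above
eligible = [row for row in obs if row["spo2_lt_90"] is not None]
print(eligible)
```

In this example only the first stay remains, because the second stay never records an SpO2 below 90.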
Inclusion/Exclusion#
# Add multiple inclusion criteria at once
cohort.include_list([
{
"variable": "age_on_admission",
"operation": ">= 18",
"label": "Adult patients"
},
{
"variable": "icu_length_of_stay",
"operation": "> 2",
"label": "ICU stay > 2 days"
}
])
# Add exclusion criteria at once
cohort.exclude_list([
{
"variable": "any_dx_covid_19",
"operation": "== True",
"label": "Exclude COVID-19 patients"
}
])
Use include() / exclude() to apply a single criterion at a later stage:
cohort.include(
variable="age_on_admission",
operation=">= 18",
label="Adult",
operations_done="Include only adult patients",
)
cohort.exclude(
variable="elix_total",
operation="> 20",
operations_done="Exclude patients with high Elixhauser score",
)
For grouped criteria that should appear as a single step in the flowchart, use the
change_tracker() context manager:
with cohort.change_tracker("Adults", mode="include") as track:
track.filter(pl.col("age_on_admission") >= 18)
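The bookkeeping behind an inclusion/exclusion flowchart can be sketched without the library: apply each criterion in order and record how many observations remain. The helper `apply_criteria`, the use of Python callables instead of operation strings, and the sample rows are all assumptions made for this illustration.

```python
def apply_criteria(rows, criteria):
    """Apply criteria in order and record the row count after each step.

    Each criterion is a (label, mode, predicate) tuple, where mode is
    'include' (keep matching rows) or 'exclude' (drop matching rows).
    Hypothetical helper, not part of the library.
    """
    steps = [("Initial cohort", len(rows))]
    for label, mode, predicate in criteria:
        if mode == "include":
            rows = [r for r in rows if predicate(r)]
        elif mode == "exclude":
            rows = [r for r in rows if not predicate(r)]
        else:
            raise ValueError(f"Unknown mode: {mode!r}")
        steps.append((label, len(rows)))
    return rows, steps


# Made-up observations
rows = [
    {"icu_stay_id": "C001_1", "age_on_admission": 67, "any_dx_covid_19": False},
    {"icu_stay_id": "C002_1", "age_on_admission": 15, "any_dx_covid_19": False},
    {"icu_stay_id": "C003_1", "age_on_admission": 54, "any_dx_covid_19": True},
    {"icu_stay_id": "C004_1", "age_on_admission": 80, "any_dx_covid_19": False},
]
criteria = [
    ("Adult patients", "include", lambda r: r["age_on_admission"] >= 18),
    ("Exclude COVID-19 patients", "exclude", lambda r: r["any_dx_covid_19"]),
]

final_rows, steps = apply_criteria(rows, criteria)
for label, remaining in steps:
    print(f"{label}: {remaining} observations")
```

Each entry in `steps` corresponds to one box in a flowchart; grouping several filters under one label, as the context manager above does, would simply record a single step for the whole group.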
Exploration#
# Create a TableOne summary
tableone = cohort.to_tableone(ignore_cols=["icu_id"])
print(tableone)
# Grouped TableOne (e.g. by sex)
tableone = cohort.to_tableone(groupby="sex", pval=True)
tableone.to_csv("tableone_sex.csv")
# Interactive Jupyter widget for obs
cohort.widget # renders the full obs DataFrame
cohort.to_widget("age_on_admission", "sex") # select specific columns
# Interactive widget for dynamic variables
cohort.obsm.widget # renders all obsm variables
cohort.obsm.to_widget("blood_sodium") # renders a single variable
# Inclusion / exclusion flowchart
cohort.figureone # returns a graphviz.Digraph object
cohort.to_figureone() # returns a graphviz.Digraph object
# Print debug information (helpful when filing a GitHub issue)
cohort.debug_print()
Data Export#
# Save to CORR archive (recommended)
cohort.save("my_cohort.corr3")
# Load from file (.corr2 and .corr3 supported)
cohort = Cohort.load("my_cohort.corr3")
# Export obs + all obsm variables as individual CSV files
cohort.to_csv("path/to/output_folder")
# Export obs + all obsm variables as individual Parquet files
cohort.to_parquet("path/to/output_folder")
# Convert obs to a Stata-compatible pandas DataFrame
stata_df = cohort.to_stata()
# Save obs directly as a .dta Stata file
cohort.to_stata(to_file="path/to/output_folder/my_cohort.dta")
Class Reference#
- class corr_vars.core.cohort.Cohort(obs_level='icu_stay', sources={'cub_hdp': {'conn_args': {'password_file': True}, 'database': 'db_hypercapnia_prepared', 'icu_stay': {'merge_consecutive': True}}}, project_vars={}, load_default_vars=True, logger_args={})[source]#
Bases: object
Class to build a cohort in the CORR database.
- Parameters:
obs_level (Literal["patient", "hospital_stay", "icu_stay", "procedure"]) –
Observation level (default: "icu_stay").
- "patient" gives one row per patient (primary key: patient_id)
- "hospital_stay" per hospitalisation (case_id)
- "icu_stay" per ICU admission (icu_stay_id)
- "procedure" per surgical procedure (procedure_id)
sources (dict[str, dict]) –
Dictionary of data sources to use for data extraction. Available options are "cub_hdp", "cub_hdp_dummy", "reprodicu". Source configurations:
- cub_hdp: database, conn_args (password_file, remote_hostname, username), icu_stay (merge_consecutive, extra_columns), filter (extraction_start_date, extraction_end_date, additional_filters, include_adults_only, exclude_dhzb, exclude_brain_death)
- cub_hdp_dummy: size, seed
- reprodicu: path, exclude_datasets, include_datasets
Note: reprodicu does not yet implement variable extraction, only cohort data.
project_vars (dict) – Dictionary with local variable definitions (default: {}).
load_default_vars (bool) – Whether to load the default variables (default: True).
logger_args (dict) – Dictionary of Logging configurations [level (int), file_path (str), file_mode (str), verbose_fmt (bool), colored_output (bool), formatted_numbers (bool)] (default: {}).
- obs#
Static data for each observation. Contains one row per observation (e.g., ICU stay) with columns for static variables like demographics and outcomes.
Example
>>> cohort.obs
   patient_id  case_id  icu_stay_id  icu_admission        icu_discharge        sex  ...  inhospital_death
0  P001        C001     C001_1       2023-01-01 08:30:00  2023-01-03 12:00:00  M    ...  False
1  P001        C001     C001_2       2023-01-03 14:20:00  2023-01-05 16:30:00  M    ...  False
2  P002        C002     C002_1       2023-01-02 09:15:00  2023-01-04 10:30:00  F    ...  False
3  P003        C003     C003_1       2023-01-04 11:45:00  2023-01-07 13:20:00  F    ...  True
...
- Type:
pl.DataFrame
- obsm#
Dynamic data stored as dictionary of DataFrames. Each DataFrame contains time-series data for a variable with columns:
recordtime: Timestamp of the measurement
value: Value of the measurement
recordtime_end: End time (only for duration-based variables like therapies)
description: Additional information (e.g., medication names)
Example
>>> cohort.obsm["blood_sodium"]
   icu_stay_id  recordtime           value
0  C001_1       2023-01-01 09:30:00  138
1  C001_1       2023-01-02 10:15:00  141
2  C001_2       2023-01-03 15:00:00  137
3  C002_1       2023-01-02 10:00:00  142
4  C003_1       2023-01-04 12:30:00  139
...
- Type:
dict of pl.DataFrame
Notes
- For large cohorts, set load_default_vars=False to speed up the extraction. You can use pre-extracted cohorts as starting points and load them using Cohort.load().
- Variables can be added using cohort.add_variable(). Static variables will be added to obs, dynamic variables to obsm.
- For quick prototyping, use sources["cub_hdp"]["filter"]["additional_filters"] with a "_dx" shorthand (e.g. "_d2" for the last 2 months), or use "cub_hdp_dummy" for fully synthetic data.
Examples
Create a new cohort:
>>> cohort = Cohort(
...     obs_level="icu_stay",
...     load_default_vars=False,
...     sources={
...         "cub_hdp": {
...             "database": "db_hypercapnia_prepared",
...             "conn_args": {"password_file": True},
...             "icu_stay": {"merge_consecutive": False}},
...         "reprodicu": {
...             "path": "/data02/projects/reprodicubility/reprodICU/reprodICU_files"}
...     })
Access static data:
>>> cohort.obs.select("age_on_admission")  # Get age for all patients
>>> cohort.obs.filter(pl.col("sex") == "M")  # Filter for male patients
Access time-series data:
>>> cohort.obsm["blood_sodium"]  # Get all blood sodium measurements
>>> # Get blood sodium measurements for a specific observation
>>> cohort.obsm["blood_sodium"].filter(pl.col(cohort.primary_key) == "12345")
- constant_vars: list[str]#
- obs_level: ObsLevel#
- primary_key: str#
- t_min: str#
- t_max: str#
- t_eligible: str#
- t_outcome: str#
- logger_args: dict[str, Any]#
- sources: SourceDict#
- project_vars: dict[str, dict[str, Any]]#
- tmpdir_manager: Final[TemporaryDirectoryManager]#
- load_default_vars(tmin=None, tmax=None)[source]#
Load the default variables defined in vars.json. It is recommended to use this after filtering your cohort for eligibility to speed up the process.
- Returns:
Variables are loaded into the cohort.
- Return type:
None
Examples
>>> # Load default variables for an ICU cohort
>>> cohort = Cohort(obs_level="icu_stay", load_default_vars=False)
>>> # Apply filters first (faster)
>>> cohort.include_list([
...     {"variable": "age_on_admission", "operation": ">= 18", "label": "Adults"}
... ])
>>> # Then load default variables
>>> cohort.load_default_vars()
- Parameters:
tmin (str | tuple[str, str] | None)
tmax (str | tuple[str, str] | None)
- add_variable(variable, save_as=None, tmin=None, tmax=None)[source]#
Add a variable to the cohort.
You may specify tmin and tmax either as a column name or as a tuple of (column name, offset). For a tuple such as ("hospital_admission", "+1d"), the bound is computed relative to that column, here one day after the patient's hospital admission.
- Parameters:
variable (str | VariableProtocol | MultiSourceVariable) – Variable to add. Either a string with the variable name (from vars.json) or a Variable object.
save_as (str | None) – Name of the column to save the variable as. Defaults to variable name.
tmin (str | tuple[str, str] | None) – Name of the column to use as tmin or tuple (see description).
tmax (str | tuple[str, str] | None) – Name of the column to use as tmax or tuple (see description).
- Returns:
The variable object.
- Return type:
Examples
>>> cohort.add_variable("blood_sodium")
>>> cohort.add_variable(
...     variable="anx_dx_covid_19",
...     tmin=("hospital_admission", "-1d"),
...     tmax=cohort.t_eligible
... )
>>> cohort.add_variable(
...     NativeStatic(
...         var_name="highest_hct_before_eligible",
...         select="!max value",
...         base_var='blood_hematokrit',
...         tmax=cohort.t_eligible
...     )
... )
>>> cohort.add_variable(
...     variable='any_med_glu',
...     save_as="glucose_prior_eligible",
...     tmin=(cohort.t_eligible, "-48h"),
...     tmax=cohort.t_eligible
... )
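A tuple bound such as ("hospital_admission", "-1d") pairs an anchor column with an offset. The sketch below shows one way such a bound could be resolved for a single observation using only the standard library. The helpers `parse_offset` and `resolve_bound`, and the set of supported units (d, h, m), are assumptions for illustration; the library's own parsing may accept other formats.

```python
import re
from datetime import datetime, timedelta

# Units assumed for this sketch only
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_offset(offset: str) -> timedelta:
    """Parse strings such as '-1d', '+1d', or '-48h' into a timedelta."""
    match = re.fullmatch(r"([+-]?)(\d+)([dhm])", offset.strip())
    if match is None:
        raise ValueError(f"Unsupported offset: {offset!r}")
    sign, amount, unit = match.groups()
    delta = timedelta(**{_UNITS[unit]: int(amount)})
    return -delta if sign == "-" else delta


def resolve_bound(row: dict, bound) -> datetime:
    """Resolve a column name or an (anchor column, offset) tuple for one row."""
    if isinstance(bound, tuple):
        anchor, offset = bound
        return row[anchor] + parse_offset(offset)
    return row[bound]


row = {"hospital_admission": datetime(2023, 1, 1, 8, 30)}
print(resolve_bound(row, ("hospital_admission", "-1d")))  # 2022-12-31 08:30:00
print(resolve_bound(row, "hospital_admission"))           # 2023-01-01 08:30:00
```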
- load_variable(variable, tmin=None, tmax=None, include_sources=None)[source]#
- Parameters:
variable (str | tuple[str, TimeWindow] | VariableProtocol | MultiSourceVariable)
tmin (str | tuple[str, str] | None)
tmax (str | tuple[str, str] | None)
include_sources (Iterable[str] | None)
- Return type:
MultiSourceVariable
- set_t_eligible(t_eligible, drop_ineligible=True)[source]#
Set the time anchor for eligibility. This can be referenced as cohort.t_eligible throughout the process and is required to add inclusion or exclusion criteria.
- Parameters:
t_eligible (str) – Name of the column to use as t_eligible.
drop_ineligible (bool) – Whether to drop ineligible patients. Defaults to True.
- Returns:
t_eligible is set.
- Return type:
None
Examples
>>> # Add a suitable time-anchor variable
>>> cohort.add_variable(NativeStatic(
...     var_name="spo2_lt_90",
...     base_var="spo2",
...     select="!first recordtime",
...     where="value < 90",
... ))
>>> # Set the time anchor for eligibility
>>> cohort.set_t_eligible("spo2_lt_90")
- set_t_outcome(t_outcome)[source]#
Set the time anchor for outcome. This can be referenced as cohort.t_outcome throughout the process and is recommended to specify for your study.
- Parameters:
t_outcome (str) – Name of the column to use as t_outcome.
- Returns:
t_outcome is set.
- Return type:
None
Examples
>>> cohort.set_t_outcome("hospital_discharge")
- change_tracker(description, group=None, mode='include')[source]#
Return a context manager to group cohort edits and record a single ChangeTracker state on exit.
Example
>>> with cohort.change_tracker("Adults", mode="include") as track:
...     track.filter(pl.col("age_on_admission") >= 18)
- Parameters:
description (str)
group (str | None)
mode (Literal['include', 'exclude'])
- Return type:
ChangeTrackerContext
- include(*args, **kwargs)[source]#
Add an inclusion criterion to the cohort. It is recommended to use Cohort.include_list() and add all of your inclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.
Warning
You should call Cohort.include_list() before calling Cohort.include() to ensure that the inclusion criteria are properly tracked.
- Parameters:
variable (str | Variable)
operation (str)
label (str)
operations_done (str)
[Optional – tmin, tmax]
- Returns:
Criterion is added to the cohort.
- Return type:
None
Note
operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as a value for operation, it will be converted to "== True" or "== False", respectively.
Examples
>>> cohort.include(
...     variable="age_on_admission",
...     operation=">= 18",
...     label="Adult",
...     operations_done="Include only adult patients"
... )
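The boolean shorthand described in the note can be illustrated with a small, library-independent sketch. `normalize_operation` and `build_query` are hypothetical names used only here; the sketch reproduces the documented conversion of "true"/"True" and "false"/"False" and makes no claim about how the library implements it.

```python
# Shorthands named in the note above, mapped to explicit comparisons
_BOOLEAN_SHORTHANDS = {
    "true": "== True",
    "True": "== True",
    "false": "== False",
    "False": "== False",
}


def normalize_operation(operation: str) -> str:
    """Expand bare boolean shorthands; leave every other operation unchanged."""
    return _BOOLEAN_SHORTHANDS.get(operation.strip(), operation)


def build_query(variable: str, operation: str) -> str:
    """Combine a variable name and an operation into a query-style expression."""
    return f"{variable} {normalize_operation(operation)}"


print(build_query("any_dx_covid_19", "true"))   # any_dx_covid_19 == True
print(build_query("age_on_admission", ">= 18"))  # age_on_admission >= 18
```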
- exclude(*args, **kwargs)[source]#
Add an exclusion criterion to the cohort. It is recommended to use Cohort.exclude_list() and add all of your exclusion criteria at once. However, if you need to specify criteria at a later stage, you can use this method.
Warning
You should call Cohort.exclude_list() before calling Cohort.exclude() to ensure that the exclusion criteria are properly tracked.
- Parameters:
variable (str | Variable)
operation (str)
label (str)
operations_done (str)
[Optional – tmin, tmax]
- Returns:
Criterion is added to the cohort.
- Return type:
None
Note
operation is passed to pandas.DataFrame.query, which uses a slightly modified Python syntax. Also, if you specify "true"/"True" or "false"/"False" as a value for operation, it will be converted to "== True" or "== False", respectively.
Examples
>>> cohort.exclude(
...     variable="elix_total",
...     operation="> 20",
...     operations_done="Exclude patients with high Elixhauser score"
... )
- include_list(inclusion_list=[])[source]#
Add a list of inclusion criteria to the cohort.
- Parameters:
inclusion_list (list) – List of inclusion criteria. Each criterion is a dictionary containing:
- variable (str | Variable): Variable to use for inclusion
- operation (str): Operation to apply (e.g., "> 5", "== True")
- label (str): Short label for the inclusion step
- operations_done (str): Detailed description of what this inclusion does
- tmin (str, optional): Start time for variable extraction
- tmax (str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot inclusion chart
- Return type:
ct (CohortTracker)
Note
By default, all inclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.include_list([
...     {
...         "variable": "age_on_admission",
...         "operation": ">= 18",
...         "label": "Adult patients",
...         "operations_done": "Excluded patients under 18 years old"
...     }
... ])
>>> ct.create_flowchart()
- exclude_list(exclusion_list=[])[source]#
Add a list of exclusion criteria to the cohort.
- Parameters:
exclusion_list (list) –
List of exclusion criteria. Each criterion is a dictionary containing:
- variable (str | Variable): Variable to use for exclusion
- operation (str): Operation to apply (e.g., "> 5", "== True")
- label (str): Short label for the exclusion step
- operations_done (str): Detailed description of what this exclusion does
- tmin (str, optional): Start time for variable extraction
- tmax (str, optional): End time for variable extraction
- Returns:
CohortTracker object, can be used to plot exclusion chart
- Return type:
ct (CohortTracker)
Note
By default, all exclusion criteria are applied from tmin=cohort.tmin to tmax=cohort.t_eligible. This is recommended to avoid introducing immortal time bias. However, in some cases you might want to set custom time bounds.
Examples
>>> ct = cohort.exclude_list([
...     {
...         "variable": "any_rrt_icu",
...         "operation": "true",
...         "label": "No RRT",
...         "operations_done": "Excluded RRT before hypernatremia"
...     },
...     {
...         "variable": "any_dx_tbi",
...         "operation": "true",
...         "label": "No TBI",
...         "operations_done": "Excluded TBI before hypernatremia"
...     },
...     {
...         "variable": NativeStatic(
...             var_name="sodium_count",
...             select="!count value",
...             base_var="blood_sodium"),
...         "operation": "< 1",
...         "label": "Final cohort",
...         "operations_done": "Excluded cases with less than 1 sodium measurement after hypernatremia",
...         "tmin": cohort.t_eligible,
...         "tmax": "hospital_discharge"
...     }
... ])
>>> ct.create_flowchart()  # Plot the exclusion flowchart
- add_variable_definition(var_name, var_dict)[source]#
Add or update a local variable definition.
- Parameters:
var_name (str) – Name of the variable.
var_dict (dict) – Dictionary containing variable definition. Can be partial - missing fields will be inherited from global definition.
- Return type:
None
Examples
Add a completely new variable:
>>> cohort.add_variable_definition("my_new_var", { ... "type": "native_dynamic", ... "table": "it_ishmed_labor", ... "where": "c_katalog_leistungtext LIKE '%new%'", ... "value_dtype": "DOUBLE", ... "cleaning": {"value": {"low": 100, "high": 150}} ... })
Partially override existing variable:
>>> cohort.add_variable_definition("blood_sodium", { ... "where": "c_katalog_leistungtext LIKE '%custom_sodium%'" ... })
- get_variable_definition(var_name)[source]#
Get the variable definition for a given variable name.
- Parameters:
var_name (str) – Name of the variable to get variable definitions for.
- Returns:
Dictionary of variable definitions per source.
- Return type:
definition (dict[str, dict[str, Any]])
- to_files(folder, ext)[source]#
Convenience method to save the cohort to various file formats
- Parameters:
folder (str | PathLike[str]) – Path to the folder where the files will be saved.
ext (CohortExportFormats) – File extension to use for saving the files.
- Return type:
None
- to_csv(folder)[source]#
Save the cohort to CSV files.
- Parameters:
folder (str | PathLike[str]) – Path to the folder where CSV files will be saved.
- Return type:
None
Examples
>>> cohort.to_csv("output_data")
>>> # Creates:
>>> # output_data/_obs.csv
>>> # output_data/blood_sodium.csv
>>> # output_data/heart_rate.csv
>>> # ... (one file per variable)
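The one-file-per-variable layout can be reproduced with the standard library alone. The sketch below is not the library's implementation: `export_csv` is a hypothetical stand-in for to_csv(), and lists of dicts replace the Polars DataFrames held in obs and obsm.

```python
import csv
import tempfile
from pathlib import Path


def export_csv(folder, obs, obsm):
    """Write obs to _obs.csv and each obsm entry to <var_name>.csv.

    Returns the sorted list of file names that were written.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    tables = {"_obs": obs, **obsm}
    written = []
    for name, rows in tables.items():
        path = folder / f"{name}.csv"
        # Use the keys of the first row as the header; empty tables get no columns
        fieldnames = list(rows[0].keys()) if rows else []
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(path.name)
    return sorted(written)


# Made-up data
obs = [{"icu_stay_id": "C001_1", "sex": "M"}]
obsm = {
    "blood_sodium": [
        {"icu_stay_id": "C001_1", "recordtime": "2023-01-01 09:30:00", "value": 138}
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    files = export_csv(tmp, obs, obsm)

print(files)  # ['_obs.csv', 'blood_sodium.csv']
```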
- to_parquet(folder)[source]#
Save the cohort to parquet files.
- Parameters:
folder (str | PathLike[str]) – Path to the folder where parquet files will be saved.
- Return type:
None
Examples
>>> cohort.to_parquet("output_data")
>>> # Creates:
>>> # output_data/_obs.parquet
>>> # output_data/blood_sodium.parquet
>>> # output_data/heart_rate.parquet
>>> # ... (one file per variable)
- save(filename)[source]#
Save the cohort to a single compressed .corr3 archive (.tar.zst equivalent). Saves cohort.__dict__ (excluding obs, obsm, variables, conn) to state.pkl, obs to obs.parquet, and each obsm DataFrame to obsm_<var_name>.parquet in a temp dir.
- Parameters:
filename (str | PathLike[str]) – Path to the .corr3 archive
- Return type:
None
- Returns:
None
- classmethod load(filename, password=None)[source]#
- Parameters:
filename (str | PathLike[str])
password (str | None)
- Return type:
- property tmpdir_path#
- property widget: ObsWidget#
- to_search_widget(include_sources=None)[source]#
- Parameters:
include_sources (Iterable[str] | None)
- Return type:
JsonWidget | JsonmWidget
- property search_widget: JsonWidget | JsonmWidget#
- unnest(column, prefix='', suffix='', renamer=None)[source]#
Unnest a struct column in the obs DataFrame.
- Parameters:
column (str) – The column to unnest.
prefix (str) – The prefix to add to the unnested column names.
suffix (str) – The suffix to add to the unnested column names.
renamer (Sequence[str] | Callable[[str], str] | None) – The renamer function or list of new names.
- Returns:
The obs DataFrame with the unnested column.
- Return type:
obs (pl.DataFrame)
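As a rough picture of what unnesting a struct column means, the following sketch flattens a nested dict per row in plain Python. It is an assumption-laden illustration: this `unnest` function is a hypothetical stand-in operating on lists of dicts, and the choice that renamer takes precedence over prefix and suffix is made for this example and may differ from the method documented above.

```python
def unnest(rows, column, prefix="", suffix="", renamer=None):
    """Flatten the dict stored under `column` into top-level keys.

    renamer may be a callable applied to each field name or a sequence of new
    names; if it is None, prefix and suffix are applied instead (assumed rule).
    """
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        struct = row[column]
        fields = list(struct.keys())
        if renamer is None:
            names = [f"{prefix}{field}{suffix}" for field in fields]
        elif callable(renamer):
            names = [renamer(field) for field in fields]
        else:
            names = list(renamer)
            if len(names) != len(fields):
                raise ValueError("renamer must provide one name per struct field")
        flat.update(zip(names, struct.values()))
        out.append(flat)
    return out


# Made-up row with a struct-like column
rows = [{"icu_stay_id": "C001_1", "vitals": {"hr": 80, "sbp": 120}}]

print(unnest(rows, "vitals", prefix="first_"))
print(unnest(rows, "vitals", renamer=["heart_rate", "systolic_bp"]))
```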
- debug_print()[source]#
Print debug information about the cohort. Please use this if you are creating a GitHub issue.
- Return type:
None
- Returns:
None
- to_stata(df=None, convert_dates=None, write_index=True, to_file=None)[source]#
Convert the cohort to a Stata DataFrame. You may use cohort.stata to access the dataframe directly. Note that you need to save it to a top-level variable to access it via %%stata.
- Parameters:
df (pd.DataFrame | None) – The DataFrame to be converted to Stata format. Will default to the obs DataFrame if unspecified (default: None)
convert_dates (dict[Hashable, StataDateFormat]) – Dictionary of columns to convert to Stata date format.
write_index (bool) – Whether to write the index as a column.
to_file (str | PathLike[str] | None) – Path to save as .dta file. If left unspecified, the DataFrame will not be saved.
- Returns:
A Pandas Dataframe compatible with Stata if to_file is None.
- Return type:
pd.DataFrame
- property stata: DataFrame | None#
- to_tableone(df=None, ignore_cols=None, filter_query=None, replace_booleans=('Yes', 'No'), display_all=True, groupby=None, normal_cols=None, overall=None, order=None, pval=False, **kwargs)[source]#
Create a TableOne object for the cohort.
- Parameters:
df (pl.DataFrame) – The DataFrame to summarize. Will default to the obs DataFrame if unspecified (default: None)
ignore_cols (list | str | None) – Column(s) to ignore.
filter_query (str | None) – Filter to apply to the data.
replace_booleans (tuple[str, str] | None) – Replace booleans with the given strings.
display_all (bool) – Whether to display all columns.
groupby (str | None) – Column to group by.
normal_cols (list[str] | None) – Columns to treat as normally distributed.
overall (bool) – Whether to add an “overall” column to the table. If left unspecified the overall column will be dropped if groupby is specified.
order (dict[str, list[str]] | None) – Order of categorical columns.
pval (bool) – Whether to calculate p-values.
**kwargs – Additional arguments to pass to TableOne.
- Returns:
A TableOne object.
- Return type:
TableOne
Examples
>>> tableone = cohort.to_tableone()
>>> print(tableone)
>>> tableone.to_csv("tableone.csv")
>>> tableone = cohort.to_tableone(groupby="sex", pval=False)
>>> print(tableone)
>>> tableone.to_csv("tableone_sex.csv")
- property tableone: TableOne#
- property figureone: Digraph#