DataFrame Utilities

Conversion helpers (Polars ↔ Pandas ↔ Stata ↔ TableOne), time-series joins, value cleaning, and aggregation utilities for clinical data frames.

Conversion

corr_vars.utils.frames.convert_to_polars_df(value)
Parameters:

value (pl.DataFrame | pl.LazyFrame | pd.DataFrame)

Return type:

pl.DataFrame

corr_vars.utils.frames.convert_to_polars_lf(value)
Parameters:

value (pl.DataFrame | pl.LazyFrame | pd.DataFrame)

Return type:

pl.LazyFrame

corr_vars.utils.frames.convert_to_pandas_df(value)
Parameters:

value (pl.DataFrame | pl.LazyFrame | pd.DataFrame)

Return type:

pd.DataFrame

corr_vars.utils.frames.convert_to_stata(df, convert_dates=None, write_index=True, to_file=None)

Converts a pandas DataFrame to a Stata-compatible DataFrame, or writes the DataFrame to a Stata file.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be converted to Stata format.

  • convert_dates (dict[Hashable, str]) – Dictionary of columns to convert to Stata date format.

  • write_index (bool) – Whether to write the index as a column.

  • to_file (str | None) – Path to save as .dta file. If left unspecified, the DataFrame will not be saved.

Returns:

A pandas DataFrame compatible with Stata, if to_file is None.

Return type:

pd.DataFrame

corr_vars.utils.frames.convert_to_tableone(df, ignore_cols=None, filter=None, replace_booleans=('Yes', 'No'), display_all=True, groupby=None, normal_cols=None, overall=None, order=None, pval=False, **kwargs)

Creates a TableOne object from the input DataFrame.

Parameters:
  • df (pl.DataFrame) – The input.

  • ignore_cols (list | str | None) – Column(s) to ignore.

  • filter (str | None) – Filter to apply to the data.

  • replace_booleans (tuple[str, str] | None) – Replace booleans with the given strings.

  • display_all (bool) – Whether to display all columns.

  • groupby (str | None) – Column to group by.

  • normal_cols (list[str] | None) – Columns to treat as normally distributed.

  • overall (bool) – Whether to add an “overall” column to the table. If left unspecified, the overall column is dropped when groupby is specified.

  • order (dict[str, list[str]] | None) – Order of categorical columns.

  • pval (bool) – Whether to calculate p-values.

  • **kwargs – Additional arguments to pass to TableOne.

Returns:

A TableOne object.

Return type:

TableOne

Time-Series Operations

corr_vars.utils.frames.time_difference(time_col, reference_col, *, unit='s', total=False)

Calculates the time difference between two columns.

Parameters:
  • time_col (Expr | str)

  • reference_col (Expr | str)

  • unit (Literal['s', 'm', 'h', 'd', 'w'])

  • total (bool)

Return type:

Expr
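
As a plain-Python illustration of the unit codes, the sketch below expresses the difference between two timestamps in one of the supported units. The helper name diff_in_unit, the sign convention (time minus reference), and the fractional result are assumptions made for illustration. The library function itself returns a Polars expression, and the effect of total is not covered here.

```python
from datetime import datetime

# Seconds per supported unit code ('s', 'm', 'h', 'd', 'w').
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def diff_in_unit(time: datetime, reference: datetime, unit: str = "s") -> float:
    """Return (time - reference) expressed in `unit` (hypothetical helper)."""
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit!r}")
    return (time - reference).total_seconds() / SECONDS_PER_UNIT[unit]


admission = datetime(2024, 1, 1, 8, 0)
measurement = datetime(2024, 1, 2, 20, 0)

hours_since_admission = diff_in_unit(measurement, admission, unit="h")  # 36.0
days_since_admission = diff_in_unit(measurement, admission, unit="d")   # 1.5
```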

corr_vars.utils.frames.remove_asof(main, ref, strategy='nearest', by='case_id', on='recordtime', tolerance='1d')

Removes entries from ref to main with the chosen search strategy and tolerance.

Parameters:
  • main (TypeVar(PolarsFrame, bound= DataFrame | LazyFrame))

  • ref (TypeVar(PolarsFrame, bound= DataFrame | LazyFrame))

  • strategy (Literal['backward', 'forward', 'nearest'])

  • by (str)

  • on (str)

  • tolerance (str)

Return type:

TypeVar(PolarsFrame, bound= DataFrame | LazyFrame)
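
One way to read the three strategies is as an as-of anti-join: a row of main is dropped when ref holds a row for the same key whose timestamp lies within the tolerance in the permitted direction. The plain-Python sketch below follows that reading. The reading itself, the helper names, and the tuple-based row layout are all assumptions for illustration and are not taken from the library source.

```python
from datetime import datetime, timedelta


def has_asof_match(t, ref_times, strategy, tolerance):
    """True if any reference time matches t under the strategy and tolerance."""
    for r in ref_times:
        delta = t - r  # positive when the reference lies before t
        if strategy == "backward" and timedelta(0) <= delta <= tolerance:
            return True
        if strategy == "forward" and timedelta(0) <= -delta <= tolerance:
            return True
        if strategy == "nearest" and abs(delta) <= tolerance:
            return True
    return False


def remove_asof_sketch(main, ref, strategy="nearest", tolerance=timedelta(days=1)):
    """Keep only the (case_id, recordtime) rows of main without a match in ref."""
    ref_by_case = {}
    for case_id, t in ref:
        ref_by_case.setdefault(case_id, []).append(t)
    return [
        (case_id, t)
        for case_id, t in main
        if not has_asof_match(t, ref_by_case.get(case_id, []), strategy, tolerance)
    ]


main_rows = [
    (1, datetime(2024, 1, 1, 8)),
    (1, datetime(2024, 1, 5, 8)),
    (2, datetime(2024, 1, 1, 8)),
]
ref_rows = [(1, datetime(2024, 1, 1, 20))]

kept_nearest = remove_asof_sketch(main_rows, ref_rows, strategy="nearest")
kept_backward = remove_asof_sketch(main_rows, ref_rows, strategy="backward")
```

In this example the "nearest" strategy drops the first row (the reference lies 12 hours after it), while "backward" keeps all three rows because no reference precedes any of them.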

corr_vars.utils.frames.find_nearest(df, time, pm_before='52w', pm_after='52w', recordtime='recordtime')

Find the closest record to the target time.

Parameters:
  • df (pl.DataFrame) – The input dataframe.

  • to_col (str) – The column containing the target time.

  • tdelta (str) – The time delta. Defaults to “0”.

  • pm_before (str) – The time range before the target time. Defaults to “52w”.

  • pm_after (str) – The time range after the target time. Defaults to “52w”.

  • time (str | tuple[str, str])

  • recordtime (str)

Returns:

The closest record.

Return type:

pl.DataFrame
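
The windowed nearest-record search can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper name find_nearest_sketch and the list-of-tuples input are hypothetical, the window bounds are treated as inclusive, and ties are resolved by input order, none of which is confirmed by the documentation above.

```python
from datetime import datetime, timedelta


def find_nearest_sketch(records, target,
                        before=timedelta(weeks=52), after=timedelta(weeks=52)):
    """Return the (recordtime, value) pair closest to target within the window.

    Returns None when no record falls inside [target - before, target + after].
    """
    candidates = [
        (t, v) for t, v in records if target - before <= t <= target + after
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda rec: abs(rec[0] - target))


records = [
    (datetime(2024, 1, 1), 1.0),
    (datetime(2024, 1, 10), 2.0),
    (datetime(2024, 3, 1), 3.0),
]
target = datetime(2024, 1, 8)

closest = find_nearest_sketch(records, target)
closest_before = find_nearest_sketch(records, target, after=timedelta(0))
```

Here closest is the record from 10 January (two days away), whereas restricting the window to the past with after=timedelta(0) selects the record from 1 January.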

corr_vars.utils.frames.interval_bucket_agg(data, primary_key='case_id', recordtime='recordtime', recordtime_end=None, recordtime_end_gap_treshold=None, recordtime_end_gap_behavior='drop', duration_empty_impute_duration=None, value='value', tmin=None, tmax=None, t0=None, t_missing_behavior='drop', granularity='1h', same_bin_aggregation=None)

Bins intervals clipped between tmin and tmax into bins aligned to t0 and aggregates on these bins.

Parameters:
  • data (pl.DataFrame | pl.LazyFrame) – The data to transform.

  • primary_key (str) – The primary key column. Defaults to “case_id”.

  • recordtime (str) – The recordtime column. Defaults to “recordtime”.

  • recordtime_end (str | None) – The recordtime_end column. Defaults to None. Will impute recordtime_end from the next recordtime if set to None.

  • recordtime_end_gap_treshold (str | None) – Maximum allowed gap for recordtime_end. Only applies if recordtime_end is set to None.

  • recordtime_end_gap_behavior (str | None) – Determines whether to impute or drop rows when the recordtime_end_gap_treshold is exceeded.

  • duration_empty_impute_duration (str | None) – The duration to impute for rows with an empty duration, applied by offsetting recordtime_end. Such rows are dropped if this is left at its default of None or set to ≤0s.

  • value (str) – The value column. Defaults to “value”.

  • tmin (str | None) – The tmin column. Defaults to None.

  • tmax (str | None) – The tmax column. Defaults to None.

  • t0 (str | None) – The t0 column. Bins will align to this datetime col if specified. Defaults to None.

  • t_missing_behavior (str | None) – Determines whether to drop or keep rows when t0, tmin or tmax columns are specified and values are missing in data. Defaults to “drop”.

  • granularity (str | None) – The size of the bucket intervals. Defaults to 1 hour.

  • same_bin_aggregation (Callable[[pl.Expr], pl.Expr] | None) – The aggregation to perform on value for each bin. Uses pl.sum() if set to None.

Note

The granularity, recordtime_end_gap_treshold and duration_empty_impute_duration arguments are written in the following duration string language:

  • 1ns (1 nanosecond) # nanoseconds are not supported by our DataFrames.

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 calendar day)

  • 1w (1 calendar week)

  • 1mo (1 calendar month)

  • 1q (1 calendar quarter)

  • 1y (1 calendar year)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours later, due to daylight saving time). The same applies to “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

If you do not wish to use “calendar day” or any other “calendar” durations, specify the duration in hours or smaller units instead.
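
To make the string format concrete, the sketch below parses such a string into a datetime.timedelta. It is a simplified illustration and not the parser the library or Polars uses: parse_duration is a hypothetical helper, days and weeks are treated as fixed 24-hour and 7-day spans rather than calendar-aware ones, and the units without a constant length (mo, q, y) as well as ns are rejected.

```python
import re
from datetime import timedelta

# Fixed-length units only; calendar units (mo, q, y) have no constant length.
_UNIT_KWARGS = {
    "us": "microseconds",
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",   # simplification: exactly 24 hours
    "w": "weeks",  # simplification: exactly 7 * 24 hours
}
# Longer unit codes come first so that "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y)")


def parse_duration(text: str) -> timedelta:
    """Convert a string such as '3d12h4m25s' into a timedelta (simplified sketch)."""
    total = timedelta(0)
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected characters in {text!r}")
        amount, unit = int(match.group(1)), match.group(2)
        if unit not in _UNIT_KWARGS:
            raise ValueError(f"unit {unit!r} has no fixed length in this sketch")
        total += timedelta(**{_UNIT_KWARGS[unit]: amount})
        pos = match.end()
    if not text or pos != len(text):
        raise ValueError(f"could not parse {text!r}")
    return total


combined = parse_duration("3d12h4m25s")
one_hour = parse_duration("1h")
```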

Returns:

The clipped and binned data. Same type as data.

Return type:

pl.DataFrame | pl.LazyFrame
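
The clip-then-bin idea can be illustrated without Polars. The sketch below clips each interval to [tmin, tmax], finds the bins aligned to t0 that the interval overlaps, and sums the values per bin. Two points are assumptions rather than documented behaviour: an interval's value is counted in full in every bin it overlaps (it is not apportioned by overlap length), and bins are half-open, so an interval ending exactly on a bin edge does not reach the next bin. The helper name bucket_intervals and the tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_intervals(intervals, t0, granularity=timedelta(hours=1),
                     tmin=None, tmax=None):
    """Sum interval values per bin index k, where bin k covers
    [t0 + k * granularity, t0 + (k + 1) * granularity)."""
    sums = defaultdict(float)
    for start, end, value in intervals:
        # Clip the interval to [tmin, tmax] when bounds are given.
        if tmin is not None:
            start = max(start, tmin)
        if tmax is not None:
            end = min(end, tmax)
        if end <= start:
            continue  # nothing left after clipping
        first = (start - t0) // granularity
        # Step back one microsecond so an interval that ends exactly on a
        # bin edge does not spill into the following bin.
        last = (end - t0 - timedelta(microseconds=1)) // granularity
        for k in range(first, last + 1):
            sums[k] += value
    return dict(sums)


t0 = datetime(2024, 1, 1, 0, 0)
intervals = [
    (datetime(2024, 1, 1, 0, 30), datetime(2024, 1, 1, 2, 15), 10.0),
    (datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 2, 0), 5.0),
]

hourly = bucket_intervals(intervals, t0)
hourly_clipped = bucket_intervals(intervals, t0, tmax=datetime(2024, 1, 1, 1, 30))
```

With these inputs the first interval touches bins 0 to 2 and the second only bin 1, so hourly is {0: 10.0, 1: 15.0, 2: 10.0}. Clipping at 01:30 removes bin 2.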

Value Operations

corr_vars.utils.frames.apply_cleaning(df, cleaning)

Apply cleaning rules to filter out invalid values.

Parameters:
  • df (DataFrame)

  • cleaning (dict[str, dict[Literal['low', 'high'], Any]])

Returns:

Cleaned DataFrame

Return type:

pl.DataFrame
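
The shape of the cleaning argument suggests per-column bounds. The plain-Python sketch below shows one way such rules could be applied to a list of row dictionaries. It rests on assumptions that the documentation does not confirm: bounds are inclusive, a missing 'low' or 'high' key means no bound on that side, missing values are not treated as invalid, and a whole row is dropped when any value is out of range. The helper name and the example columns are hypothetical.

```python
def apply_cleaning_sketch(rows, cleaning):
    """Keep rows whose values respect the optional 'low'/'high' bounds per column."""

    def within_bounds(row):
        for column, bounds in cleaning.items():
            value = row.get(column)
            if value is None:
                continue  # assumption: missing values are not treated as invalid
            low, high = bounds.get("low"), bounds.get("high")
            if low is not None and value < low:
                return False
            if high is not None and value > high:
                return False
        return True

    return [row for row in rows if within_bounds(row)]


rows = [
    {"heart_rate": 72, "temperature": 36.8},
    {"heart_rate": 400, "temperature": 37.1},   # heart_rate above 'high'
    {"heart_rate": 65, "temperature": 12.0},    # temperature below 'low'
    {"heart_rate": None, "temperature": 38.2},  # missing value is kept
]
cleaning = {
    "heart_rate": {"low": 20, "high": 300},
    "temperature": {"low": 25},
}

cleaned_rows = apply_cleaning_sketch(rows, cleaning)
```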

corr_vars.utils.frames.absolute_and_relative_value_counts(input, cols, sort='counts', decimals=2)

Returns the absolute and relative value_counts for the column(s) cols of a pl.LazyFrame or pl.DataFrame.

Parameters:
  • input (LazyFrame | DataFrame)

  • cols (list[str] | str)

  • sort (Optional[Literal['cols', 'counts']])

  • decimals (int)

Return type:

DataFrame
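
For a single column, the combination of absolute and relative counts can be sketched with the standard library. The helper name value_counts_sketch is hypothetical, and two details are assumptions: the relative value is expressed as a fraction of all rows (not a percentage), and results are ordered by descending count.

```python
from collections import Counter


def value_counts_sketch(values, decimals=2):
    """Return (value, absolute count, relative share) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, n, round(n / total, decimals))
        for value, n in counts.most_common()
    ]


wards = ["ICU", "ICU", "ward", "ICU", "ER", "ward"]
ward_counts = value_counts_sketch(wards)
# [('ICU', 3, 0.5), ('ward', 2, 0.33), ('ER', 1, 0.17)]
```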

corr_vars.utils.frames.unique_sucessive(value)

Drops successive duplicate values of a list column.

Parameters:

value (Expr | str)

Return type:

Expr
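
The effect on a single list can be shown in plain Python with itertools.groupby, which groups runs of equal neighbouring items. This is only an illustration of the behaviour described above, with a hypothetical helper name. The library function operates on a Polars list column and returns an expression.

```python
from itertools import groupby


def unique_successive_sketch(values):
    """Collapse runs of equal neighbouring values, keeping the first of each run."""
    return [key for key, _run in groupby(values)]


locations = ["ICU", "ICU", "ward", "ward", "ICU"]
collapsed = unique_successive_sketch(locations)
# ['ICU', 'ward', 'ICU']
```

Values that recur after a different value are kept, which distinguishes this from a plain deduplication.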

corr_vars.utils.frames.as_expr(value)
Parameters:

value (Expr | str)

Return type:

Expr