Legacy Interface (Pandas-Based)#

🔄 Migration Recommended

For new projects, use the Polars-native interface (from corr_vars import Cohort) which is 2-10x faster and more memory-efficient.

This legacy interface is provided for backward compatibility only.

Overview#

The legacy interface provides a Pandas-based wrapper around the new Polars-native CORR-Vars interface. It was created to maintain backward compatibility with existing code that uses Pandas DataFrame methods, allowing researchers to continue using their existing analysis pipelines without immediate rewrites.

Interface Comparison at a Glance#

| Aspect | Legacy Interface 🐼 | Polars-Native Interface ⚡ |
| --- | --- | --- |
| Import | from corr_vars.legacy_v1 import Cohort | from corr_vars import Cohort |
| Data Access | cohort.obs (Pandas DataFrame) | cohort.obs (Polars DataFrame) |
| Performance | Slower (conversion overhead) | 2-10x faster |
| Memory Usage | Higher (dual storage) | Lower (single storage) |
| Syntax | Familiar Pandas syntax | Modern Polars expressions |
| Recommendation | Existing code migration | New projects |

What is the Legacy Interface?#

Note

Architecture Overview

The legacy interface is a compatibility layer that bridges old and new:

🐼 Your Pandas Code (.loc[], .groupby(), .head()) → 🔄 Legacy Wrapper (automatic conversion) → ⚡ Polars Backend (fast & efficient)

Key Features:

🔄 Automatic Conversion

Seamlessly converts between Polars (internal) and Pandas (user-facing) representations

🐼 Familiar Syntax

Preserves familiar Pandas methods like .loc[], .iloc[], .groupby()

⚡ Modern Backend

Uses the new Polars backend internally for improved performance and stability

🔧 Backward Compatible

Maintains compatibility with existing analysis scripts and workflows

Using the Legacy Interface#

# Legacy interface (Pandas access)
from corr_vars.legacy_v1 import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    database="db_hypercapnia_prepared",
    password_file=True
)

# Pandas DataFrame access
print(type(cohort.obs))    # pandas.DataFrame wrapper
print(type(cohort.obsm))   # dict of pandas.DataFrame wrappers

# Polars-native interface (Recommended)
from corr_vars import Cohort

cohort = Cohort(
    obs_level="icu_stay",
    sources={"cub_hdp": {"database": "db_hypercapnia_prepared", "password_file": True}}
)

# Polars DataFrame access
print(type(cohort.obs))    # polars.DataFrame
print(type(cohort.obsm))   # dict of polars.DataFrame
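
The two constructor calls above differ mainly in how connection settings are passed: the legacy call takes database and password_file directly, while the Polars-native call nests them under a named entry in sources. A minimal sketch of that mapping follows. The helper legacy_kwargs_to_sources is hypothetical (it is not part of corr_vars), and the default source name "cub_hdp" is simply copied from the example above.

```python
from typing import Any


def legacy_kwargs_to_sources(
    database: str,
    password_file: Any = None,
    source_name: str = "cub_hdp",  # assumption: source key taken from the example above
) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: build a `sources` mapping from legacy-style arguments."""
    config: dict[str, Any] = {"database": database}
    if password_file is not None:
        config["password_file"] = password_file
    return {source_name: config}


sources = legacy_kwargs_to_sources("db_hypercapnia_prepared", password_file=True)
print(sources)
```

The resulting dictionary has the same shape as the sources argument shown in the Polars-native example and could be passed as Cohort(obs_level="icu_stay", sources=sources).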

💡 Quick Start Guide

Step 1: Import the legacy interface

Step 2: Create your cohort with familiar parameters

Step 3: Use standard Pandas syntax for analysis

Pandas-Style Data Access:

# Static data access (exactly like Pandas)
print(cohort.obs.head())                    # First 5 rows
print(cohort.obs.shape)                     # (n_rows, n_cols)
print(cohort.obs.columns.tolist())          # Column names

# Familiar Pandas indexing and filtering
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
males = cohort.obs.loc[cohort.obs["sex"] == "M"]
specific_patient = cohort.obs.iloc[0]

# Pandas aggregation methods
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})
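
A dict-style aggregation like the one above returns a two-level (MultiIndex) column header in Pandas, which can be awkward in downstream code. The sketch below uses a small toy DataFrame in place of cohort.obs (the column names are taken from the example above, the values are invented) and shows Pandas named aggregation as one way to get flat column names.

```python
import pandas as pd

# Toy stand-in for cohort.obs; values are invented for illustration.
obs = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "age_on_admission": [60, 70, 50, 80],
    "inhospital_death": [0, 1, 1, 1],
})

# Dict-style aggregation: columns come back as a MultiIndex.
nested = obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum",
})
print(nested.columns.tolist())

# Named aggregation: one flat, explicitly named column per statistic.
flat = obs.groupby("sex").agg(
    age_mean=("age_on_admission", "mean"),
    age_std=("age_on_admission", "std"),
    deaths=("inhospital_death", "sum"),
).reset_index()
print(flat)
```

Flat names such as age_mean and deaths also line up with the aliases used in Polars aggregations, which can make a later migration easier.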

Time-Series Data Access:

# Add dynamic variable
cohort.add_variable("blood_sodium")

# Access time-series data (Pandas DataFrame)
sodium_data = cohort.obsm["blood_sodium"]
print(type(sodium_data))  # <class 'LegacyObsmDataframe'> (behaves like pd.DataFrame)

# Familiar Pandas time-series operations
patient_data = sodium_data[sodium_data["icu_stay_id"] == "12345"]
daily_avg = sodium_data.groupby(sodium_data["recordtime"].dt.date)["value"].mean()

# Standard Pandas methods work
print(sodium_data.describe())
print(sodium_data.value_counts())
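
The daily average above pools all patients into one series. If a per-stay daily average is wanted instead, grouping by both the stay identifier and the calendar day is one option. The sketch below uses a toy DataFrame with the same column names as the example (icu_stay_id, recordtime, value); the measurements are invented.

```python
import pandas as pd

# Toy stand-in for cohort.obsm["blood_sodium"]; values are invented.
sodium_data = pd.DataFrame({
    "icu_stay_id": ["12345", "12345", "12345", "67890"],
    "recordtime": pd.to_datetime([
        "2024-01-01 06:00", "2024-01-01 18:00",
        "2024-01-02 06:00", "2024-01-01 12:00",
    ]),
    "value": [138.0, 142.0, 135.0, 140.0],
})

# Derive the calendar day, then average within each (stay, day) pair.
daily_per_stay = (
    sodium_data
    .assign(day=lambda d: d["recordtime"].dt.date)
    .groupby(["icu_stay_id", "day"])["value"]
    .mean()
    .reset_index(name="daily_mean")
)
print(daily_per_stay)
```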

Column Assignment (Limited):

# Direct column assignment works for obs
cohort.obs["bmi"] = cohort.obs["weight"] / (cohort.obs["height"] / 100) ** 2  # BMI in kg/m^2; assumes weight in kg, height in cm
cohort.obs["is_elderly"] = cohort.obs["age_on_admission"] > 65

# Note: obsm DataFrames are read-only to prevent data corruption
# sodium_data["new_col"] = 1  # This will raise NotImplementedError
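
The assignment above stores a numeric BMI value. If labelled bands are wanted as well, pd.cut is one way to derive them. The sketch below uses a toy DataFrame in place of cohort.obs, assumes weight in kg and height in cm, and applies the commonly used WHO adult cut-offs (18.5, 25, 30); the band labels are illustrative.

```python
import pandas as pd

# Toy stand-in for cohort.obs; units are assumptions (kg and cm).
obs = pd.DataFrame({
    "weight": [50.0, 70.0, 85.0, 110.0],
    "height": [180.0, 175.0, 175.0, 170.0],
})

obs["bmi"] = obs["weight"] / (obs["height"] / 100) ** 2

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
obs["bmi_category"] = pd.cut(
    obs["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)
print(obs)
```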

Pandas Method Compatibility#

✅ Broad Pandas Compatibility

The legacy interface supports most common Pandas DataFrame methods out of the box. The groups below list frequently used examples; note that obsm DataFrames remain read-only.

🔍 Data Inspection
  • .head(), .tail(), .info()
  • .describe(), .shape, .columns
  • .nunique(), .value_counts()
  • .isnull(), .dtypes

🎯 Indexing & Selection
  • .loc[], .iloc[], .at[], .iat[]
  • .query(), boolean indexing
  • .filter(), .select_dtypes()

🔧 Data Manipulation
  • .groupby(), .pivot_table()
  • .merge(), .join()
  • .sort_values(), .drop()
  • .drop_duplicates()

📊 Statistical Methods
  • .mean(), .median(), .std()
  • .corr(), .agg(), .apply()
  • .transform()

⏰ Time-Series Methods
  • .resample(), .rolling()
  • .expanding()
  • DateTime indexing

📈 Visualization Ready
  • Direct plotting with matplotlib
  • Seaborn compatibility
  • Works with existing viz code

Limitations of the Legacy Interface#

Warning

Important Limitations to Consider

While the legacy interface maintains compatibility, it has several important limitations that may affect performance and functionality.

Detailed Limitation Analysis

Understanding these limitations will help you decide when to migrate to the Polars-native interface.

Performance Limitations#

# Legacy interface: data conversion overhead
from corr_vars.legacy_v1 import Cohort as LegacyCohort
large_cohort = LegacyCohort(obs_level="icu_stay", load_default_vars=True)  # Slower

# Polars-native: direct access, no conversion
from corr_vars import Cohort as PolarsCohort
fast_cohort = PolarsCohort(obs_level="icu_stay", load_default_vars=True)  # Faster

  1. Memory Overhead: Data is stored in Polars but converted to Pandas for access, requiring additional memory

  2. Conversion Costs: Each access to .obs or .obsm triggers a Polars → Pandas conversion

  3. Large Dataset Issues: Very large cohorts may hit memory limits during conversion

  4. Slower Operations: Pandas operations are generally slower than equivalent Polars operations
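
Because each access to .obs or .obsm is described as triggering a conversion, binding the result to a local variable once and reusing it should avoid repeated work. The sketch below illustrates the idea with a hypothetical stand-in class, ConvertingCohort, which copies a Pandas DataFrame on every access and counts how often that happens. It is not the corr_vars implementation, and the real conversion cost will differ.

```python
import pandas as pd


class ConvertingCohort:
    """Hypothetical stand-in: 'converts' (here: copies) the data on every .obs access."""

    def __init__(self, frame: pd.DataFrame) -> None:
        self._frame = frame      # stands in for the internal Polars frame
        self.conversions = 0     # counts how many conversions were triggered

    @property
    def obs(self) -> pd.DataFrame:
        self.conversions += 1
        return self._frame.copy()  # stands in for the Polars -> Pandas conversion


cohort = ConvertingCohort(pd.DataFrame({"age_on_admission": [15, 42, 67]}))

# Inline style: .obs is accessed twice, so two conversions happen.
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
accesses_inline = cohort.conversions

# Hoisted style: convert once, then reuse the local DataFrame.
obs = cohort.obs
adults_again = obs[obs["age_on_admission"] >= 18]
accesses_hoisted = cohort.conversions - accesses_inline

print(accesses_inline, accesses_hoisted)
```

Both styles select the same rows; the hoisted version simply pays for the conversion once.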

Functional Limitations#

# 1. Limited obsm modification
cohort.obsm["blood_sodium"]["new_column"] = 1  # NotImplementedError

# 2. No direct polars access
# cohort._obs.filter(pl.col("age") > 18)  # Not recommended, internal API

# 3. Some advanced Polars features unavailable
# No lazy evaluation, no expression API

  1. Read-Only obsm: Time-series DataFrames (obsm) are read-only to prevent data corruption

  2. No Polars Expression API: Cannot use Polars' powerful expression syntax

  3. No Lazy Evaluation: Cannot benefit from Polars' lazy evaluation optimizations

  4. Limited Parallel Processing: Pandas operations are less optimized for parallel execution

Data Type Limitations#

# Some Polars data types don't translate perfectly to Pandas
# May lose precision or type information in edge cases
print(cohort.obs.dtypes)  # May show different types than native Polars

  1. Type Conversion Issues: Some Polars types may not translate perfectly to Pandas

  2. Precision Loss: Potential precision loss in numeric conversions

  3. Missing Value Handling: Different null/missing value semantics between libraries
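
One concrete case of points 2 and 3: an integer column containing missing values is typically represented in NumPy-backed Pandas as float64 with NaN, and float64 cannot represent every 64-bit integer exactly. The sketch below reproduces the effect with Pandas alone, using a nullable Int64 series as the "exact" reference. Whether the legacy wrapper performs this particular conversion for a given column is an assumption to check against your own data.

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer that float64 cannot represent exactly

# Nullable integer dtype: keeps the exact value and a true missing marker.
exact = pd.Series([big, None], dtype="Int64")

# NumPy-backed float column: missing becomes NaN, large integers get rounded.
lossy = exact.astype("float64")

print(exact[0], int(lossy[0]))
print(exact.dtype, lossy.dtype)
```

Comparing cohort.obs.dtypes against the expected schema, especially for identifier-like integer columns, is a cheap way to spot this early.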

Migration Guide: Legacy → Polars-Native#

Step 1: Update Imports

# Before (Legacy)
from corr_vars.legacy_v1 import Cohort

# After (Polars-native)
from corr_vars import Cohort
import polars as pl  # needed for the pl.col(...) expressions in the following steps

Step 2: Update Data Access Patterns

# Legacy Pandas syntax
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
male_patients = cohort.obs.loc[cohort.obs["sex"] == "M"]

# Polars-native equivalent
adults = cohort.obs.filter(pl.col("age_on_admission") >= 18)
male_patients = cohort.obs.filter(pl.col("sex") == "M")

Step 3: Update Aggregations

# Legacy Pandas groupby
summary = cohort.obs.groupby("sex").agg({
    "age_on_admission": ["mean", "std"],
    "inhospital_death": "sum"
})

# Polars-native equivalent
summary = cohort.obs.group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("age_mean"),
    pl.col("age_on_admission").std().alias("age_std"),
    pl.col("inhospital_death").sum().alias("deaths")
])

Step 4: Update Time-Series Operations

# Legacy Pandas time-series
patient_data = cohort.obsm["blood_sodium"][
    cohort.obsm["blood_sodium"]["icu_stay_id"] == "12345"
]

# Polars-native equivalent
patient_data = cohort.obsm["blood_sodium"].filter(
    pl.col("icu_stay_id") == "12345"
)

Benefits of Migration:

  1. 2-10x Performance Improvement for most operations

  2. Lower Memory Usage (no conversion overhead)

  3. Better Type Safety and error handling

  4. Access to Modern Features like lazy evaluation and expression API

  5. Future-Proof Code, as the legacy interface may be deprecated

When to Use Legacy vs. Polars-Native#

🐼 Use Legacy Interface When:
✅ Migrating Existing Code

You have extensive Pandas-based analysis pipelines

✅ Team Training Time

Your team needs time to learn Polars syntax

✅ External Dependencies

Your code integrates with Pandas-only libraries

✅ Proofs of Concept

Quick prototyping with familiar syntax

⚠️ Temporary Migration Step

Use as a stepping stone to Polars-native

⚡ Use Polars-Native Interface When:
✅ New Projects

Starting fresh analysis projects

✅ Performance Critical

Working with large datasets or complex operations

✅ Memory Constrained

Limited memory environments

✅ Production Code

Building robust, long-term analysis pipelines

✅ Modern Features

You want to leverage advanced Polars capabilities

🚀 Future-Proof Choice

Recommended for all new development

🎯 Decision Matrix

| Your Situation | Legacy Interface 🐼 | Polars-Native ⚡ |
| --- | --- | --- |
| New research project | ❌ Not recommended | ✅ Recommended |
| Existing Pandas codebase | ✅ Good transition option | 🔄 Migrate gradually |
| Large datasets (>1 GB) | ⚠️ Performance issues | ✅ Optimal performance |
| Team learning curve | ✅ Familiar syntax | 📚 Investment in learning |
| Production deployment | ⚠️ Legacy, may deprecate | ✅ Future-proof |

Example: Side-by-Side Comparison#

🔍 Real-World Performance Example

Both examples below compute the same summaries (the layout of the result differs between Pandas and Polars), but with different performance characteristics.

from corr_vars.legacy_v1 import Cohort

# Create cohort (slower initialization)
cohort = Cohort(obs_level="icu_stay", database="db_hypercapnia_prepared")

# Pandas-style analysis (familiar syntax)
adults = cohort.obs[cohort.obs["age_on_admission"] >= 18]
summary = adults.groupby("sex").agg({
    "age_on_admission": "mean",
    "inhospital_death": "sum"
})

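# Time-series analysis (the variable must be added before it appears in obsm)
cohort.add_variable("blood_sodium")
sodium = cohort.obsm["blood_sodium"]
patient_trends = sodium.groupby("icu_stay_id")["value"].agg(["first", "last", "mean"])

# Polars-native interface
from corr_vars import Cohort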
import polars as pl

# Create cohort (faster initialization)
cohort = Cohort(obs_level="icu_stay", sources={"cub_hdp": {"database": "db_hypercapnia_prepared"}})

# Polars-style analysis (faster execution)
summary = cohort.obs.filter(pl.col("age_on_admission") >= 18).group_by("sex").agg([
    pl.col("age_on_admission").mean().alias("mean_age"),
    pl.col("inhospital_death").sum().alias("deaths")
])

# Time-series analysis (more efficient); add the variable before reading obsm
cohort.add_variable("blood_sodium")
patient_trends = cohort.obsm["blood_sodium"].group_by("icu_stay_id").agg([
    pl.col("value").first().alias("first_sodium"),
    pl.col("value").last().alias("last_sodium"),
    pl.col("value").mean().alias("mean_sodium")
])

🚀 Ready to Migrate?

Start your migration with the Tutorials and Getting Started pages, and explore the Custom Variables Guide to learn modern Polars patterns!

📚 Related Documentation

API Reference#