Description
Title: GroupBy operations consume excessive memory with many unique groups
Problem Description
The GroupBy operation in excel_toolkit/operations/aggregating.py can consume excessive memory when dealing with datasets that have many unique groups (e.g., 500k rows with 100k unique groups). The current implementation creates in-memory hash tables that can grow very large.
Current Behavior
In operations/aggregating.py (line 241):
```python
df.groupby(by=group_columns)
```
This creates:
- A hash table mapping group keys to row indices
- Intermediate aggregation results for each group
- MultiIndex column structures for multiple aggregations
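
For a concrete sense of these costs before running the full operation, something like the sketch below could be used. It is illustrative only: `inspect_groupby_cost` is a hypothetical helper, not part of excel_toolkit, and it assumes a pandas `DataFrame` `df` plus the same `group_columns` list.
```python
import pandas as pd

def inspect_groupby_cost(df: pd.DataFrame, group_columns: list[str]) -> None:
    """Hypothetical helper (not in excel_toolkit): print rough GroupBy cost indicators."""
    n_groups = df.groupby(by=group_columns, sort=False).ngroups  # entries in the group hash table
    print(f"{len(df):,} rows -> {n_groups:,} groups "
          f"({n_groups / max(len(df), 1):.1%} of rows are unique keys)")
    # Memory held by the grouping columns themselves (deep=True counts string contents)
    key_bytes = df[group_columns].memory_usage(deep=True).sum()
    print(f"Group-key columns occupy ~{key_bytes / 1e6:.1f} MB")
```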
Memory Impact
The memory usage of GroupBy depends heavily on the number of unique groups:
| Rows | Unique Groups | Memory Usage | Performance |
|---|---|---|---|
| 50k | 10 | 50-100MB | ✅ Fast (1-2s) |
| 50k | 1,000 | 100-200MB | ✅ Fast (2-5s) |
| 500k | 100 | 500MB-1GB | ✅ Fast (5-10s) |
| 500k | 10,000 | 1-2GB | |
| 500k | 100,000 | 2-5GB | ❌ Slow (30-60s) |
| 500k | 400,000 | 5-10GB | 💀 May crash |
Real-World Scenarios
Scenario 1: E-commerce orders grouped by user_id
500k orders × 400k unique users
GroupBy hash table: 400k entries × 64 bytes = 25MB
Intermediate results: 400k groups × multiple aggregations = 500MB-2GB
Total: 1-3GB
Time: 30-60 seconds
Scenario 2: Transaction logs grouped by (date, store, product)
500k transactions × 50k unique (date, store, product) combinations
GroupBy hash table: 50k entries × 128 bytes (compound key) = 6.4MB
Intermediate results: 50k groups × 10 aggregations = 1-2GB
Total: 1.5-3GB
Time: 15-30 seconds
Scenario 3: High-cardinality grouping (worst case)
500k rows grouped by unique_id
GroupBy: 500k groups
Result: 500k rows (one per group)
Memory: 3-5GB for hash tables + intermediates
Time: 60+ seconds or crash
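
The worst case can be reproduced with synthetic data along these lines (a rough sketch; the column names and sizes are made up to mirror the scenario above):
```python
import numpy as np
import pandas as pd

# Synthetic worst case: every row is its own group
n = 500_000
df = pd.DataFrame({
    "unique_id": [f"id-{i}" for i in range(n)],   # high-cardinality string key
    "Revenue": np.random.rand(n) * 100,
    "Profit": np.random.rand(n) * 10,
})

result = df.groupby("unique_id").agg({"Revenue": ["sum", "mean"], "Profit": ["sum", "mean"]})
print(len(result))  # one output row per input row: 500,000 groups
```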
Affected Files
- `excel_toolkit/operations/aggregating.py` (lines 241-258)
- `excel_toolkit/commands/group.py`
- `excel_toolkit/commands/aggregate.py`
Specific Issues
1. No Streaming GroupBy
The current implementation loads everything, groups everything, and only then writes the result; there is no incremental processing.
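
For comparison, incremental processing could look roughly like the sketch below, which accumulates per-chunk sums and counts and combines them at the end. It assumes the data can be read in chunks (e.g., from CSV); the function name, chunk size, and sum/mean-only scope are illustrative, not a proposed API.
```python
import pandas as pd
from pathlib import Path

def groupby_sum_mean_chunked(csv_path: Path, group_columns: list[str],
                             value_column: str, chunksize: int = 50_000) -> pd.DataFrame:
    """Illustrative sketch only: keep running sum/count per group so the mean
    can be derived without holding all rows in memory at once."""
    partials = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        partials.append(chunk.groupby(group_columns)[value_column].agg(["sum", "count"]))
    combined = pd.concat(partials).groupby(level=group_columns).sum()
    return pd.DataFrame({
        f"{value_column}_sum": combined["sum"],
        f"{value_column}_mean": combined["sum"] / combined["count"],
    })
```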
2. MultiIndex Column Overhead
When aggregating multiple columns:
```python
# Lines 244-258: creates MultiIndex columns
df.groupby(by=group_columns).agg({
    'Revenue': ['sum', 'mean'],
    'Profit': ['sum', 'mean']
})
```
This creates hierarchical column structures that consume extra memory.
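
For reference, pandas' named aggregation avoids the MultiIndex entirely by producing flat column names. A minimal sketch (shown as an alternative, not current toolkit behavior):
```python
# Named aggregation yields flat columns such as 'Revenue_sum' instead of a MultiIndex
df.groupby(by=group_columns).agg(
    Revenue_sum=('Revenue', 'sum'),
    Revenue_mean=('Revenue', 'mean'),
    Profit_sum=('Profit', 'sum'),
    Profit_mean=('Profit', 'mean'),
)
```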
3. No Group Cardinality Validation
No warning when grouping by high-cardinality columns that could create millions of groups.
4. String GroupBy is Slow
Grouping by string columns (e.g., user_id, email) is slower than numeric grouping due to:
- String hashing
- String comparison overhead
- Memory allocation for string objects
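
A rough way to observe this overhead in isolation is to time the same aggregation with object keys versus categorical keys (illustrative sketch; timings vary by machine):
```python
import timeit
import numpy as np
import pandas as pd

n = 500_000
df = pd.DataFrame({
    "user_id": np.random.choice([f"user-{i}" for i in range(10_000)], size=n),
    "Revenue": np.random.rand(n),
})

as_string = timeit.timeit(lambda: df.groupby("user_id")["Revenue"].sum(), number=3)
df["user_id"] = df["user_id"].astype("category")
as_category = timeit.timeit(lambda: df.groupby("user_id", observed=True)["Revenue"].sum(), number=3)
print(f"object keys: {as_string:.2f}s, categorical keys: {as_category:.2f}s")
```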
Proposed Solutions
1. Add Group Cardinality Validation
```python
import pandas as pd

def validate_groupby_cardinality(df: pd.DataFrame, group_columns: list[str]) -> None:
    """Validate that groupby won't create too many groups."""
    total_rows = len(df)
    for col in group_columns:
        unique_count = df[col].nunique()
        # Warn if more than 50% unique values
        if unique_count > total_rows * 0.5:
            print(f"WARNING: Column '{col}' has {unique_count} unique values "
                  f"({unique_count / total_rows * 100:.1f}%)")
            print(f"This will create {unique_count} groups and may be slow")
        # Error if more than 90% unique (probably not what the user wants)
        if unique_count > total_rows * 0.9:
            raise ValueError(
                f"Column '{col}' has {unique_count} unique values "
                f"({unique_count / total_rows * 100:.1f}%). "
                f"This is likely too many groups for meaningful aggregation."
            )
```
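
A possible call site, shown only as a sketch (the exact integration point in the commands is not specified here; `agg_specs` stands for whatever spec the command builds):
```python
# Hypothetical: run the check before the real aggregation in xl group / xl aggregate
validate_groupby_cardinality(df, group_columns)
result = df.groupby(by=group_columns).agg(agg_specs)
```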
2. Implement Streaming GroupBy for Known Groups
```python
from pathlib import Path

import pandas as pd

def groupby_streaming(df: pd.DataFrame, group_columns: list[str], output_path: Path):
    """Perform GroupBy in chunks when the group set is known in advance."""
    # Get unique groups first (sample if too many)
    unique_groups = df[group_columns].drop_duplicates()
    if len(unique_groups) > 10000:
        print("Too many groups for streaming, using standard GroupBy")
        return df.groupby(group_columns).agg(...)
    # Process each group separately
    results = []
    for _, group in unique_groups.iterrows():  # iterate group-key rows, not column names
        mask = (df[group_columns] == group).all(axis=1)
        group_df = df[mask]
        # Aggregate this group (aggregate_group is a placeholder in this proposal)
        result = aggregate_group(group_df)
        results.append(result)
    return pd.concat(results)
```
3. Optimize Data Types Before GroupBy
```python
import pandas as pd

def optimize_for_groupby(df: pd.DataFrame, group_columns: list[str]) -> pd.DataFrame:
    """Optimize data types to reduce GroupBy memory usage."""
    # Convert strings to categorical if low cardinality
    for col in group_columns:
        if df[col].dtype == 'object':
            unique_count = df[col].nunique()
            if unique_count < len(df) * 0.5:  # Less than 50% unique
                df[col] = df[col].astype('category')
    # Downcast numeric columns
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df
```
4. Lazy GroupBy with Dask
```python
from pathlib import Path

import dask.dataframe as dd

def groupby_lazy(df_path: Path, group_columns: list[str]):
    """Use Dask for out-of-core GroupBy."""
    ddf = dd.read_csv(df_path)
    result = ddf.groupby(group_columns).agg(...).compute()
    return result
```
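
A usage sketch, assuming the input has already been converted to a format Dask can read (e.g., CSV or Parquet) and reusing the aggregation spec from earlier in this issue; the file and column names are hypothetical:
```python
import dask.dataframe as dd

# "orders.csv" and the column names are hypothetical
ddf = dd.read_csv("orders.csv")
result = (
    ddf.groupby(["user_id"])
       .agg({"Revenue": ["sum", "mean"], "Profit": ["sum", "mean"]})
       .compute()  # only this step materializes results in memory
)
```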
5. Add Progress Indicators
```python
import pandas as pd
from tqdm import tqdm

def groupby_with_progress(df, group_columns, agg_specs):
    """Show progress during GroupBy operation."""
    unique_groups = df[group_columns].drop_duplicates()
    results = []
    for _, group in tqdm(unique_groups.iterrows(), total=len(unique_groups), desc="Grouping"):
        mask = (df[group_columns] == group).all(axis=1)
        group_df = df[mask]
        result = group_df.agg(agg_specs)
        results.append(result)
    return pd.concat(results)
```
Additional Optimizations
- Use `observed=True` for categorical GroupBy:
  ```python
  df.groupby('category_column', observed=True)  # Only compute observed categories
  ```
- Limit the number of aggregations:
  ```python
  MAX_AGGREGATIONS = 10
  if len(agg_specs) > MAX_AGGREGATIONS:
      print(f"WARNING: {len(agg_specs)} aggregations may be slow")
  ```
- Sample-based preview:
  ```python
  # Preview on a sample first
  sample = df.sample(min(10000, len(df)))
  sample_result = sample.groupby(group_columns).agg(...)
  print(f"Sample result: {len(sample_result)} groups")
  ```
Related Issues
- File loading memory issues (Feature/command refactoring #1)
- Memory monitoring needed (xl group command rejects multiple aggregations on the same column #6)