-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Description
The xl group command does not allow multiple aggregations on the same column. Users cannot get sum, count, mean, min, max in a single operation and must run separate commands for each statistic.
Current Behavior
$ xl group "sales.xlsx" -s DONNEES --by "Circuit" --aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean"
Error parsing aggregation specifications:
- Duplicate column 'Qté VTE'
- Duplicate column 'Qté VTE'The command fails with a "duplicate column" error when trying to aggregate the same column multiple ways.
Expected Behavior
The command should allow multiple aggregations per column:
$ xl group "sales.xlsx" -s DONNEES --by "Circuit" \
--aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean,Qté VTE:min,Qté VTE:max"
+-----------+---------------+------------------+--------------+-------------+-------------+
| Circuit | Qté VTE_sum | Qté VTE_count | Qté VTE_mean | Qté VTE_min | Qté VTE_max |
+===========+===============+==================+==============+=============+=============+
| 24H | 5890 | 1967 | 2.994 | -2 | 483 |
+-----------+---------------+------------------+--------------+-------------+-------------+
| GP | 23788 | 4073 | 5.842 | -2 | 432 |
+-----------+---------------+------------------+--------------+-------------+-------------+
...Steps to Reproduce
- Create an Excel file with numeric data
- Try to aggregate the same column multiple ways:
xl group data.xlsx --by "Category" --aggregate "Sales:sum,Sales:count,Sales:mean"
- Observe "Duplicate column" error
Real-World Impact
Use Case: Comprehensive Circuit Statistics
Current Workflow (4 commands):
# Command 1: Sum
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:sum" --output sum.xlsx
# Command 2: Count
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:count" --output count.xlsx
# Command 3: Mean
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:mean" --output mean.xlsx
# Command 4: Manual merge in Excel
# Open sum.xlsx, count.xlsx, mean.xlsx
# Copy-paste to create combined tableProposed Workflow (1 command):
xl group sales.xlsx --by "Circuit" \
--aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean,Qté VTE:std"Impact:
- 4 commands + manual work → 1 command
- Error-prone copy-paste operations
- Inconsistent results (data might change between commands)
- Time-consuming workflow
Business Use Cases
1. Statistical Summary
xl group sales.xlsx --by "Product" \
--aggregate "Sales:sum,Sales:count,Sales:mean,Sales:median,Sales:std"
# Want complete distribution overview2. Data Quality Check
xl group data.xlsx --by "Category" \
--aggregate "Value:sum,Value:min,Value:max,Value:count"
# Check for anomalies in aggregated data3. Performance Monitoring
xl group transactions.xlsx --by "Store" \
--aggregate "Revenue:sum,Transaction_ID:count,Revenue:mean"
# Total revenue, transaction count, and average order valueProposed Solution
Modify the aggregation parser to allow multiple aggregations of the same column with different functions.
Syntax Options
Option 1: Comma-separated (Current syntax, fix bug)
--aggregate "Sales:sum,Sales:count,Sales:mean"Option 2: Semicolon-separated aggregations
--aggregate "Sales:sum;Sales:count;Sales:mean"Option 3: Multiple --aggregate flags
--aggregate "Sales:sum" --aggregate "Sales:count" --aggregate "Sales:mean"Recommendation: Fix Option 1 (most intuitive, already documented)
Acceptance Criteria
- Can aggregate same column with sum and count
- Can aggregate same column with sum, count, mean, min, max
- Can aggregate same column with all 8 functions (sum, mean, avg, median, min, max, count, std, var)
- Output columns named clearly:
Column_function(e.g.,Sales_sum,Sales_mean) - Works with single group-by column
- Works with multiple group-by columns
- Works with multiple different columns
- Documentation updated with examples
- Error handling: clear error if function name is invalid
Test Cases
# Test 1: Two aggregations same column
xl group test.xlsx --by "Category" --aggregate "Sales:sum,Sales:count"
# Expected: Category | Sales_sum | Sales_count
# Test 2: Three aggregations same column
xl group test.xlsx --by "Product" --aggregate "Qty:sum,Qty:mean,Qty:std"
# Expected: Product | Qty_sum | Qty_mean | Qty_std
# Test 3: Multiple columns with multiple aggregations
xl group test.xlsx --by "Region" \
--aggregate "Sales:sum,Sales:count,Profit:sum,Profit:mean"
# Expected: Region | Sales_sum | Sales_count | Profit_sum | Profit_mean
# Test 4: All aggregation functions
xl group test.xlsx --by "Store" \
--aggregate "Revenue:sum,Revenue:mean,Revenue:median,Revenue:min,Revenue:max,Revenue:std,Revenue:var,Revenue:count"
# Expected: All 8 statistics calculated
# Test 5: Multiple group-by columns
xl group test.xlsx --by "Year,Region" \
--aggregate "Sales:sum,Sales:count"
# Expected: Year | Region | Sales_sum | Sales_countImplementation Notes
Column Naming Strategy
When multiple aggregations are applied, suffix each column with the function name:
# Input: --aggregate "Sales:sum,Sales:count"
# Output columns: Sales_sum, Sales_countAlternative: Use parentheses
# Output columns: Sales(sum), Sales(count)Recommendation: Underscore suffix (cleaner, easier to reference)
Aggregation Function Support
Ensure all documented functions work with multi-aggregation:
| Function | Description | Example |
|---|---|---|
| sum | Sum of values | Sales:sum |
| mean | Average (alias for avg) | Sales:mean |
| avg | Average | Sales:avg |
| median | Median value | Sales:median |
| min | Minimum value | Sales:min |
| max | Maximum value | Sales:max |
| count | Count of non-null values | Sales:count |
| std | Standard deviation | Sales:std |
| var | Variance | Sales:var |
Error Handling
# Invalid function name
xl group test.xlsx --by "Category" --aggregate "Sales:invalid"
# Error: "Unknown aggregation function 'invalid'. Valid functions: sum, mean, avg, median, min, max, count, std, var"
# Invalid column name
xl group test.xlsx --by "Category" --aggregate "NonExistent:sum"
# Error: "Column 'NonExistent' not found in data"Related Issues
- XL-002: No command piping (workaround requires 4 commands)
- XL-003: No sorting in group (complementary feature)
- XL-005: Limit/Top-N in group (complementary feature)
Combined Use Case
When combined with sorting and limiting:
# Top 10 products by sales with full statistics
xl group sales.xlsx --by "Product" \
--aggregate "Sales:sum,Sales:count,Sales:mean" \
--sort desc --limit 10Workarounds
Current Workaround 1: Separate Commands
xl group data.xlsx --by "Category" --aggregate "Sales:sum" --output sum.xlsx
xl group data.xlsx --by "Category" --aggregate "Sales:count" --output count.xlsx
xl group data.xlsx --by "Category" --aggregate "Sales:mean" --output mean.xlsx
# Then merge in Excel/PythonCurrent Workaround 2: Use Python/pandas
import pandas as pd
df = pd.read_excel('data.xlsx')
result = df.groupby('Category')['Sales'].agg(['sum', 'count', 'mean', 'std'])
result.to_excel('output.xlsx')Current Workaround 3: Manual Calculation
xl group data.xlsx --by "Category" --aggregate "Sales:sum"
# Manually calculate mean = sum / count in ExcelBenefits
- Workflow Efficiency: 4 commands → 1 command
- Data Consistency: All statistics from same data snapshot
- Better UX: Matches user expectations from SQL, pandas
- Time Savings: ~5 minutes per analysis
- Standard Behavior: SQL
GROUP BYsupports this, pandas supports this
SQL Equivalence
-- What users want (SQL):
SELECT
Category,
SUM(Sales) as Sales_sum,
COUNT(Sales) as Sales_count,
AVG(Sales) as Sales_mean,
STDDEV(Sales) as Sales_std
FROM data
GROUP BY Category
ORDER BY Sales_sum DESC;xl should match this capability.
Labels
bug feature-request high-priority aggregation group-by statistics