Skip to content

xl group command rejects multiple aggregations on the same column #6

@AliiiBenn

Description

@AliiiBenn

Description

The xl group command does not allow multiple aggregations on the same column. Users cannot get sum, count, mean, min, max in a single operation and must run separate commands for each statistic.

Current Behavior

$ xl group "sales.xlsx" -s DONNEES --by "Circuit" --aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean"

Error parsing aggregation specifications:
  - Duplicate column 'Qté VTE'
  - Duplicate column 'Qté VTE'

The command fails with a "duplicate column" error when trying to aggregate the same column multiple ways.

Expected Behavior

The command should allow multiple aggregations per column:

$ xl group "sales.xlsx" -s DONNEES --by "Circuit" \
  --aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean,Qté VTE:min,Qté VTE:max"

+-----------+---------------+------------------+--------------+-------------+-------------+
| Circuit   |   Qté VTE_sum |   Qté VTE_count  | Qté VTE_mean | Qté VTE_min  | Qté VTE_max  |
+===========+===============+==================+==============+=============+=============+
| 24H       |          5890 |             1967 |     2.994    |           -2 |         483  |
+-----------+---------------+------------------+--------------+-------------+-------------+
| GP        |         23788 |             4073 |     5.842    |           -2 |         432  |
+-----------+---------------+------------------+--------------+-------------+-------------+
...

Steps to Reproduce

  1. Create an Excel file with numeric data
  2. Try to aggregate the same column multiple ways:
    xl group data.xlsx --by "Category" --aggregate "Sales:sum,Sales:count,Sales:mean"
  3. Observe "Duplicate column" error

Real-World Impact

Use Case: Comprehensive Circuit Statistics

Current Workflow (4 commands):

# Command 1: Sum
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:sum" --output sum.xlsx

# Command 2: Count
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:count" --output count.xlsx

# Command 3: Mean
xl group sales.xlsx --by "Circuit" --aggregate "Qté VTE:mean" --output mean.xlsx

# Command 4: Manual merge in Excel
# Open sum.xlsx, count.xlsx, mean.xlsx
# Copy-paste to create combined table

Proposed Workflow (1 command):

xl group sales.xlsx --by "Circuit" \
  --aggregate "Qté VTE:sum,Qté VTE:count,Qté VTE:mean,Qté VTE:std"

Impact:

  • 4 commands + manual work → 1 command
  • Error-prone copy-paste operations
  • Inconsistent results (data might change between commands)
  • Time-consuming workflow

Business Use Cases

1. Statistical Summary

xl group sales.xlsx --by "Product" \
  --aggregate "Sales:sum,Sales:count,Sales:mean,Sales:median,Sales:std"
# Want complete distribution overview

2. Data Quality Check

xl group data.xlsx --by "Category" \
  --aggregate "Value:sum,Value:min,Value:max,Value:count"
# Check for anomalies in aggregated data

3. Performance Monitoring

xl group transactions.xlsx --by "Store" \
  --aggregate "Revenue:sum,Transaction_ID:count,Revenue:mean"
# Total revenue, transaction count, and average order value

Proposed Solution

Modify the aggregation parser to allow multiple aggregations of the same column with different functions.

Syntax Options

Option 1: Comma-separated (Current syntax, fix bug)

--aggregate "Sales:sum,Sales:count,Sales:mean"

Option 2: Semicolon-separated aggregations

--aggregate "Sales:sum;Sales:count;Sales:mean"

Option 3: Multiple --aggregate flags

--aggregate "Sales:sum" --aggregate "Sales:count" --aggregate "Sales:mean"

Recommendation: Fix Option 1 (most intuitive, already documented)

Acceptance Criteria

  • Can aggregate same column with sum and count
  • Can aggregate same column with sum, count, mean, min, max
  • Can aggregate same column with all 8 functions (sum, mean, avg, median, min, max, count, std, var)
  • Output columns named clearly: Column_function (e.g., Sales_sum, Sales_mean)
  • Works with single group-by column
  • Works with multiple group-by columns
  • Works with multiple different columns
  • Documentation updated with examples
  • Error handling: clear error if function name is invalid

Test Cases

# Test 1: Two aggregations same column
xl group test.xlsx --by "Category" --aggregate "Sales:sum,Sales:count"
# Expected: Category | Sales_sum | Sales_count

# Test 2: Three aggregations same column
xl group test.xlsx --by "Product" --aggregate "Qty:sum,Qty:mean,Qty:std"
# Expected: Product | Qty_sum | Qty_mean | Qty_std

# Test 3: Multiple columns with multiple aggregations
xl group test.xlsx --by "Region" \
  --aggregate "Sales:sum,Sales:count,Profit:sum,Profit:mean"
# Expected: Region | Sales_sum | Sales_count | Profit_sum | Profit_mean

# Test 4: All aggregation functions
xl group test.xlsx --by "Store" \
  --aggregate "Revenue:sum,Revenue:mean,Revenue:median,Revenue:min,Revenue:max,Revenue:std,Revenue:var,Revenue:count"
# Expected: All 8 statistics calculated

# Test 5: Multiple group-by columns
xl group test.xlsx --by "Year,Region" \
  --aggregate "Sales:sum,Sales:count"
# Expected: Year | Region | Sales_sum | Sales_count

Implementation Notes

Column Naming Strategy

When multiple aggregations are applied, suffix each column with the function name:

# Input: --aggregate "Sales:sum,Sales:count"
# Output columns: Sales_sum, Sales_count

Alternative: Use parentheses

# Output columns: Sales(sum), Sales(count)

Recommendation: Underscore suffix (cleaner, easier to reference)

Aggregation Function Support

Ensure all documented functions work with multi-aggregation:

Function Description Example
sum Sum of values Sales:sum
mean Average (alias for avg) Sales:mean
avg Average Sales:avg
median Median value Sales:median
min Minimum value Sales:min
max Maximum value Sales:max
count Count of non-null values Sales:count
std Standard deviation Sales:std
var Variance Sales:var

Error Handling

# Invalid function name
xl group test.xlsx --by "Category" --aggregate "Sales:invalid"
# Error: "Unknown aggregation function 'invalid'. Valid functions: sum, mean, avg, median, min, max, count, std, var"

# Invalid column name
xl group test.xlsx --by "Category" --aggregate "NonExistent:sum"
# Error: "Column 'NonExistent' not found in data"

Related Issues

  • XL-002: No command piping (workaround requires 4 commands)
  • XL-003: No sorting in group (complementary feature)
  • XL-005: Limit/Top-N in group (complementary feature)

Combined Use Case

When combined with sorting and limiting:

# Top 10 products by sales with full statistics
xl group sales.xlsx --by "Product" \
  --aggregate "Sales:sum,Sales:count,Sales:mean" \
  --sort desc --limit 10

Workarounds

Current Workaround 1: Separate Commands

xl group data.xlsx --by "Category" --aggregate "Sales:sum" --output sum.xlsx
xl group data.xlsx --by "Category" --aggregate "Sales:count" --output count.xlsx
xl group data.xlsx --by "Category" --aggregate "Sales:mean" --output mean.xlsx
# Then merge in Excel/Python

Current Workaround 2: Use Python/pandas

import pandas as pd
df = pd.read_excel('data.xlsx')
result = df.groupby('Category')['Sales'].agg(['sum', 'count', 'mean', 'std'])
result.to_excel('output.xlsx')

Current Workaround 3: Manual Calculation

xl group data.xlsx --by "Category" --aggregate "Sales:sum"
# Manually calculate mean = sum / count in Excel

Benefits

  1. Workflow Efficiency: 4 commands → 1 command
  2. Data Consistency: All statistics from same data snapshot
  3. Better UX: Matches user expectations from SQL, pandas
  4. Time Savings: ~5 minutes per analysis
  5. Standard Behavior: SQL GROUP BY supports this, pandas supports this

SQL Equivalence

-- What users want (SQL):
SELECT
  Category,
  SUM(Sales) as Sales_sum,
  COUNT(Sales) as Sales_count,
  AVG(Sales) as Sales_mean,
  STDDEV(Sales) as Sales_std
FROM data
GROUP BY Category
ORDER BY Sales_sum DESC;

xl should match this capability.

Labels

bug feature-request high-priority aggregation group-by statistics

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghigh-priorityHigh priority issues

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions