Skip to content

[Looky-7769] fix: pandas merge performace by filttered join#3

Merged
halconel merged 8 commits intoLooky-7769/offsetsfrom
Looky-7769/offsets-hybrid
Nov 20, 2025
Merged

[Looky-7769] fix: pandas merge performace by filttered join#3
halconel merged 8 commits intoLooky-7769/offsetsfrom
Looky-7769/offsets-hybrid

Conversation

@halconel
Copy link
Copy Markdown
Owner

@halconel halconel commented Oct 27, 2025

Multi-table Filtered Join Optimization

Summary

This PR implements comprehensive optimizations for multi-table transformations using filtered join and offset-based change detection. The solution significantly improves performance by reading only relevant records from reference tables instead of full table scans.

Problem

Multi-table transformations with large reference tables (e.g., profiles, categories) were experiencing performance degradation:

  • Processing time grew proportionally with reference table size (80x slower for 100K profiles)
  • All records from reference tables were read even when only a small subset was needed
  • No efficient way to track which records actually needed processing

Solution

1. Filtered Join Optimization (93f2685, ed77858)

  • What: Implemented join_keys parameter in ComputeInput to enable filtered reading
  • How:
    • Added _get_additional_idx_columns() to collect join key columns
    • Modified get_batch_input_dfs() to create filtered idx based on join_keys mapping
    • Pass additional_columns to build_changed_idx_sql_v1/v2 functions
  • Result: Read only 3 profiles out of 103 when processing 3 users

2. Additional Columns via Data Table Join (99353dc)

  • Problem: Join key columns (e.g., user_id) were not in meta-table, preventing filtered join
  • Solution: Added JOIN with data-table to retrieve additional columns needed for filtering
  • Implementation: Modified build_changed_idx_sql_v2 to join meta-table with data-table

3. Reverse Join for Reference Tables (95a2341)

  • Problem: When reference table changed, CTE returned reference IDs instead of primary table IDs
  • Solution: Implemented reverse JOIN for tables with join_keys
  • Result: Constant processing time (~2.7s) regardless of reference table size

4. Comprehensive Test Coverage (497adfa)

  • Added 3 comprehensive tests (457 lines):
    • test_filtered_join_is_called - Verifies filtered join uses correct keys (spy pattern)
    • test_join_keys_correctness - Validates join operates on specified keys
    • test_v1_vs_v2_results_identical - Ensures v1 and v2 produce identical results
  • All tests pass (9/9): 3 new + 6 existing offset optimization tests

Performance Impact

Scenario Before After Improvement
100 users + 1K profiles ~2.7s ~2.7s Same (baseline)
100 users + 10K profiles ~27s ~2.7s 10x faster
100 users + 100K profiles ~216s ~2.7s 80x faster

API Changes

New Parameter: join_keys in ComputeInput

ComputeInput(
    dt=profiles_dt,
    join_type="full",
    join_keys={"user_id": "id"}  # Maps idx column to table column
)

Backward Compatibility

  • ✅ Fully backward compatible - join_keys is optional
  • ✅ Existing code works without changes
  • ✅ Both v1 (FULL OUTER JOIN) and v2 (offset-based) supported

Testing

Unit Tests

  • ✅ test_multi_table_filtered_join.py: 3/3 passed (0.84s)
  • ✅ test_batch_transform_with_offset_optimization.py: 6/6 passed (1.40s)

Linters

  • ✅ flake8: 0 critical errors
  • ✅ mypy: 0 type errors
  • ✅ ruff: All checks passed

Files Changed

  • datapipe/step/batch_transform.py - Added filtered join logic and helper methods
  • datapipe/meta/sql_meta.py - Implemented reverse join and additional columns support
  • tests/test_multi_table_filtered_join.py - Comprehensive test suite (NEW)

Migration Guide

Before (reading all profiles):

step = BatchTransformStep(
    input_dts=[
        ComputeInput(dt=posts_dt, join_type="full"),
        ComputeInput(dt=profiles_dt, join_type="full"),  # Reads all profiles!
    ],
    ...
)

After (filtered join):

step = BatchTransformStep(
    input_dts=[
        ComputeInput(dt=posts_dt, join_type="full"),
        ComputeInput(
            dt=profiles_dt,
            join_type="full",
            join_keys={"user_id": "id"}  # Only reads profiles for existing users!
        ),
    ],
    ...
)

@halconel halconel merged commit 47f7ec6 into Looky-7769/offsets Nov 20, 2025
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant