Skip to content

Simplify index merging algorithm #5526

@leventov

Description

@leventov

Current process seems to be overcomplicated.

The current steps:

  1. Produce per-index bitmaps, using unsorted per-index encoding and per-index row numbers: IncrementalIndexAdapter.processRows() (called from IncrementalIndexAdapter constructor)
  2. Sort per-index dictionary. See StringDimensionIndexer.SortedDimensionDictionary: it has three O(N)-sized structures: for
    a) sorted-to-unsorted encoding conversion,
    b) unsorted-to-sorted encoding conversion,
    c) values in the sorted order itself (for iteration later; however it's materialization is not needed for sure, the list from DimensionDictionary and sorted-to-unsorted encoding could be combined)
    Additionally, when preparing those three structures, another O(N) structure is temporarily created (a TreeMap)
  3. When merging per-index sorted dictionaries, O(N) per-index-sorted-to-global-sorted encoding conversion structures are created: see IndexMerger.DictionaryMergeIterator
  4. Before global merge of rows, encoded row values are converted twice: firstly, per-index-unsorted-to-sorted, using the structure from 2.a), and secondly, per-index-sorted-to-globally-sorted, using the structure from 3).
  5. When merging rows of indexes into a global row stream, per-index-row-number-to-global-row-number conversion structures are created: see IndexMergerV9. mergeIndexesAndWriteColumns().
  6. After writing out the global stream of rows, per-index bitmaps are merged. This merge requires the structures from 2.b) and 5). See StringDimensionMergerV9.mergeBitmaps().

I don't see why it couldn't be simplified to the following:

  1. Create a sorted-to-unsorted encoding conversion. It could be done by creating int[], fill it with 0, 1, .., and then sort it with strategy, delegating comparison to the DimensionDictionary's array of String values. I. e. one O(N) structure is created on this step.
  2. When merging per-index sorted dictionaries (as explained above, it could be done by iterating the structure, created on the previous step, and composing it with DimensionDictionary's array), create per-index-unsorted-to-global-sorted encoding conversion: O(N) structure.
  3. Before global merge of rows, convert encoded row values once, using the structure, created on the previous step.
  4. While merging and writing out a global row stream, global bitmaps are created as we go.

The main question is why we need to create per-index bitmaps in the very beginning, then create a lot of auxiliary structures to keep them usable until the very end, and them merge. If we could(?) just create bitmaps in the end.

@jon-wei @gianm @jihoonson

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions