Current process seems to be overcomplicated.
The current steps:
- Produce per-index bitmaps, using unsorted per-index encoding and per-index row numbers:
IncrementalIndexAdapter.processRows() (called from IncrementalIndexAdapter constructor)
- Sort per-index dictionary. See
StringDimensionIndexer.SortedDimensionDictionary: it has three O(N)-sized structures: for
a) sorted-to-unsorted encoding conversion,
b) unsorted-to-sorted encoding conversion,
c) values in the sorted order itself (for iteration later; however it's materialization is not needed for sure, the list from DimensionDictionary and sorted-to-unsorted encoding could be combined)
Additionally, when preparing those three structures, another O(N) structure is temporarily created (a TreeMap)
- When merging per-index sorted dictionaries, O(N) per-index-sorted-to-global-sorted encoding conversion structures are created: see
IndexMerger.DictionaryMergeIterator
- Before global merge of rows, encoded row values are converted twice: firstly, per-index-unsorted-to-sorted, using the structure from 2.a), and secondly, per-index-sorted-to-globally-sorted, using the structure from 3).
- When merging rows of indexes into a global row stream, per-index-row-number-to-global-row-number conversion structures are created: see
IndexMergerV9. mergeIndexesAndWriteColumns().
- After writing out the global stream of rows, per-index bitmaps are merged. This merge requires the structures from 2.b) and 5). See
StringDimensionMergerV9.mergeBitmaps().
I don't see why it couldn't be simplified to the following:
- Create a sorted-to-unsorted encoding conversion. It could be done by creating
int[], fill it with 0, 1, .., and then sort it with strategy, delegating comparison to the DimensionDictionary's array of String values. I. e. one O(N) structure is created on this step.
- When merging per-index sorted dictionaries (as explained above, it could be done by iterating the structure, created on the previous step, and composing it with
DimensionDictionary's array), create per-index-unsorted-to-global-sorted encoding conversion: O(N) structure.
- Before global merge of rows, convert encoded row values once, using the structure, created on the previous step.
- While merging and writing out a global row stream, global bitmaps are created as we go.
The main question is why we need to create per-index bitmaps in the very beginning, then create a lot of auxiliary structures to keep them usable until the very end, and them merge. If we could(?) just create bitmaps in the end.
@jon-wei @gianm @jihoonson
Current process seems to be overcomplicated.
The current steps:
IncrementalIndexAdapter.processRows()(called fromIncrementalIndexAdapterconstructor)StringDimensionIndexer.SortedDimensionDictionary: it has three O(N)-sized structures: fora) sorted-to-unsorted encoding conversion,
b) unsorted-to-sorted encoding conversion,
c) values in the sorted order itself (for iteration later; however it's materialization is not needed for sure, the list from
DimensionDictionaryand sorted-to-unsorted encoding could be combined)Additionally, when preparing those three structures, another O(N) structure is temporarily created (a TreeMap)
IndexMerger.DictionaryMergeIteratorIndexMergerV9. mergeIndexesAndWriteColumns().StringDimensionMergerV9.mergeBitmaps().I don't see why it couldn't be simplified to the following:
int[], fill it with 0, 1, .., and then sort it with strategy, delegating comparison to theDimensionDictionary's array of String values. I. e. one O(N) structure is created on this step.DimensionDictionary's array), create per-index-unsorted-to-global-sorted encoding conversion: O(N) structure.The main question is why we need to create per-index bitmaps in the very beginning, then create a lot of auxiliary structures to keep them usable until the very end, and them merge. If we could(?) just create bitmaps in the end.
@jon-wei @gianm @jihoonson