Remove "granularity" from IngestSegmentFirehose. #4110
It wasn't doing anything useful (the sequences were being concatenated, and cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE. Changing it to Granularities.ALL gave me a 700x+ performance boost on a small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding making a lot of unnecessary column selectors.
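To illustrate why the granularity had no effect on the output, here is a minimal sketch in plain Java. It does not use Druid's classes: `GranularitySketch`, `bucket`, and `readAll` are hypothetical names, and a bucket width of 1 ms stands in for Granularities.NONE while a width of 0 stands in for Granularities.ALL. The point is only that a finer granularity produces more buckets (and so more per-bucket setup) while the concatenated rows stay the same.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GranularitySketch {
    // Groups row timestamps into time buckets. bucketMillis <= 0 models an
    // "all" granularity (one bucket); bucketMillis == 1 models "none"
    // (one bucket per distinct millisecond).
    static Map<Long, List<Long>> bucket(List<Long> timestamps, long bucketMillis) {
        Map<Long, List<Long>> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            long key = bucketMillis <= 0 ? 0L : Math.floorDiv(ts, bucketMillis) * bucketMillis;
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }

    // Concatenates the buckets back into one list of rows. The bucket key
    // (the analogue of the cursor time) is never read, so only the number
    // of buckets differs between granularities, not the rows.
    static List<Long> readAll(Map<Long, List<Long>> buckets) {
        List<Long> rows = new ArrayList<>();
        for (List<Long> bucketRows : buckets.values()) {
            rows.addAll(bucketRows);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Long> timestamps = List.of(1000L, 1001L, 1002L, 2000L, 2000L);
        Map<Long, List<Long>> fine = bucket(timestamps, 1);
        Map<Long, List<Long>> all = bucket(timestamps, 0);
        System.out.println("buckets at 1 ms: " + fine.size());
        System.out.println("buckets at all: " + all.size());
        System.out.println("same rows: " + readAll(fine).equals(readAll(all)));
    }
}
```

In this toy model the per-bucket work is trivial. In the real firehose each bucket corresponds to a cursor with its own column selectors, which is where the cost described above comes from.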
@@ -77,7 +76,7 @@ public Sequence<InputRow> apply(WindowedStorageAdapter adapter)
Filters.toFilter(dimFilter),
Maybe could be further simplified by not calling concat() a few lines above.
How else would this be turned into a Sequence<InputRow> rather than Sequence<Sequence<InputRow>>?
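For readers unfamiliar with the shape problem in this exchange, here is a rough analogue using java.util.stream rather than Druid's Sequence API. `FlattenSketch` and `flatten` are hypothetical names. Each inner list stands in for the rows from one cursor, and `flatMap` plays the role that concat() plays in the firehose.

```java
import java.util.List;
import java.util.stream.Collectors;

final class FlattenSketch {
    // Each inner list models the rows produced by one cursor. Mapping a
    // cursor to its rows yields a list of lists; flatMap joins them into a
    // single list while preserving order.
    static List<String> flatten(List<List<String>> perCursorRows) {
        return perCursorRows.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> perCursorRows = List.of(
                List.of("row1", "row2"),
                List.of("row3"));
        System.out.println(flatten(perCursorRows));
    }
}
```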
@gianm it appears that this breaks re-indexing, which expects IngestSegmentFirehose to give individual rows from the segment without any truncation.

@himanshug What specifically has broken? IIRC the rows still do have their original, unchanged timestamps -- the only difference is that the cursor timestamps are truncated. But reindexing shouldn't be using the cursor timestamps anyway.

ok, I assumed

IIRC what happens is the
@@ -171,7 +158,6 @@ public DatasourceIngestionSpec withQueryGranularity(Granularity granularity)
intervals,
Now this method is effectively just "clone"; the method argument is unused.