@nastra
Contributor

@nastra nastra commented Jun 28, 2021

No description provided.

@nastra nastra marked this pull request as draft June 28, 2021 09:45
@github-actions github-actions bot added the arrow label Jun 28, 2021
@nastra nastra force-pushed the arrow-refactoring branch 2 times, most recently from ed4ca7e to 0addb2c Compare June 28, 2021 16:16
@github-actions github-actions bot added the spark label Jun 28, 2021
@nastra nastra force-pushed the arrow-refactoring branch from 0addb2c to 0652940 Compare June 28, 2021 17:05
@nastra nastra marked this pull request as ready for review June 29, 2021 05:02
@nastra
Contributor Author

nastra commented Jun 29, 2021

@rymurr when reviewing, it is probably easier to look at the modified files directly than to read the diff.

@nastra nastra force-pushed the arrow-refactoring branch from 0652940 to 4ddd3fa Compare June 29, 2021 08:03
@nastra
Contributor Author

nastra commented Jun 29, 2021

Results on branch master

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.933 ± 0.171   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.548 ± 0.040   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.692 ± 0.245   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.726 ± 0.163   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.564 ± 0.074   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.419 ± 0.094   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.466 ± 0.128   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.274 ± 0.033   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.385 ± 0.171   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.437 ± 0.125   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.490 ± 0.115   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.654 ± 0.182   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.884 ± 0.381   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.366 ± 0.439   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.729 ± 0.150   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.896 ± 0.174   s/op

Results on branch arrow-refactoring

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.611 ± 0.035   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.496 ± 0.072   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.461 ± 0.251   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  8.250 ± 0.083   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.872 ± 0.050   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.781 ± 0.129   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.752 ± 0.216   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.261 ± 0.101   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.743 ± 0.123   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.652 ± 0.221   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.754 ± 0.818   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.682 ± 0.136   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.611 ± 0.066   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  3.901 ± 0.143   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.661 ± 0.062   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.572 ± 0.056   s/op

Detailed results are attached below:
master_detailed_results_new.txt
refactoring_detailed_results_new.txt
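
One way to read the two tables side by side is to compute the relative change per benchmark and check whether the score ± error ranges overlap. The sketch below does that for three of the Iceberg rows above. The class and method names are made up for illustration, and treating the Error column as a plain symmetric interval is a simplifying assumption; with only 5 single-shot measurements per benchmark this is a rough read, not a statistical test.

```java
public class BenchmarkDelta {

    /** Relative change of the branch score versus master; negative means faster (lower s/op). */
    static double relativeChange(double masterScore, double branchScore) {
        return (branchScore - masterScore) / masterScore;
    }

    /** True when [score - error, score + error] of the two runs overlap (simplifying assumption). */
    static boolean intervalsOverlap(double score1, double error1, double score2, double error2) {
        return (score1 - error1) <= (score2 + error2)
            && (score2 - error2) <= (score1 + error1);
    }

    static void report(String name, double mScore, double mErr, double bScore, double bErr) {
        System.out.printf("%-32s change=%+.1f%% overlap=%b%n",
            name, 100 * relativeChange(mScore, bScore),
            intervalsOverlap(mScore, mErr, bScore, bErr));
    }

    public static void main(String[] args) {
        // Values copied from the master and arrow-refactoring tables above.
        report("readDatesIcebergVectorized5k", 1.933, 0.171, 1.611, 0.035);
        report("readDoublesIcebergVectorized5k", 2.564, 0.074, 2.872, 0.050);
        report("readLongsIcebergVectorized5k", 2.490, 0.115, 2.754, 0.818);
    }
}
```

Rows whose ranges overlap (readLongsIcebergVectorized5k, with its ± 0.818 error on the branch) are probably best read as noise rather than as a real difference.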

@nastra nastra force-pushed the arrow-refactoring branch from 4ddd3fa to 5ebdd4c Compare July 2, 2021 07:45
Contributor

@rymurr rymurr left a comment

One minor comment, @nastra. Looks great though; the vectorised code is a lot easier to read! I love deleting code!

class DictionaryIdReader extends BaseDictEncodedReader {
  @Override
  protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
    ((IntVector) vector).set(idx, currentVal);
Contributor

Any reason this is cast to IntVector while the others manipulate the data buffer directly?

Contributor Author

@nastra nastra Jul 8, 2021

Dictionary-encoded vectors are always represented as IntVector, but because BaseDictEncodedReader works with the more generic FieldVector, we need to cast to IntVector here. FWIW, the original code did the same cast, for example:

vectorizedColumnIterator.nextBatchDictionaryIds((IntVector) vec, nullabilityHolder);
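
To illustrate the shape of that cast, here is a minimal, self-contained sketch. FieldVector and IntVector below are tiny stand-ins written for this example, not the real Arrow classes, and BaseReader / IdReader are hypothetical simplifications of BaseDictEncodedReader / DictionaryIdReader (the real nextVal also takes a Dictionary and a typeWidth). The point is only that the shared loop sees the generic type, so the subclass that knows it is writing dictionary ids has to downcast.

```java
public class DictCastSketch {

    // Stand-in for the generic vector type the base reader works with.
    interface FieldVector {}

    // Stand-in for an int-backed vector that holds dictionary ids.
    static final class IntVector implements FieldVector {
        private final int[] values;

        IntVector(int capacity) {
            this.values = new int[capacity];
        }

        void set(int idx, int value) {
            values[idx] = value;
        }

        int get(int idx) {
            return values[idx];
        }
    }

    // Hypothetical base reader: the loop is shared, the per-value write is left to subclasses.
    abstract static class BaseReader {
        void readBatch(FieldVector vector, int[] encodedIds) {
            for (int i = 0; i < encodedIds.length; i++) {
                nextVal(vector, i, encodedIds[i]);
            }
        }

        protected abstract void nextVal(FieldVector vector, int idx, int currentVal);
    }

    // Writes the dictionary id itself, so it needs the int-specific setter and therefore the cast.
    static final class IdReader extends BaseReader {
        @Override
        protected void nextVal(FieldVector vector, int idx, int currentVal) {
            ((IntVector) vector).set(idx, currentVal);
        }
    }

    public static void main(String[] args) {
        IntVector ids = new IntVector(3);
        new IdReader().readBatch(ids, new int[] {7, 0, 2});
        System.out.println(ids.get(0) + " " + ids.get(1) + " " + ids.get(2));
    }
}
```

In this sketch, passing any other FieldVector implementation to IdReader would fail with a ClassCastException, so the cast relies on the caller always handing it an int-backed vector.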

Contributor

@rymurr rymurr left a comment

Looks awesome @nastra

@rymurr
Contributor

rymurr commented Jul 9, 2021

@rdblue: @nastra suggested merging this w/o squashing as each commit atomically refactors a single class. WDYT?

@rdblue
Contributor

rdblue commented Jul 9, 2021

I'm fine either way. Whatever you'd like to do.

@nastra nastra force-pushed the arrow-refactoring branch from 5ebdd4c to 6c12d43 Compare July 12, 2021 08:39
@rymurr
Contributor

rymurr commented Jul 12, 2021

Note: rebasing to keep each atomic class refactor as a separate (revertible) commit.

@rymurr rymurr merged commit 8058ec1 into apache:master Jul 12, 2021
@nastra nastra deleted the arrow-refactoring branch July 12, 2021 10:49