
KAFKA-13629: Use faster algorithm for ByteUtils sizeOfXxx algorithm#11721

Merged
ijuma merged 9 commits into apache:trunk from jasonk000:kafka-13629-byteutils
Feb 6, 2022

Conversation

@jasonk000
Contributor

@jasonk000 jasonk000 commented Jan 30, 2022

Replace loop with a branch-free implementation.

Include:

  • Unit tests that include both the old and new code and run through several ints/longs.
  • JMH benchmark that compares old vs new performance of the algorithm.

JMH results with JDK 17.0.2 and compiler blackhole mode show the new implementation
is 2.8-3.4x faster. In a real application, a 6% reduction in CPU cycles was
observed in the send() path via flamegraphs.

ByteUtilsBenchmark.testSizeOfUnsignedVarint                            thrpt    4  1472440.102 ±  67331.797  ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarint:·async                     thrpt               NaN                  ---
ByteUtilsBenchmark.testSizeOfUnsignedVarint:·gc.alloc.rate             thrpt    4       ≈ 10⁻⁴               MB/sec
ByteUtilsBenchmark.testSizeOfUnsignedVarint:·gc.alloc.rate.norm        thrpt    4       ≈ 10⁻⁷                 B/op
ByteUtilsBenchmark.testSizeOfUnsignedVarint:·gc.count                  thrpt    4          ≈ 0               counts
ByteUtilsBenchmark.testSizeOfUnsignedVarintSimple                      thrpt    4   521333.117 ± 595169.618  ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintSimple:·async               thrpt               NaN                  ---
ByteUtilsBenchmark.testSizeOfUnsignedVarintSimple:·gc.alloc.rate       thrpt    4       ≈ 10⁻⁴               MB/sec
ByteUtilsBenchmark.testSizeOfUnsignedVarintSimple:·gc.alloc.rate.norm  thrpt    4       ≈ 10⁻⁶                 B/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintSimple:·gc.count            thrpt    4          ≈ 0               counts
ByteUtilsBenchmark.testSizeOfVarlong                                   thrpt    4  1106519.633 ±  16556.502  ops/ms
ByteUtilsBenchmark.testSizeOfVarlong:·async                            thrpt               NaN                  ---
ByteUtilsBenchmark.testSizeOfVarlong:·gc.alloc.rate                    thrpt    4       ≈ 10⁻⁴               MB/sec
ByteUtilsBenchmark.testSizeOfVarlong:·gc.alloc.rate.norm               thrpt    4       ≈ 10⁻⁶                 B/op
ByteUtilsBenchmark.testSizeOfVarlong:·gc.count                         thrpt    4          ≈ 0               counts
ByteUtilsBenchmark.testSizeOfVarlongSimple                             thrpt    4   324435.607 ± 147754.813  ops/ms
ByteUtilsBenchmark.testSizeOfVarlongSimple:·async                      thrpt               NaN                  ---
ByteUtilsBenchmark.testSizeOfVarlongSimple:·gc.alloc.rate              thrpt    4       ≈ 10⁻⁴               MB/sec
ByteUtilsBenchmark.testSizeOfVarlongSimple:·gc.alloc.rate.norm         thrpt    4       ≈ 10⁻⁶                 B/op
ByteUtilsBenchmark.testSizeOfVarlongSimple:·gc.count                   thrpt    4          ≈ 0               counts

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma
Member

ijuma commented Jan 31, 2022

Thanks for the PR. Can you please share the results for the micro-benchmark as well as any improvement you observed in real workloads?

@jasonk000
Contributor Author

Overall, this results in ~6% reduction in CPU cycles in the send() path.

Before flamegraph (see highlighted section), starting from send():
[flamegraph image]

Before flamegraph (see highlighted), zoomed in to tryAppend():
[flamegraph image]

After flamegraph (highlight turned on, but not showing up in any samples):
[flamegraph image]

For the provided JMH microbenchmark, the new approach is ~3.4x better.

Benchmark                                                  Mode  Cnt    Score   Error   Units
ByteUtilsBenchmark.testSizeOfUnsignedVarint               thrpt    5  295.128 ± 7.481  ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginal       thrpt    5   87.954 ± 2.671  ops/ms

Comment thread clients/src/main/java/org/apache/kafka/common/utils/ByteUtils.java

}
return bytes;
int leadingZeros = Integer.numberOfLeadingZeros(value);
return LEADING_ZEROS_TO_U_VARINT_SIZE[leadingZeros];
Contributor

I wonder if doing

  return (38 - leadingZeros) / 7 + leadingZeros / 32;

would be faster than the lookup.

Contributor Author

@jasonk000 jasonk000 Feb 2, 2022

  • updated with a bit of extra investigation

Good suggestion; I added a benchmark in 87d4b19. It looks like the div, even when compiled down to shift operations, is not as quick.

Digging a little further, I discovered that the length of the bitshift asm results in less loop unrolling, so I created a separate test that exercises only a single element in bc1afab. Here the benefit does not seem as great, but when I investigate the perfasm results, the lookup-based approach spends most of its time in the blackhole and only a fraction in lzcnt/cmp/mov.

The results are here:

Benchmark                                                     Mode  Cnt       Score      Error   Units
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne           thrpt   10  424743.573 ± 3404.300  ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne               thrpt   10  602155.871 ± 3100.966  ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne       thrpt   10  451803.982 ±  647.814  ops/ms

Looking at -prof perfasm, the math version does use lzcnt and does compile down to bitshifts; they just turn out relatively expensive. (This is on a 5950X.)

The key to note is that the math-based one requires an lzcnt followed by some integer math and bitshifts, whilst the lookup requires an lzcnt and only a cmp/mov.
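
To make the comparison concrete, here is a minimal, self-contained sketch of the three variants under discussion (class and method names are illustrative, not the actual ByteUtils code):

```java
public class VarintSizeSketch {
    // Baseline: the original loop, one iteration per 7-bit group.
    static int sizeLoop(int value) {
        int bytes = 1;
        while ((value & 0xffffff80) != 0) {
            bytes += 1;
            value >>>= 7;
        }
        return bytes;
    }

    // Math-based: lz / 32 contributes the +1 needed when value == 0 (lz == 32).
    static int sizeMath(int value) {
        int lz = Integer.numberOfLeadingZeros(value);
        return (38 - lz) / 7 + lz / 32;
    }

    // Lookup-based: one lzcnt plus a single table load (cmp/mov).
    static final int[] LEADING_ZEROS_TO_U_VARINT_SIZE = new int[33];
    static {
        for (int lz = 0; lz <= 32; lz++) {
            // A value with exactly lz leading zeros; lz == 32 means value == 0.
            int v = (lz == 32) ? 0 : 0x80000000 >>> lz;
            LEADING_ZEROS_TO_U_VARINT_SIZE[lz] = sizeLoop(v);
        }
    }

    static int sizeLookup(int value) {
        return LEADING_ZEROS_TO_U_VARINT_SIZE[Integer.numberOfLeadingZeros(value)];
    }
}
```

All three agree for every int, producing the 1-5 byte unsigned-varint length.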

Contributor

Interesting. It's unfortunate that the second operation is a div in Java (I'm new to Java, so maybe there is a way to express it differently, but I couldn't think of one), in C++ I would've just done:

   return (38 - leadingZeros) / 7 + (leadingZeros == 32);

which would just do a cmp and add the carry bit, which is cheaper than a div. Maybe the compiler would be smart enough to translate ((leadingZeros == 32) ? 1 : 0) into a math expression rather than a branch?

In these benchmarks, the lookup table is likely cached in L1 (a hot loop hitting the same small amount of data), so memory access in the benchmark is likely cheaper than on average. It's probably hard to build a proper model and I'm not sure it's worth it.
In any case, thanks for doing this comprehensive research, good stuff!
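
For what it's worth, the carry trick described in the comment above has a branch-free Java spelling: since leadingZeros is in [0, 32], leadingZeros / 32 equals leadingZeros >>> 5, a single shift. A hypothetical sketch:

```java
public class CarryTrickSketch {
    // lz >>> 5 is 1 only when lz == 32 (i.e. value == 0), so it plays the
    // role of the (leadingZeros == 32) carry without a branch.
    static int sizeOfUnsignedVarint(int value) {
        int lz = Integer.numberOfLeadingZeros(value);
        return (38 - lz) / 7 + (lz >>> 5);
    }
}
```

This replaces the second division with a shift, though the first division by 7 remains.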

Contributor Author

@jasonk000 jasonk000 Feb 2, 2022

Yes, I agree - the arithmetic / bitshift looks intuitively quicker since it requires no loads. Looking at the compiler output from JDK11, it does get compiled down to a sequence of operations that are reasonably clean (no DIV), just that there are a lot of instructions. Perhaps there is a slight advantage in different use-cases, but, in my case this fix takes the function completely out of the profiler results so it solves the problem.

For interest's sake, here is the sequence of operations according to perfasm on JDK 11.

         ││  0x00007f26e04797b7: mov     0x10(%r12,%r10,8),%r10d
  0.05%  ││  0x00007f26e04797bc: lzcnt   %r10d,%r11d       ;*invokestatic numberOfLeadingZeros {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - org.apache.kafka.jmh.util.ByteUtilsBenchmark::testSizeOfUnsignedVarintMathOne@6 (line 76)
         ││                                                ; - org.apache.kafka.jmh.util.jmh_generated.ByteUtilsBenchmark_testSizeOfUnsignedVarintMathOne_jmhTest::testSizeOfUnsignedVarintMathOne_thrpt_jmhStub@17 (line 119)
  0.03%  ││  0x00007f26e04797c1: mov     $0x26,%r8d
  0.01%  ││  0x00007f26e04797c7: sub     %r11d,%r8d        ;*isub {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - org.apache.kafka.jmh.util.ByteUtilsBenchmark::testSizeOfUnsignedVarintMathOne@13 (line 77)
         ││                                                ; - org.apache.kafka.jmh.util.jmh_generated.ByteUtilsBenchmark_testSizeOfUnsignedVarintMathOne_jmhTest::testSizeOfUnsignedVarintMathOne_thrpt_jmhStub@17 (line 119)
  0.49%  ││  0x00007f26e04797ca: mov     %r11d,%r9d
  0.36%  ││  0x00007f26e04797cd: sar     $0x1f,%r9d
  0.02%  ││  0x00007f26e04797d1: movsxd  %r8d,%r10
  0.77%  ││  0x00007f26e04797d4: shr     $0x1b,%r9d
  7.24%  ││  0x00007f26e04797d8: add     %r11d,%r9d
  0.78%  ││  0x00007f26e04797db: imulq   $0x92492493,%r10,%r10
  0.05%  ││  0x00007f26e04797e2: sar     $0x5,%r9d
  2.24%  ││  0x00007f26e04797e6: sar     $0x20,%r10
  9.93%  ││  0x00007f26e04797ea: movsxd  %r9d,%r11
  0.12%  ││  0x00007f26e04797ed: mov     %r10d,%r10d
         ││  0x00007f26e04797f0: add     %r8d,%r10d
  6.30%  ││  0x00007f26e04797f3: sar     $0x1f,%r8d
  2.53%  ││  0x00007f26e04797f7: sar     $0x2,%r10d
  6.21%  ││  0x00007f26e04797fb: movsxd  %r8d,%r8
         ││  0x00007f26e04797fe: movsxd  %r10d,%rdx
  7.25%  ││  0x00007f26e0479801: sub     %r8,%rdx
  7.01%  ││  0x00007f26e0479804: add     %r11,%rdx         ;*i2l {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - org.apache.kafka.jmh.util.ByteUtilsBenchmark::testSizeOfUnsignedVarintMathOne@22 (line 77)
         ││                                                ; - org.apache.kafka.jmh.util.jmh_generated.ByteUtilsBenchmark_testSizeOfUnsignedVarintMathOne_jmhTest::testSizeOfUnsignedVarintMathOne_thrpt_jmhStub@17 (line 119)
  8.11%  ││  0x00007f26e0479807: mov     (%rsp),%rsi

Lookup table:

         │  0x00007fde904792a7: mov     0x10(%r12,%r10,8),%r10d
  1.42%  │  0x00007fde904792ac: lzcnt   %r10d,%r10d       ;*invokestatic numberOfLeadingZeros {reexecute=0 rethrow=0 return_oop=0}
         │                                                ; - org.apache.kafka.common.utils.ByteUtils::sizeOfUnsignedVarint@1 (line 420)
         │                                                ; - org.apache.kafka.jmh.util.ByteUtilsBenchmark::testSizeOfUnsignedVarintOne@6 (line 61)
         │                                                ; - org.apache.kafka.jmh.util.jmh_generated.ByteUtilsBenchmark_testSizeOfUnsignedVarintOne_jmhTest::testSizeOfUnsignedVarintOne_thrpt_jmhStub@17 (line 119)
  1.26%  │  0x00007fde904792b1: cmp     $0x21,%r10d
  2.65%  │  0x00007fde904792b5: jnb     0x7fde90479332
  0.02%  │  0x00007fde904792b7: mov     %r8,0x40(%rsp)
  0.02%  │  0x00007fde904792bc: movabs  $0xfe434bf0,%r11  ;   {oop([I{0x00000000fe434bf0})}
  1.68%  │  0x00007fde904792c6: movsxd  0x10(%r11,%r10,4),%rdx  ;*i2l {reexecute=0 rethrow=0 return_oop=0}
         │                                                ; - org.apache.kafka.jmh.util.ByteUtilsBenchmark::testSizeOfUnsignedVarintOne@9 (line 61)
         │                                                ; - org.apache.kafka.jmh.util.jmh_generated.ByteUtilsBenchmark_testSizeOfUnsignedVarintOne_jmhTest::testSizeOfUnsignedVarintOne_thrpt_jmhStub@17 (line 119)
 37.64%  │  0x00007fde904792cb: mov     (%rsp),%rsi

It may be that your version is faster when dropped into a full application. On the other hand, a single load dependency is something that can be easily predicted and pipelined...

Perhaps there are a more direct series of shift operations that would be better suited?

Contributor

That does seem like a lot of instructions; some of them don't seem to be needed in this case, where we deal with small unsigned integers (e.g. movsxd %r8d,%r8 extends the sign to 64 bits, but we don't need it). Java doesn't seem to support unsigned types, so I'm not sure if there is a way to hint to the compiler that we don't need instructions that only matter for signed arithmetic. Maybe using long would help, at least to eliminate the need for movsxd.

Contributor Author

@artemlivshits I've updated the implementation in f040109.

Contributor

Nice!

@twmb twmb Feb 7, 2022

Sorry to chime in late here, but I'm wondering if a lookup table may still be faster. You can see these two instructions in your lookup table implementation above,

  1.26%  │  0x00007fde904792b1: cmp     $0x21,%r10d
  2.65%  │  0x00007fde904792b5: jnb     0x7fde90479332

this is checking whether the calculated number (leading zeros) fits within the size-33 array. Does Java have the concept of a 256-sized array and uint8 numbers?

In my Go implementation (which uses bits.Len, i.e., the opposite of leading zeroes, here: https://github.com/twmb/franz-go/blob/0a23cca3f4ee9d69b6cb2e70aab1c8e012b901ad/pkg/kbin/primitives.go#L76-L87), using a size-256 string and a uint8 allows for no bounds check, resulting in even fewer instructions and shaving a fraction of a nanosecond of time in comparison to the pure math version.

In this benchmark,

func BenchmarkUvarintLen(b *testing.B) {
	z := 0
	for i := 0; i < b.N; i++ {
		z += UvarintLen(uint32(i))
	}
	b.Log(z)
}

Comparing the existing code to this code:

func UvarintLen(u uint32) int {
        lz := bits.LeadingZeros32(u)
        return (((38 - lz) * 0b10010010010010011) >> 19) + (lz >> 5)
}

Top is math version, bottom is lookup table:

BenchmarkUvarintLen-8   	1000000000	         1.179 ns/op
BenchmarkUvarintLen-8   	1000000000	         0.7811 ns/op

Contributor Author

@jasonk000 jasonk000 Feb 8, 2022

I like your approach here, and it's something that should be possible, but I don't see it being applied; the code below still includes the cmp/jump.

final static byte[] LEADING_ZEROS_TO_U_VARINT_SIZE = new byte[4096];

Byte is -128 to +127 in Java, so 4096 should be plenty.

byte leadingZeros = (byte) Integer.numberOfLeadingZeros(inputInt);
return LEADING_ZEROS_TO_U_VARINT_SIZE[leadingZeros];
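
One Java-side trick that can sometimes recover bounds-check elimination (an assumption worth benchmarking, not something verified in this thread) is masking the index into a power-of-two-sized table, which gives the JIT a provable range. An illustrative sketch:

```java
public class MaskedLookupSketch {
    // Hypothetical variant: SIZE_TABLE has 64 entries (33 used, padded), and
    // the index is masked with & 0x3f, so it is provably always in range.
    static final byte[] SIZE_TABLE = new byte[64];
    static {
        for (int lz = 0; lz <= 32; lz++) {
            // A value with exactly lz leading zeros; lz == 32 means value == 0.
            int v = (lz == 32) ? 0 : 0x80000000 >>> lz;
            int bytes = 1;
            for (int x = v; (x & 0xffffff80) != 0; x >>>= 7)
                bytes++;
            SIZE_TABLE[lz] = (byte) bytes;
        }
    }

    static int sizeOfUnsignedVarint(int value) {
        return SIZE_TABLE[Integer.numberOfLeadingZeros(value) & 0x3f];
    }
}
```

Whether the JIT actually drops the cmp/jnb for this form would need confirming with -prof perfasm.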


Ah bummer, but interesting! Fun between languages :)

@ijuma
Member

ijuma commented Feb 2, 2022

Thanks for the detailed analysis! Note that JMH 1.34 (https://mail.openjdk.java.net/pipermail/jmh-dev/2021-December/003406.html) has support for much cheaper blackholes if executed with Java 17 or later. On that note, which Java version are you using in your experiments?

@jasonk000
Contributor Author

jasonk000 commented Feb 2, 2022

Good spot on JMH 1.34, TIL! Here's a result from a re-run:

jkoch@jkoch:~/code/kafka$ java -version
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment (build 17.0.1+12-Ubuntu-121.10)
OpenJDK 64-Bit Server VM (build 17.0.1+12-Ubuntu-121.10, mixed mode, sharing)
jkoch@jkoch:~/code/kafka$ cat gradle/dependencies.gradle  | grep jmh
  jmh: "1.34",
  jmhCore: "org.openjdk.jmh:jmh-core:$versions.jmh",
  jmhCoreBenchmarks: "org.openjdk.jmh:jmh-core-benchmarks:$versions.jmh",

With -perfnorm, the interesting results seem to be:

  • IPC, cache miss rates are similarly good across all
  • The math version has 2 branches, the lookup table has 3 branches, and the baseline loop has 6 branches. All well predicted.
  • It seems to simply come down to instruction counts: math has 31, lookup has 15, and baseline has 27.

Benchmark                                                                        Mode  Cnt        Score       Error      Units
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne                              thrpt    5   520507.594 ±  3323.803     ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne:IPC                          thrpt             4.759              insns/clk
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne:branches                     thrpt             2.002                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne:cycles                       thrpt             6.518                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintMathOne:instructions                 thrpt            31.016                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne                                  thrpt    5  1035024.683 ± 25061.922     ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne:IPC                              thrpt             4.576              insns/clk
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne:branches                         thrpt             3.000                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne:cycles                           thrpt             3.278                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOne:instructions                     thrpt            15.000                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne                          thrpt    5   676262.808 ±  4065.238     ops/ms
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne:IPC                      thrpt             5.385              insns/clk
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne:branches                 thrpt             6.001                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne:cycles                   thrpt             5.016                   #/op
ByteUtilsBenchmark.testSizeOfUnsignedVarintOriginalOne:instructions             thrpt            27.013                   #/op

Comment thread clients/src/main/java/org/apache/kafka/common/utils/ByteUtils.java
Member

@ijuma ijuma left a comment

Thanks for the PR, this looks good to me overall. I left a few minor comments that we should address before merging.

// return (38 - leadingZeros) / 7 + leadingZeros / 32;
int leadingZerosBelow38DividedBy7 = ((38 - leadingZeros) * 0b10010010010010011) >>> 19;
return leadingZerosBelow38DividedBy7 + (leadingZeros >>> 5);
}
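
As a sanity check on the constant above (an illustrative aside, not part of the patch): 0b10010010010010011 is 74899 == ceil(2^19 / 7), so the multiply-and-shift agrees with plain division by 7 over the whole range the formula can produce (0 to 38):

```java
public class MagicDivisionCheck {
    public static void main(String[] args) {
        // Verify (n * 74899) >>> 19 == n / 7 for every n the varint formula needs.
        for (int n = 0; n <= 38; n++) {
            int viaMagic = (n * 0b10010010010010011) >>> 19;
            if (viaMagic != n / 7)
                throw new AssertionError("mismatch at n=" + n);
        }
        System.out.println("ok: magic constant matches /7 for 0..38");
    }
}
```

This is the standard reciprocal-multiplication trick: multiply by a rounded-up fixed-point reciprocal, then shift, trading a division for an imul and a shift.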
Member

Why do we need this here since it matches the default implementation?


@Test
public void testSizeOfUnsignedVarintMath() {
for (int i = 0; i < Integer.MAX_VALUE; i++) {
Member

Is this too slow for a unit test? Similar question for other tests that do a large range of values.

Contributor Author

The tests I have now run in ~0.6 seconds.

/**
* The old well-known implementation for sizeOfUnsignedVarint
*/
private static int oldSizeOfUnsignedVarint(int value) {
Member

Maybe we can call this simpleSizeOfUnsignedVarint or something like that?

Contributor Author

I nested these inside the test case, which I think is cleaner.

@Fork(3)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public int testSizeOfUnsignedVarintOriginal() {
Member

Maybe we can say Simple instead of Original (or whatever name we agree for the test, we should use it here too for consistency).

@jasonk000
Contributor Author

Thanks @ijuma , I believe I've sorted these out in e891ebf.

Member

@ijuma ijuma left a comment

Thanks for the updates, looks like spotBugs is not happy. While fixing that, maybe you can also tackle the two minor nits below. We can hopefully merge after that.


// Similar logic is applied to allow for 64-bit input -> 1-10 byte output.

// return (70 - leadingZeros) / 7 + leadingZeros / 64;
Member

@ijuma ijuma Feb 4, 2022

Nit: not sure we need the empty lines between comments here, seems pretty readable without them.
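
As an aside, the 64-bit formula quoted in that hunk can be sketched as a runnable method (illustrative class name, not the exact patch):

```java
public class VarlongSizeSketch {
    // Long.numberOfLeadingZeros is in [0, 64]; lz / 64 contributes the +1
    // needed when value == 0, and lz == 0 (all 64 bits set) gives 70/7 = 10.
    static int sizeOfUnsignedVarlong(long value) {
        int lz = Long.numberOfLeadingZeros(value);
        return (70 - lz) / 7 + lz / 64;
    }
}
```

Note that a value with all 64 bits set encodes to 10 bytes as an unsigned varlong, which the formula covers.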

@Benchmark
@Fork(3)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
Member

Should we have these at class level since they're the same for every benchmark?

@jasonk000
Contributor Author

I've addressed the comments; I'll wait and see what spotBugs says this time. Locally it's all clear; on CI it flags issues with code not related to this PR.

@jasonk000
Contributor Author

I had misread the spotBugs output and have now used the CI spotBugs steps; it's fixed =)

Member

@ijuma ijuma left a comment

LGTM, thanks!

@ijuma ijuma changed the title KAFKA-13629: use faster algorithm for ByteUtils sizeOfXxx algorithm KAFKA-13629: Use faster algorithm for ByteUtils sizeOfXxx algorithm Feb 6, 2022
@ijuma
Member

ijuma commented Feb 6, 2022

I updated the PR description to include the JMH results I got with JDK 17.0.2 and compiler blackhole mode.

@ijuma ijuma merged commit c05403f into apache:trunk Feb 6, 2022
@showuon
Member

showuon commented Feb 7, 2022

@jasonk000 , thanks for the good improvement! I learned a lot from the PR! Also updated the JIRA status. Thank you.
