
Only initiate PatchPoint when needed. #565

Merged
linlin-s merged 3 commits into update-patches from patchpoint-optimization on Sep 20, 2023
Conversation

linlin-s (Contributor):

Issue #, if available:
N/A

Note: This is the implementation on top of PR #521

Description of changes:

Previously, we allocated a placeholder patch point for every container. During the popContainer process, these placeholders were assigned the appropriate patch point values. However, when a container turned out not to require a patch point, we had to reclaim its placeholder. To eliminate the need to reclaim unused placeholder patch points, we implemented the following changes:

Instead of initializing a placeholder patch point for each container, we now maintain the index of the patch point associated with each container. By default, the patch point index is set to -1, indicating that no patch point has been assigned to that container.

During the popContainer process, child containers are popped first, while their parent containers remain on the stack. If the current (child) container meets a condition that requires a patch point, its ancestors must also need patch points, since no container can be smaller than its contents. At that point, we trace back through the ancestors, allocating a placeholder patch point and assigning a patchIndex to each ancestor container, until we encounter an ancestor with an already-assigned patch point. The placeholder values are replaced with the correct data as the ancestors are popped. (A sketch of this back-fill step appears after the next paragraph.)

To verify that the new changes show the expected performance improvements, we also include benchmark results using generated test data. The generated test data contains a stream of 500,000 nested container values, and every container requires patch point allocation.
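A minimal sketch of the back-fill step, assuming hypothetical surrounding types (only ContainerInfo, the patchIndex field, and its -1 default come from this PR; the patch list, stack type, and method names below are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; not the PR's actual classes.
class PatchPoint {
    long position;  // placeholder data, overwritten when the container is popped
    long length;
}

class ContainerInfo {
    int patchIndex = -1;  // -1 means no patch point has been assigned yet
}

class PatchBackfillSketch {
    private final List<PatchPoint> patchPoints = new ArrayList<>();
    private final Deque<ContainerInfo> containerStack = new ArrayDeque<>();

    // Called when the container currently being popped turns out to need a
    // patch point: every enclosing container must then get one as well.
    void ensureAncestorsHavePatchPoints() {
        // Iterate from the innermost remaining container outward.
        for (ContainerInfo ancestor : containerStack) {
            if (ancestor.patchIndex != -1) {
                // This ancestor already has a patch point, so all of its
                // ancestors do too; stop tracing back.
                break;
            }
            // Allocate a placeholder and remember its index; the real values
            // are filled in when this ancestor is popped.
            patchPoints.add(new PatchPoint());
            ancestor.patchIndex = patchPoints.size() - 1;
        }
    }
}
```

Because the loop stops at the first ancestor that already has a patch point, each container is assigned at most once, so the total back-fill work stays proportional to the number of containers.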

Results summary:
When we benchmarked the new implementation with the container-only test data, we found a 7.12% performance regression in the current implementation compared to the previous implementation (patch list as a single contiguous array). However, when we benchmarked with real-world data, the current implementation showed a 5.42% speed improvement over the previous implementation on test data log_59155.ion, and a 6.12% speed improvement on test data log_194617.ion. Overall, after the new patch point changes, we gained a 1.84% improvement over the original implementation on dataset log_59155.ion, a 4.38% improvement on dataset log_194617.ion, and a 28.23% improvement on dataset generatedContainerOnlyTestData.10n.

| Test Data | No Change | Patchpoints as Single Contiguous Array | PatchPoint Initialization Optimization (Current) | Improvement vs. No Change | Improvement vs. Single Contiguous Array |
| --- | --- | --- | --- | --- | --- |
| log_59155.ion | 507.505 | 526.728 | 498.13 | 1.84% | 5.42% |
| log_194617.ion | 4108.958 | 4185.182 | 3928.665 | 4.38% | 6.12% |
| generatedContainerOnlyTestData.10n | 1060.129 | 710.243 | 760.815 | 28.23% | -7.12% |

All timings in ms/op.

Next step:
We should investigate the cause of the performance regression observed when benchmarking with the container-only test data. Comparing the profiling results of the two implementations should give us more insight into how to improve the current implementation.

Full benchmark results:
Benchmark a write of data equivalent to a stream of 500,000 nested container values using IonWriter (binary). The output is written into an in-memory buffer. (3 forks, 2 warmups, 2 iterations, preallocation 1)

| Benchmark | No Change | Patchpoints as Single Contiguous Array | PatchPoint Initialization Optimization (Current) | Units |
| --- | --- | --- | --- | --- |
| Bench.run | 1060.129 | 710.243 | 760.815 | ms/op |
| Bench.run:Heap usage | 2494.653 | 1835.835 | 1542.986 | MB |
| Bench.run:Serialized size | 219 | 219 | 219 | MB |
| Bench.run:·gc.alloc.rate | 710.967 | 928.757 | 867.017 | MB/sec |
| Bench.run:·gc.alloc.rate.norm | 827411053 | 725002740 | 725002290 | B/op |
| Bench.run:·gc.churn.G1_Eden_Space | 295.055 | 290.346 | 271.915 | MB/sec |
| Bench.run:·gc.churn.G1_Eden_Space.norm | 343828070 | 226638884 | 227370070 | B/op |
| Bench.run:·gc.churn.G1_Old_Gen | 146.525 | 194.51 | 164.57 | MB/sec |
| Bench.run:·gc.churn.G1_Old_Gen.norm | 170611935 | 151583561 | 137733978 | B/op |
| Bench.run:·gc.churn.G1_Survivor_Space | 12.524 | 21.643 | 25.668 | MB/sec |
| Bench.run:·gc.churn.G1_Survivor_Space.norm | 14518855.5 | 16913286.1 | 21485215.9 | B/op |
| Bench.run:·gc.count | 83 | 137 | 128 | counts |
| Bench.run:·gc.time | 21521 | 7134 | 7007 | ms |
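For context, a container-only stream like the one benchmarked above could be generated along these lines. This is a hypothetical sketch using the public ion-java writer API (IonBinaryWriterBuilder, stepIn/stepOut), not the actual generator behind generatedContainerOnlyTestData.10n; whether a given container truly requires a patch point depends on its encoded length relative to the preallocated length bytes.

```java
import com.amazon.ion.IonType;
import com.amazon.ion.IonWriter;
import com.amazon.ion.system.IonBinaryWriterBuilder;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical generator for nested-container test data.
public class GenerateNestedContainers {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (IonWriter writer = IonBinaryWriterBuilder.standard().build(out)) {
            for (int i = 0; i < 500_000; i++) {
                writer.stepIn(IonType.LIST);   // outer container
                writer.stepIn(IonType.SEXP);   // nested container
                // Write enough values that each container's encoded length
                // exceeds the preallocated length bytes and needs patching.
                for (int j = 0; j < 20; j++) {
                    writer.writeInt(Integer.MAX_VALUE);
                }
                writer.stepOut();
                writer.stepOut();
            }
        }
        System.out.println("Wrote " + out.size() + " bytes of binary Ion");
    }
}
```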

Benchmark a write of data equivalent to a stream of 194,617 nested binary Ion values using IonWriter (binary). The output is written into an in-memory buffer. (3 forks, 2 warmups, 2 iterations, preallocation 1)

| Benchmark | No Change | Patchpoints as Single Contiguous Array | PatchPoint Initialization Optimization (Current) | Units |
| --- | --- | --- | --- | --- |
| Bench.run | 4108.958 | 4185.182 | 3928.665 | ms/op |
| Bench.run:Heap usage | 3082.915 | 2621.584 | 2626.018 | MB |
| Bench.run:Serialized size | 201.663 | 201.663 | 201.663 | MB |
| Bench.run:·gc.alloc.rate | 155.321 | 152.185 | 159.247 | MB/sec |
| Bench.run:·gc.alloc.rate.norm | 703012445 | 696433814.2 | 685816196 | B/op |
| Bench.run:·gc.churn.G1_Eden_Space | 28.499 | 34.858 | 34.25 | MB/sec |
| Bench.run:·gc.churn.G1_Eden_Space.norm | 129324373 | 159616568.9 | 147499691 | B/op |
| Bench.run:·gc.churn.G1_Old_Gen | 49.755 | 74.813 | 79.79 | MB/sec |
| Bench.run:·gc.churn.G1_Old_Gen.norm | 225884103 | 342237838.2 | 343763940 | B/op |
| Bench.run:·gc.churn.G1_Survivor_Space | 2.343 | 0.176 | 2.934 | MB/sec |
| Bench.run:·gc.churn.G1_Survivor_Space.norm | 10555428.9 | 815559.111 | 12582912 | B/op |
| Bench.run:·gc.count | 13 | 12 | 14 | counts |
| Bench.run:·gc.time | 549 | 210 | 229 | ms |

Benchmark a write of data equivalent to a stream of 59,155 nested binary Ion values. The output is written into an in-memory buffer. (3 forks, 2 warmups, 2 iterations, preallocation 1)

| Benchmark | No Change | Patchpoints as Single Contiguous Array | PatchPoint Initialization Optimization (Current) | Units |
| --- | --- | --- | --- | --- |
| Bench.run | 507.505 | 526.728 | 498.13 | ms/op |
| Bench.run:Heap usage | 359.78 | 333.188 | 409.035 | MB |
| Bench.run:Serialized size | 21.271 | 21.271 | 21.271 | MB |
| Bench.run:·gc.alloc.rate | 125.91 | 119.233 | 126.107 | MB/sec |
| Bench.run:·gc.alloc.rate.norm | 70381489.1 | 69094211.64 | 69108945.6 | B/op |
| Bench.run:·gc.churn.G1_Eden_Space | 7.382 | 3.498 | 3.247 | MB/sec |
| Bench.run:·gc.churn.G1_Eden_Space.norm | 4127727.75 | 2024487.523 | 1780914.79 | B/op |
| Bench.run:·gc.churn.G1_Old_Gen | 131.22 | 128.452 | 129.58 | MB/sec |
| Bench.run:·gc.churn.G1_Old_Gen.norm | 73365937 | 74415311.5 | 71001078.2 | B/op |
| Bench.run:·gc.churn.G1_Survivor_Space | 0.081 | 0.048 | 0.035 | MB/sec |
| Bench.run:·gc.churn.G1_Survivor_Space.norm | 46186.201 | 28189.211 | 19001.263 | B/op |
| Bench.run:·gc.count | 95 | 86 | 68 | counts |
| Bench.run:·gc.time | 728 | 145 | 113 | ms |

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor:

Is this intentionally committed?

linlin-s (Contributor Author):

> Is this intentionally committed?

This shouldn't be committed, will be removed in the next commit.

// Before this PR: fields were assigned directly on the pushed ContainerInfo.
ContainerInfo containerInfo = containerStack.push();
containerInfo.type = valueType;
containerInfo.endPosition = valueEndPosition;
// After this PR: initialization is supplied as a lambda to push().
containerStack.push(c -> c.initialize(valueType, valueEndPosition));
Contributor:
This is a change that would warrant performance testing of binary incremental reading to ensure it does not introduce a regression. Do we see extra overhead introduced by creation and invocation of a lambda on each stepIn?
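For reference, the allocation behavior in question can be seen with a snippet like the following (hypothetical, not from this PR): on HotSpot, a lambda that captures locals materializes a new object each time the expression is evaluated, while a non-capturing lambda is typically a cached singleton.

```java
import java.util.function.Supplier;

public class LambdaAllocationDemo {
    static Supplier<Integer> capturing(int x) {
        return () -> x;    // captures x: a new instance per evaluation
    }

    static Supplier<Integer> nonCapturing() {
        return () -> 42;   // captures nothing: typically one cached instance
    }

    public static void main(String[] args) {
        System.out.println(capturing(1) == capturing(1));       // false on HotSpot: distinct objects
        System.out.println(nonCapturing() == nonCapturing());   // true on HotSpot: reused object
    }
}
```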

linlin-s (Contributor Author):
I will check whether the change introduces a regression in binary incremental reading.

linlin-s (Contributor Author):
Here are the benchmark results from running the incremental reader against five sets of test data; the results are neutral.

| Test Data | Before | After |
| --- | --- | --- |
| log_59155.ion | 242.721 | 249.478 |
| log_194617.ion | 1858.668 | 1854.447 |
| test.json | 4273.663 | 4266.974 |
| catalog.ion | 0.476 | 0.478 |
| singleValue.10n | 0.063 | 0.063 |

{
    // If we're adding a patch point we first need to ensure that all of our ancestors (containing values)
    // already have a patch point. No container can be smaller than its contents, so all outer layers also
    // require patches.
    ListIterator<ContainerInfo> stackIterator = containers.iterator();
Contributor:
This allocates a new $Iterator every time. It looks like we could store and reuse a single instance, which may improve performance.

linlin-s (Contributor Author):
Here are the benchmark comparison results between commits ad7c0f9 and 05acf07. According to the results, there is a 0.78% performance improvement using test data generatedContainerOnlyTestData.10n, a 1.13% improvement using test data log_59155.ion, and a 0.69% improvement using test data log_194617.ion.

| Test Data | Before | After | Improvement |
| --- | --- | --- | --- |
| generatedContainerOnlyTestData.10n | 770.678 | 764.687 | 0.78% |
| log_59155.ion | 500.689 | 495.008 | 1.13% |
| log_194617.ion | 3994.782 | 3967.299 | 0.69% |

Here are the overall benchmark results after the change from commit 05acf07. (We re-ran ion-java-benchmark-cli on every change to obtain this round of benchmark results.)

| Test Data | No Change | Patchpoints as Single Contiguous Array | PatchPoint Initialization Optimization (Updated) | Improvement vs. No Change | Improvement vs. Single Contiguous Array |
| --- | --- | --- | --- | --- | --- |
| generatedContainerOnlyTestData.10n | 1125.62 | 709.77 | 764.687 | 32.06% | -7.73% |
| log_59155.ion | 506.531 | 522.596 | 495.008 | 2.27% | 5.28% |
| log_194617.ion | 4087.047 | 4192.821 | 3967.299 | 2.92% | 5.38% |

Comment on lines 17 to 18
stackIterator = new $Iterator();
return stackIterator;
Contributor:
I think this will be cleaner if, in this method, you:

  1. Allocate the iterator only if it has not yet been allocated, and
  2. Reset the iterator

That way you don't need a public resetIterator method, which you call every time you need to retrieve the iterator anyway. Also, you don't need to store a stackIterator variable in IonRawBinaryWriter, as retrieving it via containers.iterator() where you need it will be sufficient. In other words, iterator reuse can be achieved without any changes to IonRawBinaryWriter.

linlin-s (Contributor Author), Sep 19, 2023:
Thanks for the suggestions. I experimented with two versions of the iterator() method.

Conditional initialization:

public ListIterator<T> iterator() {
    if (stackIterator != null) {
        stackIterator.cursor = _Private_RecyclingStack.this.currentIndex;
    } else {
        stackIterator = new $Iterator();
    }
    return stackIterator;
}

One concern I have with this change is that we need to run the if/else check every time we call containers.iterator().

Unconditional initialization:

I also tried initializing the $Iterator while constructing the _Private_RecyclingStack, so that iterator() only resets the cursor:

public final class _Private_RecyclingStack<T> implements Iterable<T> {
    // The stackIterator is allocated once, in the constructor.
    private $Iterator stackIterator;

    public _Private_RecyclingStack(int initialCapacity, ElementFactory<T> elementFactory) {
        elements = new ArrayList<>(initialCapacity);
        this.elementFactory = elementFactory;
        currentIndex = -1;
        top = null;
        stackIterator = new $Iterator();
    }

    public ListIterator<T> iterator() {
        // Reuse the single iterator instance; only the cursor is reset.
        stackIterator.cursor = _Private_RecyclingStack.this.currentIndex;
        return stackIterator;
    }

This change might cause an unnecessary iterator allocation when we never need to iterate over the stack. From the benchmark results, the first method (conditional iterator initialization) is more performant. Are there any alternative implementations I wasn't aware of? Thanks.

| Test Data | Conditional Allocation | Unconditional Allocation |
| --- | --- | --- |
| log_59155.ion | 496.576 | 501.664 |
| log_194617.ion | 3924.792 | 3965.73 |

Contributor:

The if/else is only a problem if it results in a noticeable performance degradation, but it doesn't look like it does. It's surprising that unconditional allocation is slower, but I think the conditional allocation is fine.

linlin-s merged commit 6a1cba1 into update-patches on Sep 20, 2023.
linlin-s added commits that referenced this pull request on Nov 7, 2023.
linlin-s deleted the patchpoint-optimization branch on January 16, 2024.