You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a PEP-Request. I propose to create a extensible SPI that can be used to change the PinotDataBuffer implementation used at runtime. Treat this issue as a clone of #9162, but more formal.
What needs to be done?
Right now PinotDataBuffer instances are obtained by calling static methods:
PinotDataBuffer loadFile(File file, long offset, long size, ByteOrder byteOrder, String description)
PinotDataBuffer mapFile(File file, boolean readOnly, long offset, long size, ByteOrder byteOrder, String description)
These methods have a static implementation that return a PinotDataBuffer that is backed by a ByteBuffer or by a LArray, depending on whether the requested size is greater than 2GBs or not. ByteBuffers are faster and more reliable than LArray, but they cannot be larger than Integer.MAX_INT (aka 2GB - 1byte).
Here I propose to change that implementation. Instead of having a static implementation, the algorithm used to instantiate the PinotDataBuffer will be delegated to an interface. The specific implementation will be set at Pinot startup time by reading the property pinot.offheap.buffer.factory, which will be an optional property whose value is a String. In case it exists, it should the qualified class name of a Class that implements PinotDataBufferFactory. There will be also another optional property called pinot.offheap.prioritize.bytebuffer which will be a boolean. If it is true, then the ByteBuffer implementation will be used when less than 2GB buffers are requested.
Why the feature is needed
The reason to be able to have new PinotDataBuffer implementations is that LArray is not compatible with Java >= 16, as detected in #8529. It also seems that our LArray implementation may have some bugs (see #10774) and it doesn't seem that LArray will be updated (See xerial/larray#75), so we need to move on.
Initial idea/proposal
Adding a SPI is something we have done several times in the last months/years. Most of the times we use one or more configuration properties to decide which instance we need to use and set that instance in a static attribute. That system works, but it is problematic when we want to change the implementation in an isolated way (for example, when doing tests).
That is why I propose to extend this system with a thread local on top of that. Something like:
/** * The default {@link PinotBufferFactory} used by all threads that do not define their own factory. */privatestaticPinotBufferFactory_defaultFactory = createDefaultFactory();
/** * A thread local variable that can be used to customize the {@link PinotBufferFactory} used on tests. This is mostly * useful in tests. */privatestaticfinalThreadLocal<PinotBufferFactory> _FACTORY = newThreadLocal<>();
/** * Creates the default factory depending on the JVM version */publicstaticPinotBufferFactorycreateDefaultFactory() {...}
/** * Changes the default factory */publicstaticvoidsetDefaultFactory(PinotBufferFactory) {...}
/** * Change the {@link PinotBufferFactory} used by the current thread. * * @see #loadDefaultFactory(PinotConfiguration) */publicstaticvoiduseFactory(PinotBufferFactoryfactory) {
_FACTORY.set(factory);
}
/** * Returns the factory the current thread should use. */publicstaticPinotBufferFactorygetFactory() {
PinotBufferFactorypinotBufferFactory = _FACTORY.get();
if (pinotBufferFactory == null) {
pinotBufferFactory = _defaultFactory;
}
returnpinotBufferFactory;
}
Then the static methods PinotDataBuffer.allocateDirect (and others) should delegate on PinotDataBuffer.getFactory(). By doing that, tests can call PinotDataBuffer.useFactory() in order to use, on that thread, the buffer library they want to test. Exploratory draft #10528 implements all of these and modifies tests to use this API.
Actual buffer implementations
This is an optional part of the PEP. What we request is to have the PinotBufferFactory SPI. This is what we would like to do with it, but it could be part of another PEP if necessary.
Given that we cannot use LArray in modern versions of Java, we have three alternatives :
Use other third party libraries
Implement our own library on top of Unsafe
Use Foreign Memory API
I tried to use Chronicle Bytes in the past with partial success, see #9842. Given that I don't know more third party buffer libraries that use long offsets (Netty and Agrona use ints), I explored the last two alternatives in #10528.
Foreign Memory API is clearly the future, as it provides exactly what we need: A high performance buffer API that is maintained (given that will be included in the JVM) and uses longs as offsets and sizes. But it has two issues:
Wait until it is stable (possible in Java 22 or 23, which won't be LTS, so maybe we need to wait 2 years until Java 25).
Use it as a preview in Java 21. This would imply that we could either run with Java 11+LArray or Java 21+Foreign, but do not support Java 17. Also, in Java 21 we would need to compile and start with --enable-preview.
Create our own library.
Previously mentioned #10528 does create two buffer implementations: One on top of Unsafe and another on top of Foreign Memory API. The latter is trivial to implement, but it is very difficult to include in the current Apache Pinot CI. The problems I found are:
Some Maven plugins don't work in Java 21-ea (at least spotless doesn't work)
In general, it is difficult to maintain in the same compilation unit code that has to be compiled with Java 11 and code that has to be compiled with Java 21
Presto still requires to use Java 8, which makes it even more difficult.
In case we decide to use my draft in the actual implementation of the PEP, I would suggest to remove the Foreign Memory API implementation from the branch and optionally move it to another github repo.
I've also modified BenchmarkPinotDataBuffer in order to run with the Unsafe based implementation. The modifications are in the draft.
I've run the benchmark in a M1 Pro and in a Ryzen 9 3900X with Ubuntu 22.10. The exact results of the benchmark can be found here but the following chars should be good enough:
Please note that LArray only runs in Java 11, so there are no data with LArray in Java 17 and 21.
This one shows the cost of executing batch writes (aka call PinotDataBuffer.readFrom) and batch read (aka call PinotDataBuffer.copyTo) with a byte array of 1024 elements (units are ns/op in Linux):
It is important to note that the implementation used in both LArray and Unsafe when dealing with batch reads and writes is to create a temporal direct byte buffer that points to the same address of the current buffer, so we can assume that the performance difference is due to the new instance creation.
This one shows the cost of executing non batch writes (calling PinotDataBuffer.putByte()) and reads (calling PinotDataBuffer.getByte) in a loop with 1024 consecutive offsets starting from a random one (units are ns/op in Linux):
In my opinion the latest is the most interesting, as it shows the improvements introduced in Java 17 and 21 JIT when dealing with loops.
I've also tried to add the Foreign Memory API implementation in the benchmark, but I found several problems so I gave up.
This is a PEP-Request. I propose to create a extensible SPI that can be used to change the PinotDataBuffer implementation used at runtime. Treat this issue as a clone of #9162, but more formal.
What needs to be done?
Right now PinotDataBuffer instances are obtained by calling static methods:
These methods have a static implementation that return a PinotDataBuffer that is backed by a ByteBuffer or by a LArray, depending on whether the requested size is greater than 2GBs or not. ByteBuffers are faster and more reliable than LArray, but they cannot be larger than Integer.MAX_INT (aka 2GB - 1byte).
Here I propose to change that implementation. Instead of having a static implementation, the algorithm used to instantiate the PinotDataBuffer will be delegated to an interface. The specific implementation will be set at Pinot startup time by reading the property
pinot.offheap.buffer.factory, which will be an optional property whose value is a String. In case it exists, it should the qualified class name of a Class that implementsPinotDataBufferFactory. There will be also another optional property calledpinot.offheap.prioritize.bytebufferwhich will be a boolean. If it is true, then the ByteBuffer implementation will be used when less than 2GB buffers are requested.Why the feature is needed
The reason to be able to have new PinotDataBuffer implementations is that LArray is not compatible with Java >= 16, as detected in #8529. It also seems that our LArray implementation may have some bugs (see #10774) and it doesn't seem that LArray will be updated (See xerial/larray#75), so we need to move on.
Initial idea/proposal
Adding a SPI is something we have done several times in the last months/years. Most of the times we use one or more configuration properties to decide which instance we need to use and set that instance in a static attribute. That system works, but it is problematic when we want to change the implementation in an isolated way (for example, when doing tests).
That is why I propose to extend this system with a thread local on top of that. Something like:
Then the static methods PinotDataBuffer.allocateDirect (and others) should delegate on PinotDataBuffer.getFactory(). By doing that, tests can call
PinotDataBuffer.useFactory()in order to use, on that thread, the buffer library they want to test. Exploratory draft #10528 implements all of these and modifies tests to use this API.Actual buffer implementations
This is an optional part of the PEP. What we request is to have the PinotBufferFactory SPI. This is what we would like to do with it, but it could be part of another PEP if necessary.
Given that we cannot use LArray in modern versions of Java, we have three alternatives :
I tried to use Chronicle Bytes in the past with partial success, see #9842. Given that I don't know more third party buffer libraries that use long offsets (Netty and Agrona use ints), I explored the last two alternatives in #10528.
Foreign Memory API is clearly the future, as it provides exactly what we need: A high performance buffer API that is maintained (given that will be included in the JVM) and uses longs as offsets and sizes. But it has two issues:
Therefore we have three options:
Previously mentioned #10528 does create two buffer implementations: One on top of Unsafe and another on top of Foreign Memory API. The latter is trivial to implement, but it is very difficult to include in the current Apache Pinot CI. The problems I found are:
In case we decide to use my draft in the actual implementation of the PEP, I would suggest to remove the Foreign Memory API implementation from the branch and optionally move it to another github repo.
I've also modified BenchmarkPinotDataBuffer in order to run with the Unsafe based implementation. The modifications are in the draft.
I've run the benchmark in a M1 Pro and in a Ryzen 9 3900X with Ubuntu 22.10. The exact results of the benchmark can be found here but the following chars should be good enough:
Please note that LArray only runs in Java 11, so there are no data with LArray in Java 17 and 21.
This one shows the cost of executing batch writes (aka call PinotDataBuffer.readFrom) and batch read (aka call PinotDataBuffer.copyTo) with a byte array of 1024 elements (units are ns/op in Linux):

It is important to note that the implementation used in both LArray and Unsafe when dealing with batch reads and writes is to create a temporal direct byte buffer that points to the same address of the current buffer, so we can assume that the performance difference is due to the new instance creation.
This one shows the cost of executing non batch writes (calling PinotDataBuffer.putByte()) and reads (calling PinotDataBuffer.getByte) in a loop with 1024 consecutive offsets starting from a random one (units are ns/op in Linux):

In my opinion the latest is the most interesting, as it shows the improvements introduced in Java 17 and 21 JIT when dealing with loops.
I've also tried to add the Foreign Memory API implementation in the benchmark, but I found several problems so I gave up.