Extensible PinotDataBuffer SPI

This is a PEP-Request. I propose to create a extensible SPI that can be used to change the PinotDataBuffer implementation used at runtime. Treat this issue as a clone of https://github.com/apache/pinot/issues/9162, but more formal.

# What needs to be done?

Right now PinotDataBuffer instances are obtained by calling static methods:
- PinotDataBuffer allocateDirect(long size, ByteOrder byteOrder, String description)
- PinotDataBuffer loadFile(File file, long offset, long size, ByteOrder byteOrder, String description)
- PinotDataBuffer mapFile(File file, boolean readOnly, long offset, long size, ByteOrder byteOrder, String description)

These methods have a static implementation that return a PinotDataBuffer that is backed by a ByteBuffer or by a LArray, depending on whether the requested size is greater than 2GBs or not. ByteBuffers are faster and more reliable than LArray, but they cannot be larger than Integer.MAX_INT (aka 2GB - 1byte).

Here I propose to change that implementation. Instead of having a static implementation, the algorithm used to instantiate the PinotDataBuffer will be delegated to an interface. The specific implementation will be set at Pinot startup time by reading the property `pinot.offheap.buffer.factory`, which will be an optional property whose value is a String. In case it exists, it should the qualified class name of a Class that implements `PinotDataBufferFactory`. There will be also another optional property called `pinot.offheap.prioritize.bytebuffer` which will be a boolean. If it is true, then the ByteBuffer implementation will be used when less than 2GB buffers are requested.

# Why the feature is needed 

The reason to be able to have new PinotDataBuffer implementations is that LArray is not compatible with Java >= 16, as detected in #8529. It also seems that our LArray implementation may have some bugs (see https://github.com/apache/pinot/pull/10774) and it doesn't seem that LArray will be updated (See https://github.com/xerial/larray/issues/75), so we need to move on.

# Initial idea/proposal

Adding a SPI is something we have done several times in the last months/years. Most of the times we use one or more configuration properties to decide which instance we need to use and set that instance in a static attribute. That system works, but it is problematic when we want to change the implementation in an isolated way (for example, when doing tests).

That is why I propose to extend this system with a thread local on top of that. Something like:

```java
  /**
   * The default {@link PinotBufferFactory} used by all threads that do not define their own factory.
   */
  private static PinotBufferFactory _defaultFactory = createDefaultFactory();
  /**
   * A thread local variable that can be used to customize the {@link PinotBufferFactory} used on tests. This is mostly
   * useful in tests.
   */
  private static final ThreadLocal<PinotBufferFactory> _FACTORY = new ThreadLocal<>();

  /**
   * Creates the default factory depending on the JVM version
   */
  public static PinotBufferFactory createDefaultFactory() {...}

  /**
   * Changes the default factory
   */
  public static void setDefaultFactory(PinotBufferFactory) {...}

  /**
   * Change the {@link PinotBufferFactory} used by the current thread.
   *
   * @see #loadDefaultFactory(PinotConfiguration)
   */
  public static void useFactory(PinotBufferFactory factory) {
    _FACTORY.set(factory);
  }

  /**
   * Returns the factory the current thread should use.
   */
  public static PinotBufferFactory getFactory() {
    PinotBufferFactory pinotBufferFactory = _FACTORY.get();
    if (pinotBufferFactory == null) {
      pinotBufferFactory = _defaultFactory;
    }
    return pinotBufferFactory;
  }
```

Then the static methods PinotDataBuffer.allocateDirect (and others) should delegate on PinotDataBuffer.getFactory(). By doing that, tests can call `PinotDataBuffer.useFactory()` in order to use, on that thread, the buffer library they want to test. Exploratory draft https://github.com/apache/pinot/pull/10528 implements all of these and modifies tests to use this API.

# Actual buffer implementations

This is an optional part of the PEP. What we request is to have the PinotBufferFactory SPI. This is what we would like to do with it, but it could be part of another PEP if necessary.

Given that we cannot use LArray in modern versions of Java, we have three alternatives :
- Use other third party libraries
- Implement our own library on top of Unsafe
- Use Foreign Memory API

I tried to use Chronicle Bytes in the past with partial success, see https://github.com/apache/pinot/pull/9842. Given that I don't know more third party buffer libraries that use long offsets (Netty and Agrona use ints), I explored the last two alternatives in https://github.com/apache/pinot/pull/10528.

Foreign Memory API is clearly the future, as it provides exactly what we need: A high performance buffer API that is maintained (given that will be included in the JVM) and uses longs as offsets and sizes. But it has two issues:
- It is not included in Java 17
- It is still in preview mode and it will continue like that in Java 21 (see [this post where one of the author explains the reasons and the changes since Java 20](https://github.com/minborg/articles/tree/jep442/2023/March/22-jep442-FFM-Third-Preview)).

Therefore we have three options:
1. Wait until it is stable (possible in Java 22 or 23, which won't be LTS, so maybe we need to wait 2 years until Java 25).
2. Use it as a preview in Java 21. This would imply that we could either run with Java 11+LArray or Java 21+Foreign, but do not support Java 17. Also, in Java 21 we would need to compile and start with --enable-preview.
3. Create our own library.

Previously mentioned https://github.com/apache/pinot/pull/10528 does create two buffer implementations: One on top of Unsafe and another on top of Foreign Memory API. The latter is trivial to implement, but it is very difficult to include in the current Apache Pinot CI. The problems I found are:
- Some Maven plugins don't work in Java 21-ea (at least spotless doesn't work)
- GitHub Actions cannot be configured to use Java 21-ea (see https://github.com/actions/setup-java/issues/492#issuecomment-1551508335).
- In general, it is difficult to maintain in the same compilation unit code that has to be compiled with Java 11 and code that has to be compiled with Java 21
- Presto still requires to use Java 8, which makes it even more difficult.

In case we decide to use my draft in the actual implementation of the PEP, I would suggest to remove the Foreign Memory API implementation from the branch and optionally move it to another github repo.

I've also modified BenchmarkPinotDataBuffer in order to run with the Unsafe based implementation. The modifications are in the draft.
I've run the benchmark in a M1 Pro and in a Ryzen 9 3900X with Ubuntu 22.10. The exact results of the benchmark can be found [here](https://docs.google.com/spreadsheets/d/1xgxR6_mMwq2zhe8ZkpR5X4VKLfvwyTqT6qFZsMkh49E/edit?usp=sharing) but the following chars should be good enough:

Please note that LArray only runs in Java 11, so there are no data with LArray in Java 17 and 21.

This one shows the cost of executing batch writes (aka call PinotDataBuffer.readFrom) and batch read (aka call PinotDataBuffer.copyTo) with a byte array of 1024 elements (units are ns/op in Linux):
![Batch](https://github.com/apache/pinot/assets/1913993/2f5c22fd-a673-4410-ae5f-039f9147bdce)

It is important to note that the implementation used in both LArray and Unsafe when dealing with batch reads and writes is to create a temporal direct byte buffer that points to the same address of the current buffer, so we can assume that the performance difference is due to the new instance creation.

This one shows the cost of executing non batch writes (calling PinotDataBuffer.putByte()) and reads (calling PinotDataBuffer.getByte) in a loop with 1024 consecutive offsets starting from a random one (units are ns/op in Linux):
![No batch](https://github.com/apache/pinot/assets/1913993/1fe73241-8ead-42f0-b5d8-b4fae426af75)

In my opinion the latest is the most interesting, as it shows the improvements introduced in Java 17 and 21 JIT when dealing with loops.

I've also tried to add the Foreign Memory API implementation in the benchmark, but I found several problems so I gave up.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extensible PinotDataBuffer SPI #10783

What needs to be done?

Why the feature is needed

Initial idea/proposal

Actual buffer implementations

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Extensible PinotDataBuffer SPI #10783

Description

What needs to be done?

Why the feature is needed

Initial idea/proposal

Actual buffer implementations

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions