ARROW-12965: [Java] C Data Interface implementation #11067

roee88 · 2021-09-02T13:14:50Z

Experimental C Data Interface support in Arrow Java for 64bit systems.

Roundtrip example with dictionaries:

  void roundtrip(FieldVector vector, DictionaryProvider provider) {
    // Consumer allocates empty structures
    try (ArrowSchema consumerArrowSchema = ArrowSchema.allocateNew(allocator);
        ArrowArray consumerArrowArray = ArrowArray.allocateNew(allocator)) {

      // Producer creates structures from existing memory pointers
      try (ArrowSchema arrowSchema = ArrowSchema.wrap(consumerArrowSchema.memoryAddress());
          ArrowArray arrowArray = ArrowArray.wrap(consumerArrowArray.memoryAddress())) {
        // Producer exports vector into the C Data Interface structures
        Data.exportVector(allocator, vector, provider, arrowArray, arrowSchema);
      }

      // Consumer imports vector (and dictionaries)
      try (CDataDictionaryProvider dictionaries = new CDataDictionaryProvider();
          FieldVector imported = Data.importVector(allocator, consumerArrowArray, consumerArrowSchema, dictionaries)) {
        // Do something with the imported vector
      }
    }
  }

github-actions · 2021-09-02T13:15:10Z

https://issues.apache.org/jira/browse/ARROW-12965

github-actions · 2021-09-02T13:15:11Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jorgecarleitao · 2021-09-02T19:06:01Z

I skimmed through it and it looks good, thanks a lot for this!

What we did in Rust was to use the c data interface that C++ exposes in Python to make calls from within the process, see here.

I think it would be beneficial to add integration tests against pyarrow. In Rust we found a couple of memory leaks and double frees during development by testing against pyarrow / c++.

I would also set up an environment to run those tests under e.g. valgrind, since with FFI it is very easy to trigger UB.

roee88 · 2021-09-02T19:40:17Z

> I skimmed through it and it looks good, thanks a lot for this!
>
> What we did in Rust was to use the c data interface that C++ exposes in Python to make calls from within the process, see here.
>
> I think it would be beneficial to add integration tests against pyarrow. In Rust we found a couple of memory leaks and double frees during development by testing against pyarrow / c++.

Thanks Jorge. We have @tomersolomon1 working on integration tests against pyarrow as a follow-up to this PR. The approach that we are trying with integration tests is to use jpype (as suggested by @pitrou) and run the tests against the same datasets used in the IPC integration tests. It's too early to say if that makes sense. I think that eventually the same set of tests should be used for all languages.

> I would also set up an environment to run those tests under e.g. valgrind, since with FFI it is very easy to trigger UB.

I'm not familiar with using valgrind in Java but I will definitely check. FWIW the tests do fail in case of a memory leak for memory allocated with a buffer allocator (allocator#close() raises an exception if there are allocated bytes left).

Somewhat unrelated, but I know that you ran tests with valgrind for arrow2; did you also run them for arrow-rs? I thought that there is a memory leak in arrow-rs because no one has ported your export API fixes yet. I'm just curious whether my memory leak assumption is right.

liyafan82 · 2021-09-06T09:53:17Z

java/ffi/README.md

+install:
+ - Java 8 or later
+ - Maven 3.3 or later
+ - A C++ compiler


Do we have a requirement for the C++ compiler?

I updated it to "A C++11 enabled compiler". I don't know of any specific requirements and am unsure how to check. My setup has gcc 9.3.0.

liyafan82 · 2021-09-06T09:57:00Z

java/ffi/pom.xml

+    </parent>
+    <modelVersion>4.0.0</modelVersion>
+
+    <artifactId>arrow-ffi</artifactId>


sorry I am a little curious what does ffi stand for?

Foreign function interface

I have been considering whether abi would describe what happens there better than ffi. I am not sure, but in Java it seems that "foreign" might imply something related to Project Panama.

Another naming candidate is bridge (from the cpp implementation). I will change to whatever is chosen.

Personally I don't associate FFI with Project Panama; it was a well-known term long before Project Panama started. But I understand your point.

Yep, the naming doesn't seem to be blocking anyway. Though I was thinking that we might be interested in integrating functionality from Panama into Arrow once we decide to upgrade to a higher Java version (currently we are on 1.8). At that time it might be clearer to have a distinction between the ffi module and other parts using Panama FFI. Another point is that we already have other modules using JNI, which is also literally just a form of FFI. I think if we are introducing the C data interface we may want users to focus on data rather than functions. Anyway, I didn't want to block anything, just my 2 cents.

I also think that arrow-ffi may not be a good name. FFI seems too generic: if we use FFI, users may think that they can call any function that is not implemented in Java. (A general FFI library can do that.)
How about arrow-c-data and org.apache.arrow.c.Data?

liyafan82 · 2021-09-06T09:58:51Z

java/ffi/pom.xml

+        </dependency>
+        <dependency>
+            <groupId>org.apache.arrow</groupId>
+            <artifactId>arrow-memory-netty</artifactId>


Can we use arrow-memory-unsafe instead? We are in the process of reducing the dependency on netty.

is it necessary to have a compile time dependency on a specific implementation at all? I think netty is still the more mature allocator, but trying to keep as much code independent would be best.

@emkornfield it's a test dependency (test scope)

java/ffi/src/main/cpp/jni_wrapper.cc

pitrou · 2021-09-06T12:20:29Z

> The approach that we are trying with integration tests is to use jpype (as suggested by @pitrou) and run the tests against the same datasets used in the IPC integration tests. It's too early to say if that makes sense.

That sounds reasonable to me. You can start from hand-written data generation in PyArrow (or Java) if that's easier, though.

liyafan82 · 2021-09-07T02:41:13Z

java/ffi/src/main/java/org/apache/arrow/ffi/ArrayImporter.java

+    ArrowArray.Snapshot snapshot = src.snapshot();
+    checkState(snapshot.release != NULL, "Cannot import released ArrowArray");
+    recursionLevel = parent.recursionLevel + 1;
+    checkState(recursionLevel < MAX_IMPORT_RECURSION_LEVEL, "Recursion level in ArrowArray struct exceeded");


It should be <=?
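For reference, the two comparisons differ only in whether a nesting depth equal to the limit is accepted. A minimal sketch, assuming a hypothetical limit of 3 (the real value of MAX_IMPORT_RECURSION_LEVEL is not shown in this hunk) and hypothetical helper names:

```java
/** Illustration only; the class, method names and limit are hypothetical. */
final class RecursionLimitSketch {
  static final int MAX_LEVEL = 3; // hypothetical limit, not the value used in the PR

  /** Mirrors the check in the hunk: the level must be strictly below the limit. */
  static boolean acceptsStrict(int level) {
    return level < MAX_LEVEL;
  }

  /** The alternative raised in review: a level equal to the limit is also accepted. */
  static boolean acceptsInclusive(int level) {
    return level <= MAX_LEVEL;
  }

  public static void main(String[] args) {
    for (int level = 2; level <= 4; level++) {
      System.out.println(level + ": strict=" + acceptsStrict(level)
          + " inclusive=" + acceptsInclusive(level));
    }
  }
}
```

Which variant is right depends on whether the constant is meant as the deepest allowed level or as the first rejected one; the hunk alone does not say.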

liyafan82 · 2021-09-07T02:58:56Z

java/ffi/src/main/java/org/apache/arrow/ffi/ArrowArray.java

+ * </pre>
+ */
+public class ArrowArray implements BaseStruct {
+  private static final int SIZE_OF = 80;


The size is 80 only on 64-bit systems?

Yes.

The PR message lists some questions and I would really appreciate it if someone could answer them explicitly. Support for 32bit is one of the questions. I wasn't sure if arrow targets it. If required in this PR then I will make the required changes.

Thanks for the clarification. Maybe we need some comment here?

Also, for sanity the import/export functions should error out on 32-bit systems (I assume this shouldn't be difficult to do?).
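To make the 64-bit assumption concrete, here is a rough sketch of where the 80 comes from and of what a best-effort 32-bit guard could look like. The field counts follow the structure definitions in the C Data Interface spec; the class and method names are hypothetical, padding is ignored, and `sun.arch.data.model` is a JVM-specific property that may be absent, so this is an illustration rather than what the PR does.

```java
/** Illustration only; names are hypothetical and ABI-specific padding is ignored. */
final class CDataLayoutSketch {
  private CDataLayoutSketch() {}

  /** struct ArrowArray: 5 int64_t fields followed by 5 pointer-sized fields. */
  static int arrowArraySize(int pointerBytes) {
    return 5 * Long.BYTES + 5 * pointerBytes;
  }

  /** struct ArrowSchema: 2 int64_t fields and 7 pointer-sized fields. */
  static int arrowSchemaSize(int pointerBytes) {
    return 2 * Long.BYTES + 7 * pointerBytes;
  }

  /** Best effort: true only when the JVM explicitly reports a 32-bit data model. */
  static boolean reports32Bit() {
    return "32".equals(System.getProperty("sun.arch.data.model"));
  }

  /** Hypothetical guard that import/export entry points could call first. */
  static void checkSupported() {
    if (reports32Bit()) {
      throw new UnsupportedOperationException("C Data Interface structs assume 64-bit pointers");
    }
  }
}
```

With 8-byte pointers this gives 80 for ArrowArray, matching SIZE_OF above, and 72 for ArrowSchema; with 4-byte pointers the unpadded ArrowArray total would be 60.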

liyafan82 · 2021-09-07T03:33:38Z

java/ffi/src/main/java/org/apache/arrow/ffi/ArrowSchema.java

+   * @return A new ArrowSchema instance
+   */
+  public static ArrowSchema allocateNew(BufferAllocator allocator) {
+    return new ArrowSchema(allocator.buffer(ArrowSchema.SIZE_OF));


Do we need to zero-fill the newly allocated buffer?

@pitrou any guidelines here? I think that at least the release callback should be zeroed for correctness in case of failure.

I don't understand. Is there a finalizer here?

This is a newly allocated ArrowSchema on the consumer side. In the ImportField/ImportSchema methods it is released (similar to the cpp implementation).

The question is whether the consumer should zero all fields before the producer fills this structure. My assumption was that it's not a requirement.

However, for the case of a failed producer it might be better to zero the release field before the producer fills it, to avoid a crash on release at the consumer side.

If the producer returns an error, then the consumer shouldn't release the half-filled ArrowSchema. So you shouldn't worry about this case.
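As an illustration of the defensive option discussed above (not something the spec requires), a consumer could null out the release slot before handing the struct to a producer, so that a later "is this released?" check is well defined. The sketch below uses a plain direct ByteBuffer instead of an Arrow allocator; the class name is hypothetical, and the offset of 56 assumes the 64-bit layout of struct ArrowSchema (three pointers, two int64_t fields and two more pointers precede the release pointer).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustration only; a ByteBuffer stands in for the natively allocated struct. */
final class SchemaStructSketch {
  static final int SIZE_OF = 72;        // 64-bit layout of struct ArrowSchema
  static final int RELEASE_OFFSET = 56; // position of the release callback pointer

  /** Allocates the struct memory and explicitly marks it as "released" (release == NULL). */
  static ByteBuffer allocateWithNullRelease() {
    ByteBuffer struct = ByteBuffer.allocateDirect(SIZE_OF).order(ByteOrder.nativeOrder());
    // Direct ByteBuffers are already zero-filled; the write is kept to show the
    // step an allocator that does not zero memory would need.
    struct.putLong(RELEASE_OFFSET, 0L);
    return struct;
  }

  /** Per the spec, a NULL release callback marks a released (or never filled) struct. */
  static boolean isReleased(ByteBuffer struct) {
    return struct.getLong(RELEASE_OFFSET) == 0L;
  }
}
```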

liyafan82 · 2021-09-07T06:39:01Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIDictionaryProvider.java

+
+  void put(Dictionary dictionary) {
+    Dictionary previous = map.put(dictionary.getEncoding().getId(), dictionary);
+    if (previous != null) {


The behavior is a little different from the C++ API?

  /// \brief Add a dictionary to the memo with a particular id. Returns
  /// KeyError if that dictionary already exists
  Status AddDictionary(int64_t id, const std::shared_ptr<ArrayData>& dictionary);

Dictionaries in Java are implemented very differently from the cpp implementation. The approach we took (after struggling with alternatives) is to have this FFIDictionaryProvider act as a reusable owner of imported dictionaries. That is, you can create it once and import ArrowArray instances into it. When importing batches, it is actually expected that the same IDs are imported multiple times (possibly with slightly different values due to deltas, I believe, but I am unsure).

See the mailing list thread for background https://lists.apache.org/x/thread.html/rd2aecfe5ad71a6f81240ed5c6f706b1a6b2f4a95b8dd5db515e5fceb@%3Cdev.arrow.apache.org%3E
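The "reusable owner" behaviour described above can be sketched roughly as follows. This is a simplified stand-in with hypothetical names, not the class from the PR: Entry replaces the real dictionary and its vector, and closing the replaced entry is an assumption about what happens in the `previous != null` branch of the hunk.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only; Entry stands in for an imported dictionary and its vector. */
final class ReusableDictionaryStore implements AutoCloseable {
  static final class Entry implements AutoCloseable {
    final long id;
    final String values; // placeholder for the dictionary vector
    boolean closed;

    Entry(long id, String values) {
      this.id = id;
      this.values = values;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  private final Map<Long, Entry> map = new HashMap<>();

  /** Re-importing an id replaces the previous entry instead of raising an error. */
  void put(Entry entry) {
    Entry previous = map.put(entry.id, entry);
    if (previous != null) {
      previous.close(); // assumption: the replaced dictionary is released here
    }
  }

  Entry lookup(long id) {
    return map.get(id);
  }

  /** Closing the store releases every dictionary it still owns. */
  @Override
  public void close() {
    for (Entry entry : map.values()) {
      entry.close();
    }
    map.clear();
  }
}
```

The contrast with the C++ memo is the put path: a second import of the same id is treated as an update rather than a KeyError.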

liyafan82 · 2021-09-07T06:47:28Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIReferenceManager.java

+      }
+    }
+    // the new ref count should be >= 0
+    Preconditions.checkState(refCnt >= 0, "RefCnt has gone negative");


This check can be moved forward?

Done. Please check if that's what you meant.

liyafan82 · 2021-09-07T06:48:49Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIReferenceManager.java

+    Preconditions.checkState(decrement >= 1, "ref count decrement should be greater than or equal to 1");
+    // decrement the ref count
+    final int refCnt;
+    synchronized (this) {


For better performance, can the scope of this lock be reduced?
Do only the following statements need the lock?

  struct.release();
  struct.close();

liyafan82 · 2021-09-07T06:51:17Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIReferenceManager.java

+
+  @Override
+  public ArrowBuf deriveBuffer(ArrowBuf sourceBuffer, long index, long length) {
+    final long derivedBufferAddress = sourceBuffer.memoryAddress() + index;


Can we directly use ArrowBuf#slice(long index, long length)?

It seems like slice is using deriveBuffer so it's not possible

liyafan82 · 2021-09-07T08:57:29Z

java/ffi/src/main/java/org/apache/arrow/ffi/Format.java

+
+  static String asString(ArrowType arrowType) {
+    if (arrowType instanceof ExtensionType) {
+      arrowType = ((ExtensionType) arrowType).storageType();


Do we need a recursive call here, in case the storage type of the extension type is another extension type?
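If nested extension types do need to be handled (the thread does not settle whether Arrow Java allows them), a loop is enough to reach the innermost storage type. The sketch below uses hypothetical stand-in types rather than Arrow's ArrowType classes:

```java
/** Illustration only; these types are hypothetical stand-ins, not Arrow's ArrowType classes. */
final class ExtensionUnwrapSketch {
  interface Type {}

  static final class Primitive implements Type {
    final String format;

    Primitive(String format) {
      this.format = format;
    }
  }

  static final class Extension implements Type {
    final Type storage;

    Extension(Type storage) {
      this.storage = storage;
    }
  }

  /** Follows storage types until a non-extension type is reached. */
  static Type unwrap(Type type) {
    while (type instanceof Extension) {
      type = ((Extension) type).storage;
    }
    return type;
  }

  /** Returns the format string of the innermost storage type. */
  static String asString(Type type) {
    Type storage = unwrap(type);
    if (storage instanceof Primitive) {
      return ((Primitive) storage).format;
    }
    throw new UnsupportedOperationException("Unsupported type " + storage);
  }
}
```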

zhztheplayer · 2021-09-07T06:08:55Z

java/ffi/src/main/cpp/jni_wrapper.cc

+
+  jint JNI_VERSION = JNI_VERSION_1_6;
+
+  class JniPendingException : public std::runtime_error


I was getting the error "arrow/java/ffi/src/main/cpp/jni_wrapper.cc:49:78: error: expected class-name before ‘(’ token" until explicitly including stdexcept in this file (gcc 10.1.1).

zhztheplayer · 2021-09-07T06:18:37Z

java/ffi/src/main/cpp/abi.h

+#define ARROW_FLAG_NULLABLE 2
+#define ARROW_FLAG_MAP_KEYS_SORTED 4
+
+struct ArrowSchema {


Do we have to duplicate this file?

It's fine to duplicate or copy these declarations around. They are in the spec:
https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions

zhztheplayer · 2021-09-07T06:40:46Z

java/ffi/README.md

+```
+mkdir -p ./target/build/
+pushd ./target/build/
+cmake ../..
+make
+popd
+```


Similar to the previous comment... I am worried that we are introducing more complexity just to keep the ffi cpp code independent. Would it be better to somehow reorganize the related source files into the cpp folder (maybe under jni/)? We can still include ffi's own lib in its jar file during mvn install.

This is one of the questions in the PR message. I am looking for guidance on where this code should be, should it be enabled by default and if not which flag should be used, should it be built as part of arrow cpp or independently.

This is entirely independent from Arrow C++, so it should live in the Arrow Java source tree.

I would like to integrate it into the CI. Any guidelines on whether this should be behind a feature flag or not and whether the flag should be enabled by default?

> This is entirely independent from Arrow C++, so it should live in the Arrow Java source tree.

@pitrou I believe (after trying) that this adds a lot of complexity that is already solved in the cpp side of the project (cross-platform builds, linting, packaging, etc.). I would like to propose adopting the suggestion by @zhztheplayer to move it to cpp/jni.

I'm not sure how you intend to do it exactly, but please keep in mind that the whole point of the C data interface is that you don't need Arrow C++ to use it. So even if this were in the Arrow C++ source tree, it should be completely standalone and buildable independently.

zhztheplayer · 2021-09-07T06:53:57Z

java/ffi/README.md

+
+```
+cd java
+mvn -Parrow-ffi install


So a clean build is not allowed here, right? We already put all cpp build files in target/.

I changed it to be a separate build directory outside target so it should be fine now. I had to add a .gitignore file with that folder otherwise the license check fails.

zhztheplayer · 2021-09-07T09:43:39Z

java/ffi/src/main/cpp/jni_wrapper.cc

+    if (private_data->vm_->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION) != JNI_OK) {
+      ThrowPendingException("JNIEnv was not attached to current thread");
+    }


Would it be better to use vm_->AttachCurrentThread instead? As the exported data might be imported and released from other threads that are not attached to JVM. We may allow this GetEnv routine in other modules but that was because in those modules we don't have to deal with native-created threads.

https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#AttachCurrentThread

Done by adding a new JNIEnvGuard class.

zhztheplayer · 2021-09-07T10:23:47Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIReferenceManager.java

+  @Override
+  public ArrowBuf retain(ArrowBuf srcBuffer, BufferAllocator targetAllocator) {
+    retain();
+    return srcBuffer;


Did we make a trade-off between returning the source buffer or creating a new buffer here? I am not sure which way is better but it seems that in this way the source buffer will be shared to the target vector along with its internal states (read / write indexes) during buffer loading, although an allocator was assigned to that vector.

Also, here we may fail to register any buffer to the targetAllocator. But maybe it doesn't really hurt and would be too tricky to improve.

> Also, here we may fail to register any buffer to the targetAllocator. But maybe it doesn't really hurt and would be too tricky to improve.

I derived a new ArrowBuf but I can't think of any way to associate it with the given targetAllocator given that this is foreign memory and given the release management in the C Data Interface. FYI this was originally copied from the ORC implementation in arrow.

Fair enough to me. We don't have to associate it to the targetAllocator here.

liyafan82 · 2021-09-09T06:44:14Z

java/ffi/src/main/java/org/apache/arrow/ffi/NativeUtil.java

+   * @return Array of pointer values as longs
+   */
+  public static long[] toJavaArray(long arrayPtr, int size) {
+    if (size == 0 || arrayPtr == NULL) {


nit: please note that an empty array and a null are different things

pitrou · 2021-09-09T14:49:47Z

For the record, here's an example of testing against PyArrow in the Go PR:
https://github.com/apache/arrow/pull/11037/files#diff-7823a4f3ee456fdcc402770f66d886ce9946f4fd8ce465d32921fae9ee92dfe1

liyafan82 · 2021-09-10T02:51:53Z

java/ffi/src/main/java/org/apache/arrow/ffi/FFIReferenceManager.java

+    if (refCnt == 0) {
+      // refcount of this reference manager has dropped to 0
+      // release the underlying memory
+      synchronized (this) {


maybe we need to check refCnt == 0 again here?

Given that the counter is atomic, in what scenario is it needed?

I believe that if the counter is decremented to zero then it must be released. Incrementing the counter after (or during) shouldn't change the decision to release. I saw no special code for handling the increment-after-release case in other reference managers so I assume it's not possible by design. However, if desired I can add it just in case.

> maybe we need to check refCnt == 0 again here?

Literally speaking, refCnt is a local variable; it doesn't change.

I think even if we check using bufRefCnt.get(), that is still not enough. We don't lock writes from other places, so the value might happen to go from 0 to 1 to 0 during this block. That is an ABA problem.

@roee88 BufferLedger has a check at line 185 and it seems that we don't have it here

arrow/java/memory/memory-core/src/main/java/org/apache/arrow/memory/BufferLedger.java

Lines 178 to 186 in 3bbec3f

  @Override
  public void retain(int increment) {
    Preconditions.checkArgument(increment > 0, "retain(%s) argument is not positive", increment);
    if (BaseAllocator.DEBUG) {
      historicalLog.recordEvent("retain(%d)", increment);
    }
    final int originalReferenceCount = bufRefCnt.getAndAdd(increment);
    Preconditions.checkArgument(originalReferenceCount > 0);
  }

Since release(int decrement) and retain(int increment) are both public, I think one could call retain after calling release manually.

Thanks.

@zhztheplayer can you please have a quick look at the OrcReferenceManager implementation to see if a ticket needs to be opened for a fix there.

Also we may be able to move all the checks to a debug option (as well as in BufferLedger)? I am not sure.

If by "check" you mean it writes a nice error message, then that sounds reasonable indeed.

Yes. Error message is not created here. We may need some.

I'll add a message. Can you check if the current implementation is correct? I think that we can omit the synchronized block.

It now looks correct to me. We don't seem to need the synchronization block as long as the block is entered only once. Although in my opinion people can catch exceptions from #retain on their own. That's a rare case, so we either remove the sync block or expand it to the same scope as in 'BufferLedger'.
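To summarize the outcome of this thread in code form, here is a minimal sketch of a lock-free reference count with a retain-after-release check. It is not the FFIReferenceManager from the PR: the class name and the Runnable callback are hypothetical, the count starts at 1 for simplicity, and retain uses a compare-and-set loop (instead of the getAndAdd shown in BufferLedger) so that a failed retain does not modify the counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only; not the FFIReferenceManager from this PR. */
final class RefCountSketch {
  private final AtomicInteger refCnt = new AtomicInteger(1); // the creator holds one reference
  private final Runnable onRelease; // hypothetical stand-in for struct.release(); struct.close();

  RefCountSketch(Runnable onRelease) {
    this.onRelease = onRelease;
  }

  /** Adds references, refusing to revive a manager whose count already reached zero. */
  void retain(int increment) {
    if (increment < 1) {
      throw new IllegalArgumentException("retain(" + increment + ") argument is not positive");
    }
    int current;
    do {
      current = refCnt.get();
      if (current <= 0) {
        throw new IllegalStateException("retain called after the last release");
      }
    } while (!refCnt.compareAndSet(current, current + increment));
  }

  /** Drops references; returns true if this call released the underlying memory. */
  boolean release(int decrement) {
    if (decrement < 1) {
      throw new IllegalArgumentException("ref count decrement should be greater than or equal to 1");
    }
    int remaining = refCnt.addAndGet(-decrement);
    if (remaining < 0) {
      throw new IllegalStateException("RefCnt has gone negative");
    }
    if (remaining == 0) {
      onRelease.run(); // only the call that observes exactly zero gets here
      return true;
    }
    return false;
  }

  int getRefCount() {
    return refCnt.get();
  }
}
```

Because retain can never move the counter away from zero, the zero transition is observed by exactly one release call, which is why no synchronized block is needed around the callback in this sketch.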

emkornfield · 2021-10-10T21:56:31Z

Is this PR waiting for review or are updates needed from the author?

roee88 · 2021-10-11T05:54:36Z

@emkornfield We already addressed all of the review comments provided so far, but this PR was waiting for CI integration for unit testing and jar packaging. These were added in the last commit so a review of that part is required.

Signed-off-by: roee88 <roee88@gmail.com>

* Add testing of the ffi module to the JNI tests
* Add packaging of ffi module to java jars packaging

Signed-off-by: roee88 <roee88@gmail.com>
Co-authored-by: Doron Chen <CDORON@il.ibm.com>

Signed-off-by: roee88 <roee88@gmail.com>

kou · 2021-10-11T20:02:05Z

@github-actions crossbow submit java-jars

github-actions · 2021-10-11T20:03:00Z

Revision: 99d7556

Submitted crossbow builds: ursacomputing/crossbow @ actions-913

Task	Status
java-jars

kou · 2021-10-11T20:24:45Z

java/ffi/pom.xml

+    </parent>
+    <modelVersion>4.0.0</modelVersion>
+
+    <artifactId>arrow-ffi</artifactId>


I also think that arrow-ffi may not be a good name. FFI seems too generic: if we use FFI, users may think that they can call any function that is not implemented in Java. (A general FFI library can do that.)
How about arrow-c-data and org.apache.arrow.c.Data?

kou · 2021-10-11T20:28:05Z

ci/scripts/java_build.sh

 fi

+if [ "${ARROW_JAVA_FFI}" = "ON" ]; then
+  ${mvn} -Darrow.ffi.cpp.build.dir=${ffi_build_dir} -Parrow-ffi install


cpp makes me think of the C++ implementation.
How about jni instead of cpp such as -Darrow.ffi.jni.build.dir?

With the previous suggestion about also renaming the package to c and the class to Data, I thought that this property should now be called arrow.c.jni.dist.dir. (arrow.c to be similar to arrow.vector.* and arrow.memory.* properties that start with the package/module name and jni.dist.dir to indicate that this is the directory with the shared libs). Is that OK?

Ah, c may be mistaken for the C GLib implementation.
Generally, we use "GLib" or "C GLib" for the C GLib implementation. Most people will not think of the C GLib implementation when they see c here.
And c means Arrow's C data interface here. (It doesn't mean that the C language is used for the implementation.) It's natural that we use c here.
So I'm OK with arrow.c.jni.dist.dir.

Package name: org.apache.arrow.c

Class name: Data

Maven profile name: arrow-c-data

Shared library name: libarrow_cdata_jni

Maven property for dir with the shared library: arrow.c.jni.dist.dir

CI script for building the shared library: java_cdata_build.sh

CI flag to enable building the shared library: ARROW_JAVA_CDATA=ON

Anything that should be different before I make the changes?

I like the names.

Changed accordingly

kou · 2021-10-11T20:31:52Z

docker-compose.yml

+        /arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&
        /arrow/ci/scripts/cpp_build.sh /arrow /build &&


Suggested change

-        /arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&
-        /arrow/ci/scripts/cpp_build.sh /arrow /build &&
+        /arrow/ci/scripts/cpp_build.sh /arrow /build &&
+        /arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&

kou · 2021-10-11T20:33:34Z

java/ffi/CMakeLists.txt

+project(arrow_ffi_java)
+
+# Find java/jni
+include(FindJava)


Do we need this?
I think that find_package(Java) reads the file.

kou · 2021-10-11T20:33:43Z

java/ffi/CMakeLists.txt

+# Find java/jni
+include(FindJava)
+include(UseJava)
+include(FindJNI)


Do we need this?
I think that find_package(JNI) reads the file.

kou · 2021-10-11T20:35:48Z

java/ffi/CMakeLists.txt

+target_link_libraries(arrow_ffi_jni ${JAVA_JVM_LIBRARY})
+add_dependencies(arrow_ffi_jni ${PROJECT_NAME})
+
+install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR})


I think that ${CMAKE_INSTALL_PREFIX}/ is needless here:

Suggested change

-install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR})
+install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_LIBDIR})

kou · 2021-10-11T20:37:05Z

java/ffi/README.md

+mkdir -p build
+pushd build
+cmake ..
+make


We can use cmake --build . here. It doesn't depend on make or ninja.

Signed-off-by: roee88 <roee88@gmail.com>

* Package name: org.apache.arrow.c
* Class name: Data
* Maven profile name: arrow-c-data
* Shared library name: libarrow_cdata_jni
* Maven property for dir with the shared library: arrow.c.jni.dist.dir
* CI script for building the shared library: java_cdata_build.sh
* CI flag to enable building the shared library: ARROW_JAVA_CDATA=ON

Signed-off-by: roee88 <roee88@gmail.com>

Signed-off-by: roee88 <roee88@gmail.com>

roee88 · 2021-10-12T11:44:56Z

@kou can you please crossbow submit java-jars again?

kou · 2021-10-12T12:10:01Z

@github-actions crossbow submit java-jars

github-actions · 2021-10-12T12:10:58Z

Revision: 4fbb16f

Submitted crossbow builds: ursacomputing/crossbow @ actions-918

Task	Status
java-jars

kou

+1

CI failures are unrelated.

kou · 2021-10-12T19:58:50Z

@roee88 Could you update the description of this pull request for the latest code? Our merge tool uses the description as the commit message. I'll merge this after you update the description.

roee88 · 2021-10-12T21:02:10Z

java/c/src/main/java/org/apache/arrow/c/Data.java

+  public static VectorSchemaRoot importVectorSchemaRoot(BufferAllocator allocator, ArrowSchema schema, ArrowArray array,
+      CDataDictionaryProvider provider) {


In all other methods the array parameter is before the schema parameter. In this method it's different and the original reason is that array is optional. However, while writing code that uses it I realized that it's easy to get it wrong due to the inconsistency. @kou is it okay if I swap it for consistency?

Sure!
Please ping me when you think that this pull request is ready to merge. I'll merge this.

@kou
Ready. Thanks 👍

Signed-off-by: Doron Chen <cdoron@il.ibm.com>
Co-authored-by: CDORON@il.ibm.com <cdoron@lnx-arrow.sl.cloud9.ibm.com>

kou · 2021-10-13T07:01:29Z

@github-actions crossbow submit java-jars

github-actions · 2021-10-13T07:02:22Z

Revision: bff6410

Submitted crossbow builds: ursacomputing/crossbow @ actions-924

Task	Status
java-jars

ursabot · 2021-10-13T20:02:18Z

Benchmark runs are scheduled for baseline = e2b1dd9 and contender = e379ee1. e379ee1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.68% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.54% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Experimental C Data Interface support in Arrow Java for 64bit systems.

Roundtrip example with dictionaries:

```java
void roundtrip(FieldVector vector, DictionaryProvider provider) {
  // Consumer allocates empty structures
  try (ArrowSchema consumerArrowSchema = ArrowSchema.allocateNew(allocator);
      ArrowArray consumerArrowArray = ArrowArray.allocateNew(allocator)) {

    // Producer creates structures from existing memory pointers
    try (ArrowSchema arrowSchema = ArrowSchema.wrap(consumerArrowSchema.memoryAddress());
        ArrowArray arrowArray = ArrowArray.wrap(consumerArrowArray.memoryAddress())) {
      // Producer exports vector into the C Data Interface structures
      Data.exportVector(allocator, vector, provider, arrowArray, arrowSchema);
    }

    // Consumer imports vector (and dictionaries)
    try (CDataDictionaryProvider dictionaries = new CDataDictionaryProvider();
        FieldVector imported = Data.importVector(allocator, consumerArrowArray, consumerArrowSchema, dictionaries)) {
      // Do something with the imported vector
    }
  }
}
```

Closes apache#11067 from roee88/java-c-data-interface

Lead-authored-by: Roee Shlomo <roee88@gmail.com>
Co-authored-by: roee88 <roee88@gmail.com>
Co-authored-by: Doron Chen <cdoron@il.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

github-actions bot added the Component: Java label Sep 2, 2021

roee88 force-pushed the java-c-data-interface branch from cd8db17 to 09aee82 Compare September 2, 2021 20:04

pitrou requested review from emkornfield and liyafan82 September 2, 2021 20:09

roee88 force-pushed the java-c-data-interface branch from 7de8004 to 3fed9e0 Compare September 2, 2021 20:23

liyafan82 reviewed Sep 6, 2021

liyafan82 reviewed Sep 7, 2021

zhztheplayer reviewed Sep 7, 2021

liyafan82 reviewed Sep 9, 2021

liyafan82 reviewed Sep 10, 2021

roee88 force-pushed the java-c-data-interface branch from a787a54 to cca1a84 Compare September 29, 2021 12:55

pitrou requested a review from kou October 11, 2021 13:42

roee88 and others added 6 commits October 11, 2021 21:34

Removed redundant synchronized

dbab35e

Signed-off-by: roee88 <roee88@gmail.com>

Added comment about lack of support for 32bit systems

c23e1c4

Signed-off-by: roee88 <roee88@gmail.com>

fix: StructVector with inner complex type

97db8ea

Signed-off-by: roee88 <roee88@gmail.com>

Improve retain after release check

a308128

Signed-off-by: roee88 <roee88@gmail.com>

Java c data interface CI testing packaging

dee4fbd

* Add testing of the ffi module to the JNI tests
* Add packaging of ffi module to java jars packaging

Signed-off-by: roee88 <roee88@gmail.com>
Co-authored-by: Doron Chen <CDORON@il.ibm.com>

Code style

99d7556

Signed-off-by: roee88 <roee88@gmail.com>

roee88 force-pushed the java-c-data-interface branch from 69016b3 to 99d7556 Compare October 11, 2021 18:38

kou reviewed Oct 11, 2021

roee88 added 5 commits October 12, 2021 08:33

Fix java_full_build script

0f0ac3c

Signed-off-by: roee88 <roee88@gmail.com>

Removed redundant lines in CMakeLists.txt

3345641

Signed-off-by: roee88 <roee88@gmail.com>

Add best effort to error on 32-bit systems

9412dab

Signed-off-by: roee88 <roee88@gmail.com>

code style: missing end of line

4fbb16f

Signed-off-by: roee88 <roee88@gmail.com>

jimexist mentioned this pull request Oct 12, 2021

add a Java wrapper for datafusion apache/datafusion#1108

Closed

kou approved these changes Oct 12, 2021

roee88 commented Oct 12, 2021

swap order of ArrowArray and ArrowSchema parameters (#16)

bff6410

Signed-off-by: Doron Chen <cdoron@il.ibm.com>
Co-authored-by: CDORON@il.ibm.com <cdoron@lnx-arrow.sl.cloud9.ibm.com>

kou closed this in e379ee1 Oct 13, 2021

roee88 deleted the java-c-data-interface branch December 5, 2021 21:23

asfimport mentioned this pull request Jun 27, 2022

[Java] Java implementation of Arrow C data interface #28685

Closed

