Skip to content

Conversation

@roee88
Copy link
Contributor

@roee88 roee88 commented Sep 2, 2021

Experimental C Data Interface support in Arrow Java for 64bit systems.

Roundtrip example with dictionaries:

  void roundtrip(FieldVector vector, DictionaryProvider provider) {
    // Consumer allocates empty structures
    try (ArrowSchema consumerArrowSchema = ArrowSchema.allocateNew(allocator);
        ArrowArray consumerArrowArray = ArrowArray.allocateNew(allocator)) {

      // Producer creates structures from existing memory pointers
      try (ArrowSchema arrowSchema = ArrowSchema.wrap(consumerArrowSchema.memoryAddress());
          ArrowArray arrowArray = ArrowArray.wrap(consumerArrowArray.memoryAddress())) {
        // Producer exports vector into the C Data Interface structures
        Data.exportVector(allocator, vector, provider, arrowArray, arrowSchema);
      }

      // Consumer imports vector (and dictionaries)
      try (CDataDictionaryProvider dictionaries = new CDataDictionaryProvider();
          FieldVector imported = Data.importVector(allocator, consumerArrowArray, consumerArrowSchema, dictionaries)) {
          // Do something with the imported vector
      }
    }
  }

@github-actions
Copy link

github-actions bot commented Sep 2, 2021

@github-actions
Copy link

github-actions bot commented Sep 2, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jorgecarleitao
Copy link
Member

I skimmed through it and it looks good, thanks a lot for this!

What we did in Rust was to use the c data interface that C++ exposes in Python to make calls from within the process, see here.

I think it would be beneficial to add integration tests against pyarrow. In Rust we found a couple of memory leaks and double frees during development by testing against pyarrow / c++.

I would also setup an environment to run those tests against e.g. valgrind, since in FFI is very easy to trigger UB.

@roee88
Copy link
Contributor Author

roee88 commented Sep 2, 2021

I skimmed through it and it looks good, thanks a lot for this!

What we did in Rust was to use the c data interface that C++ exposes in Python to make calls from within the process, see here.

I think it would be beneficial to add integration tests against pyarrow. In Rust we found a couple of memory leaks and double frees during development by testing against pyarrow / c++.

Thanks Jorge. We have @tomersolomon1 working on integration tests against pyarrow as a follow-up to this PR. The approach that we are trying with integration tests is to use jpype (as suggested by @pitrou) and run the tests against the same datasets used in the IPC integration tests. It's too early to say if that makes sense. I think that eventually the same set of tests should be used for all languages.

I would also setup an environment to run those tests against e.g. valgrind, since in FFI is very easy to trigger UB.

I'm not familiar with using valgrind in Java but I will definitely check. FWIW the tests do fail in case of a memory leak for memory allocated with a buffer allocator (allocator#close() raises an exception if there are allocated bytes left).

Somewhat unrelated but I know that you ran tests with valgrind for arrow2, did you also ran it for arrow-rs? I thought that there is a memory leak in arrow-rs because no one ported your export API fixes yet. Just out of curiosity about my memory leak assumption.

@roee88 roee88 force-pushed the java-c-data-interface branch from cd8db17 to 09aee82 Compare September 2, 2021 20:04
@roee88 roee88 force-pushed the java-c-data-interface branch from 7de8004 to 3fed9e0 Compare September 2, 2021 20:23
install:
- Java 8 or later
- Maven 3.3 or later
- A C++ compiler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a requirement for the C++ compiler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to A C++11 enabled compiler. I don't know of any specific requirements and unsure how to check. My setup has gcc 9.3.0.

java/ffi/pom.xml Outdated
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>arrow-ffi</artifactId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I am a little curious what does ffi stand for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been considering if abi could be able to describe what happens there better than ffi. I am not sure but in Java it seems that when it comes to foreign it might implie something related to project panama

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another naming candidate is bridge (from the cpp implementation). I will change to whatever is chosen.

Personally I don't associate FFI with project panama, it's a well known term way before project panama started. But I understand your point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep the naming doesn't seem to be blocking anyway. Though I was thinking that we might be interested in integrating functionalities from Panama into Arrow once we decide to upgrade to higher Java version (currently we are on 1.8). At that time it might be clearer to have distinction between the module ffi or other parts using Panama FFI. Another point is that we already have other modules using JNI which is also literally just a form of FFI. I think if we are introducing C data interface we may want users aim to data rather than function. Anyway, didn't want to block anything, just my 2 cents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that arrow-ffi may not be good. It seems that FFI is too generic. If we use FFI, users may think that they can call any functions implemented in not Java. (General FFI library can do them.)
How about arrow-c-data and org.apache.arrow.c.Data?

java/ffi/pom.xml Outdated
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-memory-netty</artifactId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use arrow-memory-unsafe instead? as we are in the process of reducing the dependency on netty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to have a compile time dependency on a specific implementation at all? I think netty is still the more mature allocator, but trying to keep as much code independent would be best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield it's a test dependency (test scope)

@pitrou
Copy link
Member

pitrou commented Sep 6, 2021

The approach that we are trying with integration tests is to use jpype (as suggested by @pitrou) and run the tests against the same datasets used in the IPC integration tests. It's too early to say if that makes sense.

That sounds reasonable to me. You can start from hand-written data generation in PyArrow (or Java) if that's easier, though.

ArrowArray.Snapshot snapshot = src.snapshot();
checkState(snapshot.release != NULL, "Cannot import released ArrowArray");
recursionLevel = parent.recursionLevel + 1;
checkState(recursionLevel < MAX_IMPORT_RECURSION_LEVEL, "Recursion level in ArrowArray struct exceeded");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be <=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* </pre>
*/
public class ArrowArray implements BaseStruct {
private static final int SIZE_OF = 80;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size is 80 only in 64-bit system?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

The PR message lists some questions and I would really appreciate it if someone could answer them explicitly. Support for 32bit is one of the questions. I wasn't sure if arrow targets it. If required in this PR then I will make the required changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Maybe we need some comment here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for sanity the import/export functions should error out on 32-bit systems (I assume this shouldn't be difficult to do?).

* @return A new ArrowSchema instance
*/
public static ArrowSchema allocateNew(BufferAllocator allocator) {
return new ArrowSchema(allocator.buffer(ArrowSchema.SIZE_OF));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to fill zeros to the newly allocated buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou any guidelines here? I think that at least the release callback should be zeroed for correctness in case of failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Is there a finalizer here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a newly allocated ArrowSchema at consumer side. On the ImportField/ImportSchema methods it is released (similar to the cpp implementation).

The question is whether the consumer should zero all fields before the producer fills this structure. My assumption was that it's not a requirement.

However, for the case of a failed producer it might be better to zero the release field before the producer fills it to avoid crash on release at consumer side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the producer returns an error, then the consumer shouldn't release the half-filled ArrowSchema. So you shouldn't worry about this case.


void put(Dictionary dictionary) {
Dictionary previous = map.put(dictionary.getEncoding().getId(), dictionary);
if (previous != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is a little different from the C++ API?

  /// \brief Add a dictionary to the memo with a particular id. Returns
  /// KeyError if that dictionary already exists
  Status AddDictionary(int64_t id, const std::shared_ptr<ArrayData>& dictionary);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionaries in Java are implemented very differently than the cpp implementation. The approach we took (after struggling with alternatives) is having this FFIDictionatyProvider as a reusable owner of imported dictionaries. That is, you can create it once and import ArrowArray instances to it. On import of batches it is actually expected to import the same IDs multiple times (possibly with slightly different values due to deltas I believe but unsure).

See the mailing list thread for background https://lists.apache.org/x/thread.html/rd2aecfe5ad71a6f81240ed5c6f706b1a6b2f4a95b8dd5db515e5fceb@%3Cdev.arrow.apache.org%3E

}
}
// the new ref count should be >= 0
Preconditions.checkState(refCnt >= 0, "RefCnt has gone negative");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check can be moved forward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Please check if that's what you meant.

Preconditions.checkState(decrement >= 1, "ref count decrement should be greater than or equal to 1");
// decrement the ref count
final int refCnt;
synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better performance, the scope of this lock can be reduced?
Only the following statements need the lock?

        struct.release();
        struct.close();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


@Override
public ArrowBuf deriveBuffer(ArrowBuf sourceBuffer, long index, long length) {
final long derivedBufferAddress = sourceBuffer.memoryAddress() + index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we directly use ArrowBuf#slice(long index, long length)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like slice is using deriveBuffer so it's not possible


static String asString(ArrowType arrowType) {
if (arrowType instanceof ExtensionType) {
arrowType = ((ExtensionType) arrowType).storageType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we need a recursive call, if the storage type of the extension type is another extension type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


jint JNI_VERSION = JNI_VERSION_1_6;

class JniPendingException : public std::runtime_error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting error arrow/java/ffi/src/main/cpp/jni_wrapper.cc:49:78: error: expected class-name before ‘(’ tokenuntil explicitly including stdexcept in this file. (gcc 10.1.1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4

struct ArrowSchema {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to duplicate this file?

Copy link
Member

@pitrou pitrou Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to duplicate or copy these declarations around. They are in the spec:
https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions

Comment on lines 33 to 39
```
mkdir -p ./target/build/
pushd ./target/build/
cmake ../..
make
popd
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to previous comment... I am worried about if we introduce more complexity for keeping the independency of ffi cpp code. Would it be better to somehow reorganize related source files into cpp folder (maybe under jni/)? We can still include ffi's own lib into its jar file during mvn install.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the questions in the PR message. I am looking for guidance on where this code should be, should it be enabled by default and if not which flag should be used, should it be built as part of arrow cpp or independently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is entirely independent from Arrow C++, so it should live in the Arrow Java source tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to integrate it into the CI. Any guidelines on whether this should be behind a feature flag or not and whether the flag should be enabled by default?

Copy link
Contributor Author

@roee88 roee88 Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is entirely independent from Arrow C++, so it should live in the Arrow Java source tree.

@pitrou I believe (after trying) that this adds a lot of complexity that is already solved in the cpp side of the project (cross-platform builds, linting, packaging, etc.). I would like to propose adopting the suggestion by @zhztheplayer to move it to cpp/jni.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you intend to do it exactly, but please kind in mind that the whole point of the C data interface is that you don't need Arrow C++ to use it. So even if this were in the Arrow C++ source tree, it should be completely standalone and buildable independently.


```
cd java
mvn -Parrow-ffi install
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So clean build is not allowed here right? As we already put all cpp build files to target/.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to be a separate build directory outside target so it should be fine now. I had to add a .gitignore file with that folder otherwise the license check fails.

Comment on lines 105 to 107
if (private_data->vm_->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION) != JNI_OK) {
ThrowPendingException("JNIEnv was not attached to current thread");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use vm_->AttachCurrentThread instead? As the exported data might be imported and released from other threads that are not attached to JVM. We may allow this GetEnv routine in other modules but that was because in those modules we don't have to deal with native-created threads.

https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#AttachCurrentThread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by adding a new JNIEnvGuard class.

@Override
public ArrowBuf retain(ArrowBuf srcBuffer, BufferAllocator targetAllocator) {
retain();
return srcBuffer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we make a trade-off between returning the source buffer or creating a new buffer here? I am not sure which way is better but it seems that in this way the source buffer will be shared to the target vector along with its internal states (read / write indexes) during buffer loading, although an allocator was assigned to that vector.

Also, here we may fail to register any buffer to the targetAllocator. But maybe it doesn't really hurt and be too tricky to improve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here we may fail to register any buffer to the targetAllocator. But maybe it doesn't really hurt and be too tricky to improve.

I derived a new ArrowBuf but I can't think of any way to associate it with the given targetAllocator given that this is foreign memory and given the release management in the C Data Interface. FYI this was originally copied from the ORC implementation in arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough to me. We don't have to associate it to the targetAllocator here.

* @return Array of pointer values as longs
*/
public static long[] toJavaArray(long arrayPtr, int size) {
if (size == 0 || arrayPtr == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please note that an empty array and a null are different things

@pitrou
Copy link
Member

pitrou commented Sep 9, 2021

For the record, here's an example of testing against PyArrow in the Go PR:
https://github.com/apache/arrow/pull/11037/files#diff-7823a4f3ee456fdcc402770f66d886ce9946f4fd8ce465d32921fae9ee92dfe1

if (refCnt == 0) {
// refcount of this reference manager has dropped to 0
// release the underlying memory
synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we need to check refCnt == 0 again here?

Copy link
Contributor Author

@roee88 roee88 Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the counter is atomic, in what scenario is it needed?

I believe that if the counter is decremented to zero then it must be released. Incrementing the counter after (or during) shouldn't change the decision to release. I saw no special code for handling the increment-after-release case in other reference managers so I assume it's not possible by design. However, if desired I can add it just in case.

Copy link
Member

@zhztheplayer zhztheplayer Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we need to check refCnt == 0 again here?

Literally speaking refCnt is a local variable, it doesn't change.

I think even we check using bufRefCnt.get(), that is still not enough. We don't lock writes from other places so the value might happen to be written from 0 to 1 to 0 during this block. That is an ABA problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roee88 BufferLedger has a check at line 185 and it seems that we don't have it here

@Override
public void retain(int increment) {
Preconditions.checkArgument(increment > 0, "retain(%s) argument is not positive", increment);
if (BaseAllocator.DEBUG) {
historicalLog.recordEvent("retain(%d)", increment);
}
final int originalReferenceCount = bufRefCnt.getAndAdd(increment);
Preconditions.checkArgument(originalReferenceCount > 0);
}

Since release(int decrement) and retain(int increment) are both public so I think one could call retain after calling release manually.

Copy link
Contributor Author

@roee88 roee88 Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@zhztheplayer can you please have a quick look at the OrcReferenceManager implementation to see if a ticket needs to be opened for a fix there.

Copy link
Member

@zhztheplayer zhztheplayer Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we may be able to move all the checks to a debug option (as well as in BufferLedger)? I am not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If by "check" you mean it writes a nice error message, then that sounds reasonable indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Error message is not created here. We may need some.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a message. Can you check if the current implementation is correct? I think that we can omit the synchronized block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It now looks correct to me. We don't seem to need the synchronization block so long as the block will be entered only once. Although in my opinion people can catch exceptions from #retain by their own. That's a rare case so we either remove the sync block or expand it to the same scope as in 'BufferLedger'.

@roee88 roee88 force-pushed the java-c-data-interface branch from a787a54 to cca1a84 Compare September 29, 2021 12:55
@emkornfield
Copy link
Contributor

Is this PR waiting for review or are updates needed from the author?

@roee88
Copy link
Contributor Author

roee88 commented Oct 11, 2021

@emkornfield We already addressed all of the review comments provided so far, but this PR was waiting for CI integration for unit testing and jar packaging. These were added in the last commit so a review of that part is required.

@pitrou pitrou requested a review from kou October 11, 2021 13:42
roee88 and others added 6 commits October 11, 2021 21:34
Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
* Add testing of the ffi module to the JNI tests
* Add packaging of ffi module to java jars packaging

Signed-off-by: roee88 <roee88@gmail.com>
Co-authored-by: Doron Chen <CDORON@il.ibm.com>
Signed-off-by: roee88 <roee88@gmail.com>
@roee88 roee88 force-pushed the java-c-data-interface branch from 69016b3 to 99d7556 Compare October 11, 2021 18:38
@kou
Copy link
Member

kou commented Oct 11, 2021

@github-actions crossbow submit java-jars

@github-actions
Copy link

Revision: 99d7556

Submitted crossbow builds: ursacomputing/crossbow @ actions-913

Task Status
java-jars Github Actions

java/ffi/pom.xml Outdated
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>arrow-ffi</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that arrow-ffi may not be good. It seems that FFI is too generic. If we use FFI, users may think that they can call any functions implemented in not Java. (General FFI library can do them.)
How about arrow-c-data and org.apache.arrow.c.Data?

fi

if [ "${ARROW_JAVA_FFI}" = "ON" ]; then
${mvn} -Darrow.ffi.cpp.build.dir=${ffi_build_dir} -Parrow-ffi install
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpp confuses me with C++ implementation.
How about jni instead of cpp such as -Darrow.ffi.jni.build.dir?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the previous suggestion about also renaming the package to c and the class to Data, I thought that this property should now be called arrow.c.jni.dist.dir. (arrow.c to be similar to arrow.vector.* and arrow.memory.* properties that start with the package/module name and jni.dist.dir to indicate that this is the directory with the shared libs). Is that OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, c may mislead C GLib implementation.
Generally, we use "GLib" or "C GLib" for C GLib implementation. Most of people will not think about C GLib implementation by c here.
And c means Arrow's C data interface here. (It doesn't mean C language is used for implementation.) It's natural that we use c here.
So I'm OK with arrow.c.jni.dist.dir.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Package name: org.apache.arrow.c
  2. Class name: Data
  3. Maven profile name: arrow-c-data
  4. Shared library name: libarrow_cdata_jni
  5. Maven property for dir with the shared library: arrow.c.jni.dist.dir
  6. CI script for building the shared library: java_cdata_build.sh
  7. CI flag to enable building the shared library: ARROW_JAVA_CDATA=ON

Anything that should be different before I make the changes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed accordingly

Comment on lines 1335 to 1336
/arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&
/arrow/ci/scripts/cpp_build.sh /arrow /build &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&
/arrow/ci/scripts/cpp_build.sh /arrow /build &&
/arrow/ci/scripts/cpp_build.sh /arrow /build &&
/arrow/ci/scripts/java_ffi_build.sh /arrow /build/java/ffi/build /build/java/ffi &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

project(arrow_ffi_java)

# Find java/jni
include(FindJava)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?
I think that find_package(Java) read the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# Find java/jni
include(FindJava)
include(UseJava)
include(FindJNI)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?
I think that find_package(JNI) read the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

target_link_libraries(arrow_ffi_jni ${JAVA_JVM_LIBRARY})
add_dependencies(arrow_ffi_jni ${PROJECT_NAME})

install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that ${CMAKE_INSTALL_PREFIX}/ is needless here:

Suggested change
install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR})
install(TARGETS arrow_ffi_jni DESTINATION ${CMAKE_INSTALL_LIBDIR})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

mkdir -p build
pushd build
cmake ..
make
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use cmake --build . here. It doesn't depend on make nor ninja.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: roee88 <roee88@gmail.com>
* Package name: org.apache.arrow.c
* Class name: Data
* Maven profile name: arrow-c-data
* Shared library name: libarrow_cdata_jni
* Maven property for dir with the shared library: arrow.c.jni.dist.dir
* CI script for building the shared library: java_cdata_build.sh
* CI flag to enable building the shared library: ARROW_JAVA_CDATA=ON

Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
Signed-off-by: roee88 <roee88@gmail.com>
@roee88
Copy link
Contributor Author

roee88 commented Oct 12, 2021

@kou can you please crossbow submit java-jars again?

@kou
Copy link
Member

kou commented Oct 12, 2021

@github-actions crossbow submit java-jars

@github-actions
Copy link

Revision: 4fbb16f

Submitted crossbow builds: ursacomputing/crossbow @ actions-918

Task Status
java-jars Github Actions

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

CI failures are unrelated.

@kou
Copy link
Member

kou commented Oct 12, 2021

@roee88 Could you update description of this pull request for the latest code? Our merge tool uses the description as commit message. I'll merge this after you update the description.

Comment on lines 309 to 310
public static VectorSchemaRoot importVectorSchemaRoot(BufferAllocator allocator, ArrowSchema schema, ArrowArray array,
CDataDictionaryProvider provider) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all other methods the array parameter is before the schema parameter. In this method it's different and the original reason is that array is optional. However, while writing code that uses it I realized that it's easy to get it wrong due to the inconsistency. @kou is it okay if I swap it for consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!
Please ping me when you think that this pull request is ready to merge. I'll merge this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
Ready. Thanks 👍

Signed-off-by: Doron Chen <cdoron@il.ibm.com>

Co-authored-by: CDORON@il.ibm.com <cdoron@lnx-arrow.sl.cloud9.ibm.com>
@kou
Copy link
Member

kou commented Oct 13, 2021

@github-actions crossbow submit java-jars

@github-actions
Copy link

Revision: bff6410

Submitted crossbow builds: ursacomputing/crossbow @ actions-924

Task Status
java-jars Github Actions

@kou kou closed this in e379ee1 Oct 13, 2021
@ursabot
Copy link

ursabot commented Oct 13, 2021

Benchmark runs are scheduled for baseline = e2b1dd9 and contender = e379ee1. e379ee1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.68% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.54% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@roee88 roee88 deleted the java-c-data-interface branch December 5, 2021 21:23
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Experimental C Data Interface support in Arrow Java for 64bit systems.

Roundtrip example with dictionaries:

```java
  void roundtrip(FieldVector vector, DictionaryProvider provider) {
    // Consumer allocates empty structures
    try (ArrowSchema consumerArrowSchema = ArrowSchema.allocateNew(allocator);
        ArrowArray consumerArrowArray = ArrowArray.allocateNew(allocator)) {

      // Producer creates structures from existing memory pointers
      try (ArrowSchema arrowSchema = ArrowSchema.wrap(consumerArrowSchema.memoryAddress());
          ArrowArray arrowArray = ArrowArray.wrap(consumerArrowArray.memoryAddress())) {
        // Producer exports vector into the C Data Interface structures
        Data.exportVector(allocator, vector, provider, arrowArray, arrowSchema);
      }

      // Consumer imports vector (and dictionaries)
      try (CDataDictionaryProvider dictionaries = new CDataDictionaryProvider();
          FieldVector imported = Data.importVector(allocator, consumerArrowArray, consumerArrowSchema, dictionaries)) {
          // Do something with the imported vector
      }
    }
  }
```

Closes apache#11067 from roee88/java-c-data-interface

Lead-authored-by: Roee Shlomo <roee88@gmail.com>
Co-authored-by: roee88 <roee88@gmail.com>
Co-authored-by: Doron Chen <cdoron@il.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants