ARROW-10111: [Rust] Added new crate with code that consumes C Data interface #8287

jorgecarleitao · 2020-09-27T14:36:04Z

This PR:

adds a crate where we can use Rust + Python + Pyarrow together to perform round-trips between them.
adds the C Data Interface as a Rust struct, so that we can pass memory addresses around.

github-actions · 2020-09-27T14:46:47Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2020-09-27T15:05:33Z

https://issues.apache.org/jira/browse/ARROW-10111

rust/arrow/src/array/ffi.rs

pitrou · 2020-09-28T14:54:17Z

As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.

jorgecarleitao · 2020-09-28T16:59:06Z

As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.

Can you explain the rationale for the two structs? Is it due to different ownership rules?

The design so far (for consumption) in this PR:

Have two structs (FFI_ArrowArray and FFI_ArrowSchema) that are ABI-compatible with the ArrowArray and ArrowSchema structs in the C data interface.
Have one struct that owns both FFI_ArrowSchema and FFI_ArrowArray, that knows how to convert from and to Rust's implementation of an Arrow Array.

I am still a bit lost as to who owns what: when we pass the pointer from C to Rust, does C assume that it should not free the resources when the pointer goes out of scope? I.e. what is the contract between the consumer and the producer with respect to who should free the memory (Rust, by calling the "free")? Or is the contract that both check whether the release pointer is null and call it accordingly? If so, what about thread safety?

pitrou · 2020-09-28T17:18:08Z

Having FFI_ArrowSchema and FFI_ArrowArray is fine, however they should probably minimally mirror the C struct and nothing else.
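As an illustration of what "minimally mirror the C struct" could look like, here is a sketch that assumes the field layout published in the Arrow C data interface specification. The struct names come from this PR; the `empty` constructor is a hypothetical helper added only for the example.

```rust
use std::os::raw::{c_char, c_void};
use std::ptr;

/// Mirror of `struct ArrowSchema` from the C data interface (fields in C order).
#[repr(C)]
#[allow(non_camel_case_types)]
pub struct FFI_ArrowSchema {
    pub format: *const c_char,
    pub name: *const c_char,
    pub metadata: *const c_char,
    pub flags: i64,
    pub n_children: i64,
    pub children: *mut *mut FFI_ArrowSchema,
    pub dictionary: *mut FFI_ArrowSchema,
    pub release: Option<unsafe extern "C" fn(*mut FFI_ArrowSchema)>,
    pub private_data: *mut c_void,
}

/// Mirror of `struct ArrowArray` from the C data interface (fields in C order).
#[repr(C)]
#[allow(non_camel_case_types)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}

impl FFI_ArrowArray {
    /// Hypothetical helper: a struct in the "released" state (null `release`).
    pub fn empty() -> Self {
        FFI_ArrowArray {
            length: 0,
            null_count: 0,
            offset: 0,
            n_buffers: 0,
            n_children: 0,
            buffers: ptr::null_mut(),
            children: ptr::null_mut(),
            dictionary: ptr::null_mut(),
            release: None,
            private_data: ptr::null_mut(),
        }
    }
}
```

Using `Option<unsafe extern "C" fn(..)>` for `release` keeps the field pointer-sized while letting `None` stand for the null callback that marks a released struct.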

In addition to those two low-level structs, you need specific structs for importing and exporting.

When exporting, you need an allocated struct that gets stored in the void* private_data, and that will store ownership data (for example, a reference count). Your release callback will then dereference the private_data pointer and release ownership (for example, by decrementing the reference count).

You can see the C++ array release callback here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L482-L502
Note the last line:

  delete reinterpret_cast<ExportedArrayPrivateData*>(array->private_data);

This is taking the private_data, interpreting it as a pointer to the ExportedArrayPrivateData struct, and destroying it. The ExportedArrayPrivateData contains a reference-counted pointer to ArrayData, which is the actual C++ Arrow class containing the data.

When importing, you need a struct that contains the FFI_ArrowArray (if importing an array). That struct needs to be kept alive by the corresponding Rust Arrow Array (for example, in C++ we have a Buffer class that can be subclassed for different kinds of buffers: allocated by C++, allocated by Python...) (*). When the struct's Rust destructor is called, it should call the embedded FFI_ArrowArray's release callback.

If you choose to manage ownership through buffers, since an array will have several buffers you probably want your importing struct to reference-count the FFI_ArrowArray (so that it will be released when all buffers are destroyed). I've found this example that might be useful: https://users.rust-lang.org/t/how-to-safely-deal-with-c-uint8-t-or-char-in-rust/43109/6

You can see the exact C++ equivalent here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L1144-L1177

the ImportedArrayData is a simple wrapper struct around ArrowArray with an additional destructor that calls the release callback
the ImportedBuffer has a reference-counted pointer to ImportedArrayData, such that the last buffer depending on a ImportedArrayData will release it when it disappears
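The import side described above could be sketched in Rust roughly as follows. This is only an illustration: `FfiArray` is a reduced stand-in (not ABI compatible) for the C struct, and `fake_release`, `RELEASE_CALLS` and `import_two_buffers` are hypothetical names that play the role of a foreign producer so the lifetime behaviour can be observed.

```rust
use std::os::raw::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Reduced stand-in for the C `ArrowArray` struct (illustration only).
pub struct FfiArray {
    pub release: Option<unsafe extern "C" fn(*mut FfiArray)>,
    pub private_data: *mut c_void,
}

/// Wrapper whose destructor calls the producer's release callback.
pub struct ImportedArrayData {
    array: FfiArray,
}

impl Drop for ImportedArrayData {
    fn drop(&mut self) {
        if let Some(release) = self.array.release {
            // Hand the struct back to whoever produced it.
            unsafe { release(&mut self.array) };
        }
    }
}

/// A buffer view that keeps the imported array alive through an `Arc`.
pub struct ImportedBuffer {
    _owner: Arc<ImportedArrayData>,
    ptr: *const u8,
    len: usize,
}

impl ImportedBuffer {
    pub fn as_slice(&self) -> &[u8] {
        // Valid for as long as `_owner` has not been released.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

/// Counts release calls so the sketch can be checked (hypothetical).
pub static RELEASE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical producer-side callback: frees the bytes it allocated.
unsafe extern "C" fn fake_release(array: *mut FfiArray) {
    unsafe {
        drop(Box::from_raw((*array).private_data as *mut Vec<u8>));
        (*array).release = None;
    }
    RELEASE_CALLS.fetch_add(1, Ordering::SeqCst);
}

/// Hypothetical import: two buffers that share one imported array.
pub fn import_two_buffers(bytes: Vec<u8>) -> (ImportedBuffer, ImportedBuffer) {
    let data = Box::new(bytes);
    let (ptr, len) = (data.as_ptr(), data.len());
    let array = FfiArray {
        release: Some(fake_release),
        private_data: Box::into_raw(data) as *mut c_void,
    };
    let owner = Arc::new(ImportedArrayData { array });
    (
        ImportedBuffer { _owner: Arc::clone(&owner), ptr, len },
        ImportedBuffer { _owner: owner, ptr, len },
    )
}
```

Dropping the first buffer leaves the data readable through the second; the release callback runs only when the last `ImportedBuffer` disappears.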

Please tell me if that's clear, or don't hesitate to ask more questions :-)

jorgecarleitao · 2020-09-30T05:53:19Z

To set expectations right, IMO this is a very difficult task.

IMO there are at the moment 3 issues:

buffer slices aka buffer `offset` aka `parent_` buffer

Rust and C++ use slightly different approaches to slicing buffers.

In C++, we assign a parent_ buffer whenever we slice a buffer.
In Rust, the raw data is known as a BufferData, and a buffer is composed of a BufferData and an offset (into the data).

In both cases, memory management is tricky. If we slice a buffer from C++ and export it, does it know that it cannot release the content?

Specifically:

create a buffer A in C++
slice it into buffer B (which now has a parent -> A)
export B to Rust
Rust calls B->release
C++ accesses the contents (via A) on a region that overlaps with B (UB?)

I was unable to find any reference to the buffer's parent in bridge.cc, nor any shared pointer to the sliced region.

I am asking because we have an analogous problem in Rust, but in Rust we use a shared pointer to the memory region (BufferData), which I think protects us from this behavior. Specifically, a Buffer in Rust is composed of:

an Arc to the actual region
an offset of where to start from in that region (non-zero in slicing)
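A simplified model of that layout, written as a sketch rather than the actual arrow crate code: `BufferData` here just wraps a `Vec<u8>`, and the `slice` and `owners` methods are hypothetical names used for illustration.

```rust
use std::sync::Arc;

/// Stand-in for the raw memory region.
pub struct BufferData {
    bytes: Vec<u8>,
}

/// A buffer is a shared handle to the region plus a starting offset.
pub struct Buffer {
    data: Arc<BufferData>,
    offset: usize,
}

impl Buffer {
    pub fn new(bytes: Vec<u8>) -> Self {
        Buffer { data: Arc::new(BufferData { bytes }), offset: 0 }
    }

    /// Slicing copies nothing: it shares the region and moves the offset.
    pub fn slice(&self, offset: usize) -> Self {
        assert!(self.offset + offset <= self.data.bytes.len());
        Buffer { data: Arc::clone(&self.data), offset: self.offset + offset }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data.bytes[self.offset..]
    }

    /// Number of buffers currently sharing the region.
    pub fn owners(&self) -> usize {
        Arc::strong_count(&self.data)
    }
}
```

Because every slice holds its own `Arc`, the region is freed only after the original buffer and all of its slices have been dropped.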

Rust's implementation of Dictionary arrays

I think that Rust's implementation of dictionaries is difficult to bridge with C, as it assumes dictionary data is owned by a struct that is not ArrayData. I think that we will need to address this first. I raised this in ARROW-10128.

Threading

How do we handle threads? We mutex the release?

pitrou · 2020-09-30T10:11:40Z

If we slice a buffer from C++ and export it, does it know that it cannot release the content?

Yes, it does. I think you're overestimating the difficulty here, it's actually quite simple. In C++, Buffers (and slices, which are Buffers) are reference-counted (using the shared_ptr class). To take your example:

A is reference-counted (at the start, the reference count is 1 since C++ holds a reference)
B is reference-counted, and also increments A's reference count (through the parent pointer)
the exported C data increments B's reference count (through the allocated private_data)
when Rust releases the C data, the release callback decrements B's reference count (by destroying the private_data)
if B's reference count has dropped to 0, it is destroyed, which also decrements A's reference count (through parent)
as long as C++ has a strong reference to A, its reference count is >= 1, so it isn't destroyed
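The counting in these steps can be traced with a small model that uses Rust's `Arc` in place of C++'s `shared_ptr`. All type and function names below are hypothetical; the function records the pair (A's count, B's count) after slicing, after exporting, after the release callback, and after the producer drops its own handle to B.

```rust
use std::sync::Arc;

/// Models buffer A: the original allocation.
pub struct Region {
    pub bytes: Vec<u8>,
}

/// Models buffer B: a slice that keeps its parent alive.
pub struct Slice {
    pub parent: Arc<Region>,
    pub offset: usize,
}

/// Models the struct stored behind `private_data` when B is exported.
pub struct ExportedPrivateData {
    pub buffer: Arc<Slice>,
}

/// Returns (count of A, count of B) after each step of the walk-through.
pub fn trace_counts() -> Vec<(usize, usize)> {
    let mut log = Vec::new();

    // The producer creates A, then slices it into B.
    let a = Arc::new(Region { bytes: vec![0; 8] });
    let b = Arc::new(Slice { parent: Arc::clone(&a), offset: 4 });
    // A weak handle lets us observe B's count even after B is gone.
    let b_watch = Arc::downgrade(&b);
    log.push((Arc::strong_count(&a), b_watch.strong_count()));

    // Exporting B stores one more reference in the private data.
    let private = Box::new(ExportedPrivateData { buffer: Arc::clone(&b) });
    log.push((Arc::strong_count(&a), b_watch.strong_count()));

    // The consumer calls release: the private data is destroyed.
    drop(private);
    log.push((Arc::strong_count(&a), b_watch.strong_count()));

    // The producer drops its own handle to B; A is still referenced.
    drop(b);
    log.push((Arc::strong_count(&a), b_watch.strong_count()));

    log
}
```

At no point does A's count reach zero while the producer still holds `a`, which is why the consumer's release cannot invalidate the parent region.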

And anyway, Rust shouldn't worry about what happens on the C++ side. You're getting an ArrowArray struct, which may come from C++, but may also come from something else (e.g. DuckDB has started supporting the C data interface).

I think that Rust's implementation of dictionaries is difficult to bridge with C

Ah, that may be a problem. But you can start with non-dictionary arrays. I'd also recommend starting small (only primitive types, for example), checking that everything works (especially lifetime handling), then implementing more types.

How do we handle threads? We mutex the release?

You don't have to. release should only be called when the consumer is done with the data, so by definition it cannot be called if other consumer threads are accessing data (otherwise it's a bug in the consumer).

pitrou · 2020-09-30T10:30:58Z

In Rust, I see that BufferData is reference-counted in Buffer, so for exporting it should be able to follow the same principles as C++.

I would suggest you start with that: implement only exporting, start with primitive types. Exercise using Python tests to import the data (e.g. pyarrow.Array._import_from_c), and check lifetime handling.

For importing, it seems BufferData may have to define a custom destructor function, or an optional external owner. I'm not a Rust developer, so I can't advise on that :-)

pitrou · 2020-09-30T10:33:05Z

(perhaps there's also a Rust developer that's more familiar with C++ and can help you?)


jorgecarleitao · 2020-10-06T16:13:34Z

I have been working heavily on this problem on a separate branch, based on your ideas, @pitrou, and I think I need some input.

That code is still a mess, as I am still in the design/experimentation phase. What it can do so far:

import an array from Python and perform arbitrary operations on it
export an array to Python and perform operations on it (from Python) ...

Step 2 causes a double free and crashes when Python releases the resource. I know why and I am working on it. While working on it, I found a catch, on which I would very much welcome your input.

Currently, in Rust, two distinct arrays can share a buffer via an (atomically counted) shared pointer, Arc.

Say we have two arrays A and B that share a buffer. When we export array A, I think that our release cannot just free the buffer: any ref-counts will be ignored and we may end up with a dangling pointer at B. Instead, it seems that we need to keep track of the refcounts.

In this direction, exporting an array (without children for now) is equivalent to increasing the ref count by 1, and releasing the exported array is equivalent to decreasing it by 1. Specifically, exporting an array consists of:

for each buffer in the array, manually increase its Arc's (strong) refcount by 1
store the memory location of each Arc in private data
build the ABI struct with the private data and expose the pointer to Python/whatever

This is artificially stating that our struct now also shares read ownership over that data. Because the refcount was increased by 1, rust won't free the resources automatically.

Releasing an Array consists of:

read private data and interpret parts of it as Arcs
reduce (strong) refcount of each Arc by 1

Does this make any sense?

Btw, is this what is meant in this section of the C Data interface with respect to shared_ptrs?

pitrou · 2020-10-06T17:13:48Z

You're doing it wrong. I suggest again that you try to follow how C++ does things, otherwise you'll get lost.

For example, your release callback assumes that buffers have been allocated by Rust. This is trivially wrong if e.g. roundtripping from Python to Rust to Python.

So, what needs to happen is you have something like this (not necessarily working, but you get the idea):

struct ExportedArrayData {
  // keeps every buffer of the exported array alive
  buffers: Vec<Buffer>,
  // other stuff perhaps...
}

Then your private_data must point to a heap-allocated ExportedArrayData. Your release callback will cast back private_data to ExportedArrayData and destroy it (releasing all the buffers). This can probably be done using:

private_data = Box::into_raw(Box::new(ExportedArrayData...)) as *mut c_void when exporting
Box::from_raw(private_data as *mut ExportedArrayData) in the release callback
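Put together, the export path suggested here might look like the sketch below. It rests on several assumptions: `FfiArray` is a reduced stand-in (not ABI compatible) for the C struct, `Buffer` is modelled as `Arc<Vec<u8>>` instead of arrow's own type, and `export`, `release_exported` and the `buffer_ptrs` field are hypothetical names.

```rust
use std::os::raw::c_void;
use std::ptr;
use std::sync::Arc;

/// Reduced stand-in for the C `ArrowArray` struct (illustration only).
pub struct FfiArray {
    pub n_buffers: i64,
    pub buffers: *mut *const c_void,
    pub release: Option<unsafe extern "C" fn(*mut FfiArray)>,
    pub private_data: *mut c_void,
}

/// Stand-in for a reference-counted buffer.
pub type Buffer = Arc<Vec<u8>>;

/// Heap-allocated ownership record stored behind `private_data`.
struct ExportedArrayData {
    // Holding the buffers keeps their memory alive until release.
    #[allow(dead_code)]
    buffers: Vec<Buffer>,
    // Pointer table handed to the consumer; it must outlive the export.
    buffer_ptrs: Vec<*const c_void>,
}

/// Release callback: destroys the private data, dropping every buffer handle.
unsafe extern "C" fn release_exported(array: *mut FfiArray) {
    unsafe {
        let array = &mut *array;
        drop(Box::from_raw(array.private_data as *mut ExportedArrayData));
        array.buffers = ptr::null_mut();
        array.private_data = ptr::null_mut();
        // A null release callback marks the struct as released.
        array.release = None;
    }
}

/// Exports the buffers by moving them into heap-allocated private data.
pub fn export(buffers: Vec<Buffer>) -> FfiArray {
    let buffer_ptrs: Vec<*const c_void> = buffers
        .iter()
        .map(|b| b.as_slice().as_ptr() as *const c_void)
        .collect();
    let mut private = Box::new(ExportedArrayData { buffers, buffer_ptrs });
    FfiArray {
        n_buffers: private.buffer_ptrs.len() as i64,
        buffers: private.buffer_ptrs.as_mut_ptr(),
        release: Some(release_exported),
        private_data: Box::into_raw(private) as *mut c_void,
    }
}
```

The callback never frees buffer memory directly; it only drops the handles it holds. A buffer whose memory came from elsewhere (for example, one that was itself imported) would then be cleaned up by its own destructor, which is the round-trip case raised above.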

And again, I suggest you tackle importing and exporting separately.

jorgecarleitao · 2020-10-06T20:25:39Z

Sorry, I think I did not make myself very clear, and/or the code is still not very well documented, but I have been working towards exactly that.

I was just a bit unsure about what you store in the private data, to help the release. That cleared it up.

nevi-me · 2020-10-07T21:49:14Z

Hey @jorgecarleitao, I'll only be able to look at this either over the weekend or during the coming week

jorgecarleitao · 2020-10-08T03:58:22Z

No need, @nevi-me, I am still working on this in another place. I will close this PR, as it won't fly, and open a separate one.

nevi-me · 2020-10-08T04:22:24Z

No worries, I'll still have a look at this so I can see the approach that you're taking

jorgecarleitao · 2020-10-08T04:29:04Z

Prob. the best place to see that is in this PR on my repo. It has the latest version.

jorgecarleitao changed the title ~~ARROW-10110: [Rust] Added new crate with code to consume C Data interface to Rust~~ [Rust] Added new crate with code to consume C Data interface to Rust Sep 27, 2020

jorgecarleitao changed the title ~~[Rust] Added new crate with code to consume C Data interface to Rust~~ ARROW-10111: [Rust] Added new crate with code to consume C Data interface to Rust Sep 27, 2020

andygrove added the Component: Rust label Sep 27, 2020

jorgecarleitao changed the title ~~ARROW-10111: [Rust] Added new crate with code to consume C Data interface to Rust~~ ARROW-10111: [Rust] Added new crate with code that consumes C Data interface Sep 27, 2020

pitrou reviewed Sep 28, 2020


jorgecarleitao added 5 commits October 3, 2020 08:39

Improved support for externally owned memory regions.

548cf5b

The existing implementation was not useful to support FFI as it did not specify how to release memory.

Added tests, improved code and docs.

377c908

Improved test

b0ff0ec

Added basics to map data types from C Data Interface.

ad5b220

Added import and export of C Data interface.

6c07575

jorgecarleitao closed this Oct 8, 2020

jorgecarleitao deleted the c-abi-comp branch October 8, 2020 06:01
