Skip to content

Conversation

@jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Sep 27, 2020

This PR:

  • adds a crate where we can use Rust + Python + Pyarrow together to perform round-trips between them.
  • adds the C Data Interface as a Rust struct, so that we can pass memory addresses around.

@jorgecarleitao jorgecarleitao changed the title ARROW-10110: [Rust] Added new crate with code to consume C Data interface to Rust [Rust] Added new crate with code to consume C Data interface to Rust Sep 27, 2020
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@jorgecarleitao jorgecarleitao changed the title [Rust] Added new crate with code to consume C Data interface to Rust ARROW-10111: [Rust] Added new crate with code to consume C Data interface to Rust Sep 27, 2020
@github-actions
Copy link

@jorgecarleitao jorgecarleitao changed the title ARROW-10111: [Rust] Added new crate with code to consume C Data interface to Rust ARROW-10111: [Rust] Added new crate with code that consumes C Data interface Sep 27, 2020
@pitrou
Copy link
Member

pitrou commented Sep 28, 2020

As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.

@jorgecarleitao
Copy link
Member Author

As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.

Can you explain the rational for the two structs? Is it due to different ownership rules?

The design so far (for consumption) in this PR:

  • Have two structs (FFI_ArrowArray and FFI_ArrowSchema) that have ABI compatibility with ArrowArray and ArrowSchema struct in the C data interface.
  • Have one struct that owns both FFI_ArrowSchema and FFI_ArrowArray, that knows how to convert from and to Rust's implementation of an Arrow Array.

I am still a bit lost as to who owns what: when we pass the pointer from C to Rust, does C assume that it should not free the resources if it goes out of scope? I.e. what is the contract between the consumer and the producer with respect to whom should free memory (rust's by calling the "free")? Or is the contract: both check for pointer nullability of the release and call it accordingly. If so, what about threat safety?

@pitrou
Copy link
Member

pitrou commented Sep 28, 2020

Having FFI_ArrowSchema and FFI_ArrowArray is fine, however they should probably minimally mirror the C struct and nothing else.

In addition to those two low-level structs, you need specific structs for importing and exporting.

When exporting, you need an allocated struct that gets stored in the void* private_data, and that will store ownership data (for example, a reference count). Your release callback will then dereferences the private_data pointer and release ownership (for example, by decrementing the reference count).

You can see the C++ array release callback here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L482-L502
Note the last line:

  delete reinterpret_cast<ExportedArrayPrivateData*>(array->private_data);

This is taking the private_data, interpreting it as a pointer to the ExportedArrayPrivateData struct, and destroying it. The ExportedArrayPrivateData contains a reference-counted pointer to ArrayData, which is the actual C++ Arrow class containing the data.

When importing, you need a struct that contains the FFI_ArrowArray (if importing an array). That struct needs to be kept alive by the corresponding Rust Arrow Array (for example, in C++ we have a Buffer class that can be subclassed for different kinds of buffers: allocated by C++, allocated by Python...) (*). When the struct's Rust destructor is called, it should call the embedded FFI_ArrowArray's release callback.

If you choose to manage ownership through buffers, since an array will have several buffers you probably want your importing struct to reference-count the FFI_ArrowArray (so that it will be released when all buffers are destroyed). I've found this example that might be useful: https://users.rust-lang.org/t/how-to-safely-deal-with-c-uint8-t-or-char-in-rust/43109/6

You can see the exact C++ equivalent here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L1144-L1177

  • the ImportedArrayData is a simple wrapper struct around ArrowArray with an additional destructor that calls the release callback
  • the ImportedBuffer has a reference-counted pointer to ImportedArrayData, such that the last buffer depending on a ImportedArrayData will release it when it disappears

Please tell me if that's clear, or don't hesitate to ask more questions :-)

@jorgecarleitao
Copy link
Member Author

To set expectations right, IMO this is a very difficult task.

IMO rhere are at the moment 3 issues:

buffer slices aka buffer offset aka parent_ buffer

Rust and C++ use slightly different approaches to slicing buffers.

  • In C++, we assign a parent_ buffer whenever we slice a buffer.
  • In Rust, the raw data is known as a BufferData, and a buffer is composed by a BufferData and an offset (into the data)

In both cases, memory management is tricky. If we slice a buffer from C++ and export it, does it knows that it cannot release the content?

Specifically:

  1. create a buffer A in C++
  2. slice it into buffer B (which now has a parent -> A)
  3. export B to Rust
  4. Rust calls B->release
  5. C++ access the contents (via A) on a region that overlaps with B (UB?)

I was unable to find any reference to the buffer's parent in bridge.cc, nor any shared pointer to the sliced region.

I am asking because we have an analogous problem in Rust, but in Rust we use a shared point to memory region (BufferData), which I think protect us from this behavior. Specifically, a Buffer is rust is composed by:

  • an Arc to the actual region
  • an offset of where to start from in that region (non-zero in slicing)

Rust's implementation of Dictionary arrays

I think that Rust's implementation of dictionaries is difficult to bridge with C, as it assumes dictionary data is owned by a struct that is not ArrayData. I think that we will need to address this first. I raised this in ARROW-10128

Threading

How do we handle threads? We mutex the release?

@pitrou
Copy link
Member

pitrou commented Sep 30, 2020

If we slice a buffer from C++ and export it, does it knows that it cannot release the content?

Yes, it does. I think you're overestimating the difficulty here, it's actually quite simple. In C++, Buffers (and slices, which are Buffers) are reference-counted (using the shared_ptr class). To take your example:

  • A is reference-counted (at the start, the reference count is 1 since C++ holds a reference)
  • B is reference-counted, and also increments A's reference count (through the parent pointer)
  • the exported C data increments B's reference count (through the allocated private_data)
  • when Rust releases the C data, the release callback decrements B's reference count (by destroying the private_data)
  • if B's reference count has dropped to 0, it is destroyed, which also decrements A's reference count (through parent)
  • as long as C++ has a strong reference to A, its reference count is >= 1, so it isn't destroyed

And anyway, Rust shouldn't worry about what happens on the C++ side. You're getting a ArrowArray struct, which may come from C++, but may also come from something else (e.g. DuckDB has started supporting the C data interface).

I think that Rust's implementation of dictionaries is difficult to bridge with C

Ah, that may be a problem. But you can start with non-dictionary arrays. I'd also recommend to start small (only primitive types, for example), check that everything works (especially lifetime handling), then implement more types.

How do we handle threads? We mutex the release?

You don't have to. release should only be called when the consumer is done with the data, so by definition it cannot be called if other consumer threads are accessing data (otherwise it's a bug in the consumer).

@pitrou
Copy link
Member

pitrou commented Sep 30, 2020

In Rust, I see that BufferData is reference-counted in Buffer, so for exporting it should be able to follow the same principles as C++.

I would suggest you start with that: implement only exporting, start with primitive types. Exercise using Python tests to import the data (e.g. pyarrow.Array._import_from_c), and check lifetime handling.

For importing, it seems BufferData may have to define a custom destructor function, or an optional external owner. I'm not a Rust developer, so I can't advise on that :-)

@pitrou
Copy link
Member

pitrou commented Sep 30, 2020

(perhaps there's also a Rust developer that's more familiar with C++ and can help you?)

@jorgecarleitao
Copy link
Member Author

jorgecarleitao commented Oct 6, 2020

I have been heavily working in this problem based on your ideas, @pitrou on a separate branch, and I think I need some input.

That code is still a mess, as I am still in design/experimentation phase. What it can do so far:

  1. import an array from Python and perform arbitrary operations on it
  2. export an array to Python and perform operations on it (from Python) ...

Step 2 causes a double free and crashes when Python releases the resource. I know why and I am working on it. While working on it, I found the catch, which I would welcome very much your input.

Currently, in Rust, two distinct arrays can share a buffer via an (atomically counted) shared pointer, Arc.

Say we have two arrays A and B that share a buffer. When we export array A, I think that our release cannot just free the buffer: any ref-counts will be ignored and we may end up with a dangling pointer at B. Instead, it seems that we need to keep track of the refcounts.

In this direction, exporting an array (without children for for now) is equivalent to increase the ref count by 1, and releasing the exported array is equivalent to decrease it by 1. Specifically, exporting an array consists of

  1. for each buffer in the array, manually increase its Arc's (strong) refcounts by 1
  2. store the memory location of each of the Arc in private data
  3. build the ABI struct with the private data and expose the pointer to Python/whatever

This is artificially stating that our struct now also shares read ownership over that data. Because the refcount was increased by 1, rust won't free the resources automatically.

Releasing an Array consists of:

  1. read private data and interpret parts of it as Arcs
  2. reduce (strong) refcount of each Arc by 1

Does this make any sense?

Btw, is this what it is meant in this section of the C Data interface wrt to shared_ptrs?

@pitrou
Copy link
Member

pitrou commented Oct 6, 2020

You're doing it wrong. I suggest again that you try to follow how C++ does things, otherwise you'll get lost.

For example, your release callback assumes that buffers have been allocated by Rust. This is trivially wrong if e.g. roundtripping from Python to Rust to Python.

So, what needs to happen is you have something like this (not necessarily working, but you get the idea):

struct ExportedArrayData {
  buffers: Vec<Buffer> buffers,
  // other stuff perhaps...
};

Then your private_data must point to an heap-allocated ExportedArrayData. Your release callback will cast back private_data to ExportedArrayData and destroy it (releasing all the buffers). This can probably be done using:

  • private_data = Box::new(ExportedArrayData...).into_raw() as *mut c_void when exporting
  • Box::from_raw(private_data as *mut ExportedArrayData) in the release callback

And again, I suggest you tackle importing and exporting separately.

@jorgecarleitao
Copy link
Member Author

Sorry, I think that did not make myself very clear, and/or that the code is still not very well documented, but I have been working towards exactly that.

I was just a bit unsure about what you store in the private data, to help the release. That cleared it up.

@nevi-me
Copy link
Contributor

nevi-me commented Oct 7, 2020

Hey @jorgecarleitao, I'll only be able to look at this either over the weekend or during the coming week

@jorgecarleitao
Copy link
Member Author

No need, @nevi-me I am still working on this on another place. I will close it as this won't fly. I will PR separately.

@nevi-me
Copy link
Contributor

nevi-me commented Oct 8, 2020

No worries, I'll still have a look at this so I can see the approach that you're taking

@jorgecarleitao
Copy link
Member Author

Prob. the best place to see that is in this PR on my repo. It has the latest version.

@jorgecarleitao jorgecarleitao deleted the c-abi-comp branch October 8, 2020 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants