[C++][Gandiva] Allow registering external C functions #38589

@niyue

Describe the enhancement requested

Description

Issue #37753 added support for registering external functions in Gandiva, so developers can use third-party functions in Gandiva expressions. However, these external functions must be compiled to LLVM IR before they can be registered and used. This limitation causes trouble when a third-party function has a non-trivial dependency, such as an HTTP library: every dependent library has to be compiled into LLVM IR, and all of that IR has to be compiled at runtime, which is slow.

Proposal

To address this limitation, I propose allowing external C functions to be registered with Gandiva, so that Gandiva expressions can use them without compiling third-party functions into LLVM IR. Gandiva already has such functions internally, where they are called stub functions, but this capability is not yet exposed for external functions.

The following APIs are proposed to be added to the FunctionRegistry API for this purpose:

  • arrow::Status Register(NativeFunction func, void* c_function_ptr, std::optional<FunctionHolderMaker> function_holder_maker = std::nullopt)
    • register a C function into the function registry
    • @param func the registered function's metadata
    • @param c_function_ptr the function pointer to the registered function's implementation
    • @param function_holder_maker optional; used to create the function holder if the function requires one, where using FunctionHolderMaker = std::function<arrow::Result<std::shared_ptr<gandiva::FunctionHolder>>(const FunctionNode& function_node)>
  • const std::vector<std::pair<NativeFunction, void*>>& GetCFunctions() const
    • get a list of the C functions saved in the registry
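
The shape of the proposed API can be sketched without Gandiva itself. The block below is a simplified stand-in, not the Gandiva implementation: `NativeFunctionSketch`, `CFunctionRegistrySketch`, `multiply_by_three`, and `RegisterAndCall` are hypothetical names, `Register` returns `bool` instead of `arrow::Status`, and the optional `function_holder_maker` parameter is left out. It only shows the idea of storing function metadata next to a raw C function pointer and reading it back through `GetCFunctions()`.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for gandiva::NativeFunction; it keeps only the
// two names this sketch needs.
struct NativeFunctionSketch {
  std::string base_name;         // name used inside expressions
  std::string precompiled_name;  // symbol name a JIT engine would resolve
};

// Hypothetical stand-in for the function registry. The real proposal returns
// arrow::Status and accepts an optional function holder maker.
class CFunctionRegistrySketch {
 public:
  // Stores the metadata together with the C function pointer.
  // Returns false for a null pointer or an already registered symbol name.
  bool Register(NativeFunctionSketch func, void* c_function_ptr) {
    if (c_function_ptr == nullptr) return false;
    for (const auto& entry : c_functions_) {
      if (entry.first.precompiled_name == func.precompiled_name) return false;
    }
    c_functions_.emplace_back(std::move(func), c_function_ptr);
    return true;
  }

  // Returns every (metadata, pointer) pair saved so far.
  const std::vector<std::pair<NativeFunctionSketch, void*>>& GetCFunctions() const {
    return c_functions_;
  }

 private:
  std::vector<std::pair<NativeFunctionSketch, void*>> c_functions_;
};

// An ordinary natively compiled function with C linkage; no LLVM IR involved.
extern "C" int64_t multiply_by_three(int64_t value) { return value * 3; }

// Registers the example function, then calls it through the stored pointer.
// Casting between function pointers and void* is conditionally supported in
// C++, but works on the mainstream platforms (it is what dlsym relies on).
int64_t RegisterAndCall(CFunctionRegistrySketch& registry, int64_t input) {
  registry.Register({"multiply_by_three", "multiply_by_three_int64"},
                    reinterpret_cast<void*>(&multiply_by_three));
  auto fn = reinterpret_cast<int64_t (*)(int64_t)>(
      registry.GetCFunctions().front().second);
  return fn(input);
}
```

In a real engine, the pointers returned by `GetCFunctions()` would presumably be handed to the JIT as symbol addresses, so generated code can call the function by its precompiled name instead of linking its IR.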

Benefits

  • Complex functions that depend on other libraries can be used without a performance penalty. Today, LLVM IR based functions are slow to construct at runtime when the generated LLVM IR is large (more than several MB). Because constructing an LLVM module requires copying all of the LLVM bitcode into the module, the more functions are implemented in LLVM IR, the slower module construction becomes (unless selective IR loading is supported).
  • LLVM IR does allow users to develop third-party functions in different languages. However, complex external functions may use that language's standard library APIs, which makes it necessary to compile the standard library into LLVM IR as well. This may not be possible for many languages, and the generated LLVM IR would be very large (dead code elimination does not help much here, as far as I can tell). Allowing C functions avoids this issue, since the standard library is typically already part of the program that calls Gandiva (statically linked or dynamically loaded).
  • Certain capabilities, such as thread-local variables, are not available in Gandiva's current JIT engine (MCJIT) when JIT-compiling LLVM IR. Supporting them requires upgrading from MCJIT to the ORC v2 engine ([C++][Gandiva] Migration JIT engine from MCJIT to LLJIT #37848). Some libraries use thread-local variables internally, for example Rust's std::collections::HashMap, so it is easy to run into this restriction when authoring a third-party function in Rust. Allowing C functions removes this limitation.
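
To make the thread-local point concrete, here is a minimal sketch of the kind of function that would be awkward to ship as LLVM IR under MCJIT but is unremarkable as a natively compiled C function. The name `next_call_count` is hypothetical and the function is not part of Gandiva; it simply keeps a per-thread counter.

```cpp
#include <cstdint>

// Hypothetical external function with C linkage. Because it is compiled by
// the host toolchain rather than JIT-compiled from LLVM IR, the thread-local
// storage is set up by the normal platform runtime.
extern "C" int64_t next_call_count() {
  // Each thread gets its own counter, initialized to 0 on first use.
  thread_local int64_t calls_on_this_thread = 0;
  return ++calls_on_this_thread;
}
```

Under the proposal, a function like this would be registered by passing its address as `c_function_ptr`, and the JIT engine would only need to emit a call to it.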

Notes

Component(s)

C++ - Gandiva
