Skip to content

Conversation

@lafiona
Copy link
Contributor

@lafiona lafiona commented Dec 20, 2021

Overview

This pull request implements a more scalable and sustainable approach to organizing the C++ functionality that needs to be exposed to MATLAB. It adds a new singular MEX gateway function, mexfcn('<cpp-function-name>', <cpp-function-argument-1>, ..., <cpp-function-argument-N>), which delegates to specific C++ functions. To make use of mexfcn, the directory containing mexfcn.<mex-extension> must be added to the MATLAB path.

Advantages to this approach:

  • Organization
    • All C++ functions that are exposed to MATLAB are registered in one location
    • Reduce the complexity of managing and linking many MEX files and ensuring that they are added to the MATLAB path
    • Reduce cognitive load for adding new functions
  • Avoid polluting the source tree with build artifacts
  • Reduce build times, building a single MEX function is faster than building potentially hundreds
  • Reduce binary bloat caused by creating a separate MEX file for every MEX function
  • Enable flexibility in terms of where C++ implementation files live

Implementation

  1. One MEX function, mexfcn, defined in matlab/src/cpp/arrow/matlab/mex/mex_util.h, dispatches to individual C++ implementation files.

For example, to invoke featherread functionality from MATLAB, that is implemented in C++:

>> [variables, metadata] = mexfcn('featherread', 'featherfile.feather');
  1. Functionality implemented in C++ that we want to expose in MATLAB is registered in a function map in the file matlab/src/cpp/arrow/matlab/mex/mex_functions.h.
  2. Restructured source tree layout and performed general code clean up in preparation for feature implementation work:
    3.1. Split source code to matlab/src/matlab and matlab/src/cpp
    3.2 Make packages, namespaces, and directories consistent in terms of naming and hierarchy for simplifying navigation and header inclusion.
    3.3 Renamed the MATLAB package name mlarrow to arrow as the ml is superfluous.
  3. Refactored matlab/CMakeLists.txt:
    4.1. Build shared library, arrow_matlab, that contains C++ functionality for the interface.
    4.2. macOS: explicitly add the path of arrow to the rpath of arrow_matlab, as paths of libraries output by imported targets are not automatically included.
    4.3. Windows:
    • Copy shared library, arrow.dll to the directory where the MEX function lives.
    • Add the path to MATLAB and GTest shared libraries to the ctest ENVIRONMENT when building tests.
    • Specify the release version of MSVC Runtime Libraries for all targets created in the CMake file.
  4. Enable the install target:
    5.1. Utilize CMAKE_INSTALL_PREFIX to allow users to customize the install location. The platform-specific default install locations will be used if the user does not specify a custom value.
    5.2. Once installed, the interface's source files and libraries are relocatable, on all platforms.
    5.3. As part of the install step, add the path to the install directory to the MATLAB Search Path by:
    • Option 1: Call addpath and savepath to modify the pathdef.m file that MATLAB uses on startup. This option is on by default. However, it can only be used if CMake has the appropriate permissions to modify pathdef.m. This option is toggled on and off by the flag: MATLAB_ADD_INSTALL_DIR_TO_SEARCH_PATH.
    • Option 2: Add an addpath command to the startup.m file located at the userpath. This option can be used if a user does not have the permissions to modify the pathdef.m file. This option is off, by default, and is toggled on and off by the flag: MATLAB_ADD_INSTALL_DIR_TO_STARTUP_FILE.
    • Option 3: Add the path to the install directory to the MATLABPATH environment variable. The paths listed in MATLABPATH are added to the MATLAB Search Path on start up.
    • The MATLAB Actions install of MATLAB used during CI does not have userpath set, nor do we have edit permissions for pathdef.m. Therefore, we use the third option and set the MATLABPATH environment variable in matlab.yml.

Testing

Qualified CMakeLists.txt changes by building and running all tests:

  • On Windows 10 (Ninja and Visual Studio), macOS 11.5 (Make and Ninja), and Debian 10 (Make and Ninja)
  • Configurations: build both Arrow and GTest, use provided ARROW_HOME, use provided GTEST_ROOT, use both ARROW_HOME and GTEST_ROOT.

Future Directions

  1. Investigate why the default CMake behavior does not link the test executables against the correct MSVC Runtime libraries (ie. ucrtbase.dll versus ucrtbased.dll) when building with Ninja on Windows.
  2. Add support for specifying function names and arguments as MATLAB strings to mexfcn. Currently, only character vectors are supported.
  3. Refactor mexfcn to use MATLAB Data Arrays (MDAs) and C++ Mex.
  4. Investigate running MATLAB tests using matlab_add_unit_test.

Notes

  1. Thank you for all of your help on this pull request, @sgilmore10 and @kevingurney!

sgilmore10 and others added 25 commits December 20, 2021 13:18
added to the build folder - not the source folder.
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Store the Configuration Type value for multi-configuration generators in
a txt file for use during testing

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
…AB test folder. Add directory that contains libmx.dll and libmex.dll to the %PATH% for running CheckNumArgsTestTarget.

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
…shared as a target_link_option for arrow_matlab, as the rpath is not automatically added when linking imported libraries.

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
…release version and add the link libraries to arrow_ep on Windows as BUILD_BYPRODUCTS.

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
@github-actions
Copy link

@lafiona
Copy link
Contributor Author

lafiona commented Dec 20, 2021

I apologize for the size of this pull request. We found it tricky to partition our changes into separate pull requests as they involved modifying the build system and refactoring the structure of the source files. However, in the future, our goal is to submit smaller, more incremental pull requests to reflect more isolated changes.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add GitHub Actions jobs on macOS and Windows to .github/workflows/matlab.yml?

Do you want to work on ARROW-13564 to apply "Co-authored-by" in pull request commits before we merge this?

TARGET GTest::gtest_main
PROPERTY IMPORTED_LOCATION)

add_custom_target(copy_gtest_to_tests_dir ALL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't try this but is this target always run even when GTest files aren't rebuilt?
Can we use BYPRODUCTS and DEPENDS to prevent needless execution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code is not necessary anymore after adding the path of the shared libraries to the test environment path.

set(MATLAB_DLL_DEPENDENCIES_DIR "${Matlab_ROOT_DIR}/bin/win64")
set_tests_properties(CheckNumArgsTestTarget
PROPERTIES ENVIRONMENT
"PATH=${MATLAB_DLL_DEPENDENCIES_DIR}\;$ENV{PATH}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using %PATH% for GTest too instead of copying GTest files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion, it'll be more efficient to skip copying the GTest dlls. I will make this change.

POST_BUILD
COMMAND "${CMAKE_COMMAND}" -E echo $<CONFIG> >
${CMAKE_BINARY_DIR}/is_multi_config.txt)
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we deffer this to a follow-up pull request to reduce changes in this pull request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me! I have removed the code to create is_multi_config.txt. And I will add it in a future pull request for ARROW-15182.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
Could you update the description of this pull request for the latest code?
We use the description on merge.

COMMAND ${CMAKE_COMMAND} -E copy_if_different ${ARROW_SHARED_LIB}
${MATLAB_BUILD_OUTPUT_DIR}
COMMENT "Start copying arrow.dll from ${ARROW_SHARED_LIB} to ${MATLAB_BUILD_OUTPUT_DIR}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use %PATH% for this too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kou, I've updated the description to reflect the changes that I made in response to your feedback.

arrow.dll is not only used by the tests. On Windows, it needs to be copied to the location of arrow_matlab.dll and the MEX function, mexfcn.mexw64, so that they can access the Arrow APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In normal use (not test use), users run cmake --build ... --target install before they use this module, right?
If it's correct, arrow.dll should be located in the same directory as arrow_matlab.dll and mexfcn.mexw64 by cmake --build ... --target install.

If your explanation is correct (we need this for not test use too), we should use install() https://cmake.org/cmake/help/latest/command/install.html instead of add_custom_target().

I think that we can use one of them:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this feedback, @kou! I have thought more about how we are structuring the source files and libraries for deployment.

As you mentioned, it is advantageous to use install() to move arrow.dll to the installed location for normal use.

With the previous behavior of copying to the build tree, the MATLAB tests were able to find mexfcn and its dependencies by adding the path to the build tree to the MATLAB Search Path. However, enabling the install step allows a user to install the interface files (including source files and mex functions) to any location that the user specifies. Therefore, we enabled options to add the install directory, specified by CMAKE_INSTALL_PREFIX, to the MATLAB Search Path, as part of the install step. I have updated the pull request description with more details to reflect the changes we have made.

For running the C++ tests on Windows and macOS, the paths to arrow.dll/libarrow.dylib are added to the ctest environment to eliminate the need to copy at all.

@edponce
Copy link
Contributor

edponce commented Dec 29, 2021

Simply out of curiousity, mexfcn does not supports C++ functions that take FunctionOptions in addition to the function arguments?

@lafiona
Copy link
Contributor Author

lafiona commented Jan 3, 2022

@edponce, mexfcn only receives input arguments from MATLAB. End users will interact with the MATLAB APIs, which will make use of mexfcn. mexfcn will dispatch to individual C++ functions that implement the necessary function specific behavior.

Depending on the needs of a particular C++ function, developers can pass options in the way that they deem appropriate, in order to get that information to the C++ layer. mexfcn accepts a variable number of input arguments, of any type, from MATLAB. So a developer could pass an options struct to mexfcn in order to propagate any options to the C++ layer. The translation from the options struct to C++ data structures would need to be taken care of by the specific C++ functions.

lafiona and others added 3 commits January 20, 2022 15:55
…he directory to the MATLAB Search Path.

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
the MATLAB Search Path. In Github Actions, the installation directory
will be made available to the MATLAB tests by setting the MATLABPATH
environment variable.

Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
set(ARROW_LIBRARY_DIR "${ARROW_PREFIX}/lib")
set(ARROW_SHARED_LIB
"${ARROW_LIBRARY_DIR}/${CMAKE_SHARED_LIBRARY_PREFIX}arrow${ARROW_LIBRARY_SUFFIX}")
set(ARROW_LINK_LIB_FILENAME "${CMAKE_IMPORT_LIBRARY_PREFIX}arrow${ARROW_LINK_SUFFIX}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we can always use CMAKE_IMPORT_LIBRARY_SUFFIX here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Than you for this suggestion! I will make this change for arrow_shared and the GTest targets.

Comment on lines 102 to 103
set_target_properties(${ARROW_LIBRARY_TARGET} PROPERTIES IMPORTED_IMPLIB
${ARROW_LINK_LIB})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always set IMPORTED_IMPLIB here?
I expect it's only used on Windows. (It's ignored on non Windows environments such as Linux and macOS.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's true that the IMPORTED_IMPLIB variable is only used on Windows. Because imported libraries are not used on macOS and Linux, CMAKE_IMPORT_LIBRARY_SUFFIX and CMAKE_IMPORT_LIBRARY_PREFIX are not set on these platforms. They are used to create the file path contained by ARROW_LINK_LIB. Therefore, the value of ARROW_LINK_LIB is not set for macOS and Linux.

Because of this, I would prefer to only set IMPORTED_IMPLIB on Windows.

set(ARROW_GTEST_LIBRARY_DIR "${ARROW_GTEST_PREFIX}/lib")
set(ARROW_GTEST_LINK_LIB_DIR "${ARROW_GTEST_PREFIX}/lib")
set(ARROW_GTEST_LINK_LIB
"${ARROW_GTEST_LINK_LIB_DIR}/${CMAKE_IMPORT_LIBRARY_PREFIX}gtest${ARROW_GTEST_LINK_SUFFIX}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we can always use CMAKE_IMPORT_LIBRARY_SUFFIX here.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kou kou closed this in 9c5e4b3 Feb 6, 2022
@ursabot
Copy link

ursabot commented Feb 6, 2022

Benchmark runs are scheduled for baseline = dc2f678 and contender = 9c5e4b3. 9c5e4b3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.19% ⬆️0.04%] test-mac-arm
[Finished ⬇️0.71% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@lafiona
Copy link
Contributor Author

lafiona commented Feb 7, 2022

Thank you for code reviewing this change and accepting the pull request, @kou!

I am able to see the commit that merged this change into the main branch, however, I see a message on this conversation thread named "Closed with unmerged commits". Do you know what causes this message and are there any action items for me to resolve it?

@kou
Copy link
Member

kou commented Feb 7, 2022

@lafiona Sorry for confusing you by our merge method. You don't need extra action for this pull request.

We use our merge script https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py to merge a pull request. It squashes commits in the pull request, pushes the squashed commit to master and updates the correspondent JIRA issue. We don't use GitHub's "merge button". So "Closed with unmerged commits" is showed on GitHub UI. But you don't need extra action because we have changes in this pull request on master.

@lafiona
Copy link
Contributor Author

lafiona commented Feb 7, 2022

@kou, no worries, thank you for explaining why this occurs! Good to know that it is expected behavior. Thanks again for your time reviewing this change!

@lafiona lafiona deleted the ARROW_13185 branch February 7, 2022 22:06
kou pushed a commit that referenced this pull request Feb 10, 2022
…o reflect latest CMake build system changes

# Overview

This pull request:

1. Updates the high level documentation for the MATLAB interface in `arrow/matlab/README.md` to reflect the latest CMake build system changes that were merged upstream as part of [#12004](#12004).

# Implementation

1. Restructured the `README.md` to focus on the most common build and test workflows for the MATLAB interface.

# Testing

N/A

# Future Directions

1. To be consistent and in-model with the other language bindings (e.g. Python, R, C++, etc.) - we plan to author in-depth technical documentation over time under [`arrow/docs/source/matlab`](https://github.com/apache/arrow/tree/master/docs/source) using the [reStructedText (.rst) format](https://docutils.sourceforge.io/rst.html). The thought behind this `README.md` documentation is to provide a convenient "landing page" for the MATLAB interface which allows end users to quickly get a sense of where the project is at and to get started immediately with building and running the interface. However, every user and developer will have different use cases and needs for working with the MATLAB interface, so we will also want include in-depth documentation for major topics like building, installing, testing, contributing, usage, etc. As the MATLAB interface continues to develop, we will do our best to keep the documentation in sync with the source code.

# Notes

1. Thank you to @lafiona for her help writing this updated documentation!

Closes #12389 from kevingurney/ARROW-13204

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <5904145+kevingurney@users.noreply.github.com>
Co-authored-by: Fiona La <fionala@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kou added a commit to kou/arrow that referenced this pull request May 18, 2022
We use local "git merge" to merge a pull request in
dev/merge_arrow_pr.py.

If we use "git merge" to merge a pull request, GitHub's Web UI shows
"Closed" mark not "Merged" mark in a pull request page. This sometimes
confuses new contributors. "Why was my pull request closed without
merging?" See
apache#12004 (comment) for
example.

If we use GitHub API
https://docs.github.com/en/rest/pulls/pulls#merge-a-pull-request to
merge a pull request, GitHub's Web UI shows "Merged" mark not "Closed"
mark. See apache#13180 for example. I
used GitHub API to merge the pull request.

And we don't need to create a local branch on local repository to
merge a pull request. But we must specify ARROW_GITHUB_API_TOKEN to
run dev/merge_arrow_pr.py.
kou added a commit that referenced this pull request May 31, 2022
We use local "git merge" to merge a pull request in
dev/merge_arrow_pr.py.

If we use "git merge" to merge a pull request, GitHub's Web UI shows
"Closed" mark not "Merged" mark in a pull request page. This sometimes
confuses new contributors. "Why was my pull request closed without
merging?" See
#12004 (comment) for
example.

If we use GitHub API
https://docs.github.com/en/rest/pulls/pulls#merge-a-pull-request to
merge a pull request, GitHub's Web UI shows "Merged" mark not "Closed"
mark. See #13180 for example. I
used GitHub API to merge the pull request.

And we don't need to create a local branch on local repository to
merge a pull request. But we must specify ARROW_GITHUB_API_TOKEN to
run dev/merge_arrow_pr.py.

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants