Skip to content

Conversation

@karldw
Copy link
Contributor

@karldw karldw commented Aug 25, 2021

I took a stab at implementing the approach @nealrichardson laid out in ARROW-12981. Please let me know what you think, and if you'd like any changes!

I wrote some basic tests for the download_optional_dependencies() helper function, but it would be good to have more comprehensive install tests. These could be something like:

export LIBARROW_BINARY=false
export LIBARROW_BUILD=true
export LIBARROW_DOWNLOAD=false
export LIBARROW_MINIMAL=false

# Make sure offline, feature-light installation works
R -e "install.packages('arrow_x.y.z.p.tar.xz')"
R -e 'stopifnot(arrow::arrow_available(), isFALSE(arrow::arrow_info()$capabilities["parquet"]))'

# Download and install the thirdparty features
R -e "arrow::download_optional_dependencies('arrow-thirdparty')"
source arrow-thirdparty/DEFINE_ENV_VARS.sh
R -e "install.packages('arrow_x.y.z.p.tar.xz')
R -e 'stopifnot(arrow::arrow_available(), isTRUE(arrow::arrow_info()$capabilities["parquet"]))'

@github-actions
Copy link

@karldw
Copy link
Contributor Author

karldw commented Aug 25, 2021

These builds are failing because they set LIBARROW_DOWNLOAD is false and they need to download cmake, but my changes block downloading cmake when LIBARROW_DOWNLOAD is false (or when github.com can't be reached). Should I allow cmake to be downloaded here and assume offline builds have cmake installed?

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! Some initial notes, will give it a closer read later

@karldw
Copy link
Contributor Author

karldw commented Aug 25, 2021

In the latest commit, I removed LIBARROW_DOWNLOAD and added TEST_OFFLINE_BUILD. Does that seem right to you?

I wasn't positive I got the logic right in this section of configure:
https://github.com/apache/arrow/pull/11001/files#diff-089697faebdb7820ca629a2bb316b878cc0ba18a5bfb0b60996f8dbcd1fa11e7L133-L140

@nealrichardson
Copy link
Member

In the latest commit, I removed LIBARROW_DOWNLOAD and added TEST_OFFLINE_BUILD. Does that seem right to you?

I think so, but we'll know for sure once we set up CI.

I wasn't positive I got the logic right in this section of configure:
https://github.com/apache/arrow/pull/11001/files#diff-089697faebdb7820ca629a2bb316b878cc0ba18a5bfb0b60996f8dbcd1fa11e7L133-L140

Yes, looks right, I just suggested a further simplification now that we can.

As for CI, there will be an arrow-r-nightly change needed in order to do the rsync etc. that you added to r/Makefile, but the regular CI we want will be in arrow, in our "crossbow" nightly and on-demand builds. There's a bunch of yaml that configures templates here, if you want to take a stab at it. @jonkeane is back from vacation next week and can help with setting that up too.

Also, I just want to reiterate: this is great, thank you very much for taking the initiative on this.

karldw added 2 commits August 26, 2021 15:55
- That is, use the `with_mimalloc()` and `with_s3_support()` functions
- Remove an unused function
- Add a task `test-r-offline-minimal` that sets `TEST_OFFLINE_BUILD`
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're getting closer! Need to look into CI next so we can confirm that this is actually doing what we think it is :)

@karldw
Copy link
Contributor Author

karldw commented Aug 27, 2021

Looking through the logs, I'm still downloading XSIMD when TEST_OFFLINE_BUILD is true and ARROW_THIRDPARTY_DEPENDENCY_DIR isn't set. The ARROW_SIMD_LEVEL setting is getting picked up, but somehow that doesn't translate to not using XSIMD.

/usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED 
-DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=OFF 
-DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=OFF 
-DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF 
-DARROW_WITH_LZ4=OFF -DARROW_WITH_RE2=OFF -DARROW_WITH_SNAPPY=OFF 
-DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
-DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/Rtmpt7emWm/R.INSTALL27343f31d8de/arrow/libarrow/arrow-5.0.0.9000
-DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON 
-DARROW_SIMD_LEVEL=NONE 
-G 'Unix Makefiles' /tmp/Rtmpt7emWm/R.INSTALL27343f31d8de/arrow/tools/cpp

Then later:
--   ARROW_SIMD_LEVEL=NONE [default=NONE|SSE4_2|AVX2|AVX512]
--       Compile-time SIMD optimization level
--   ARROW_RUNTIME_SIMD_LEVEL=MAX [default=NONE|SSE4_2|AVX2|AVX512|MAX]
--       Max runtime SIMD optimization level

- Move `download_optional_dependencies()` function
- Change output of `download_optional_dependencies` to directory used, and
  input to `ARROW_THIRDPARTY_DEPENDENCY_DIR` if it's set.
- Enable `ARROW_VERBOSE_THIRDPARTY_BUILD` if `ARROW_R_DEV` is true and we're
  setting `*_SOURCE_URL` flags so the printed log shows the file used.
- Change env var management in `nixlibx.R` to work with a vector, rather
  than adding to one long string.
  - Add checks to env vars: names must be standard, values can't contain `'`
@karldw
Copy link
Contributor Author

karldw commented Aug 28, 2021

@github-actions crossbow submit -g test-r-offline-minimal

@nealrichardson
Copy link
Member

Looking through the logs, I'm still downloading XSIMD when TEST_OFFLINE_BUILD is true and ARROW_THIRDPARTY_DEPENDENCY_DIR isn't set. The ARROW_SIMD_LEVEL setting is getting picked up, but somehow that doesn't translate to not using XSIMD.

Looking at https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1929-L1935, it looks like you have to set the RUNTIME level to NONE also.

@jonkeane
Copy link
Member

@github-actions crossbow submit test-r-offline-minimal

@github-actions
Copy link

Revision: 5a13cbf

Submitted crossbow builds: ursacomputing/crossbow @ actions-802

Task Status
test-r-offline-minimal Azure

@jonkeane
Copy link
Member

jonkeane commented Sep 3, 2021

@github-actions crossbow submit test-r-offline-maximal

@github-actions
Copy link

github-actions bot commented Sep 3, 2021

Revision: 939cb87

Submitted crossbow builds: ursacomputing/crossbow @ actions-816

Task Status
test-r-offline-maximal Github Actions

@jonkeane
Copy link
Member

jonkeane commented Sep 3, 2021

Aaah, actually the original configuration was fine (though I've adjusted the "Dump test logs" step to always be run (so that it's easier to confirm without downloading the artifacts).

This took a bit of RTFM, but the output of tools::testInstalledPackage() (run against arrow) is placed in arrow-tests of the working directory. So if the tests were not run, the output file would not exist or be blank, but we do see output from it showing that the tests are running. When this run finished I will (double) check that we're not seeing a bunch of skips for various features that are disabled.

@jonkeane
Copy link
Member

jonkeane commented Sep 3, 2021

Ok, this looks good. Though I'm seeing that S3 was disabled, I'll make sure we've got the dependencies installed on the host so that we can catch that too.

Here are the skips:

══ Skipped tests ═══════════════════════════════════════════════════════════════
• ARROW-11090 (date/datetime arithmetic) (1)
• ARROW-12632: ExecuteScalarExpression cannot Execute non-scalar expression (1)
• ARROW-13364 (1)
• ARROW-13691 - na.rm not yet implemented for VarianceOptions (2)
• ARROW-13799: factor() should error but instead we get a string error message in its place (1)
• Arrow C++ not built with s3 (4)
• Flight server is not running (1)
• Implement more aggressive implicit casting for scalars (ARROW-11402) (1)
• Ingest_POSIXct only implemented for REALSXP (1)
• Minio is not running (1)
• Need halffloat support: https://issues.apache.org/jira/browse/ARROW-3802 (1)
• Need to substitute in user defined function too (1)
• RE2 does not support backreferences in pattern (https://github.com/google/re2/issues/101) (1)
• Sorting by only a single timestamp column fails (ARROW-12087) (1)
• TODO: (if anyone uses RangeEquals) (1)
• Table with 0 cols doesn't know how many rows it should have (2)
• These tests are flaking: https://github.com/duckdb/duckdb/issues/2100 (1)
• This OS either does not support changing languages to fr or it caches translations (1)
• Work around masking of data type functions (ARROW-12322) (1)
• count() is not a generic so we have to get here through summarize() (1)
• empty test (1)
• environment variable ARROW_LARGE_MEMORY_TESTS (2)
• https://issues.apache.org/jira/browse/ARROW-7653 (1)
• packageVersion("stringr") > "1.4.0" is not TRUE (1)
• test now faulty - code no longer gives error & outputs a empty tibble (1)

@jonkeane
Copy link
Member

jonkeane commented Sep 3, 2021

@github-actions crossbow submit -g r

I'm running the full suite again since we're close and want to make sure this didn't (accidentally) do anything to our other builds

@github-actions
Copy link

github-actions bot commented Sep 3, 2021

Revision: 6bf7b85

Submitted crossbow builds: ursacomputing/crossbow @ actions-817

Task Status
conda-linux-gcc-py36-cpu-r40 Azure
conda-linux-gcc-py37-cpu-r41 Azure
conda-osx-clang-py36-r40 Azure
conda-osx-clang-py37-r41 Azure
conda-win-vs2017-py36-r40 Azure
conda-win-vs2017-py37-r41 Azure
homebrew-r-autobrew Github Actions
test-r-depsource-auto Azure
test-r-depsource-system Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal Github Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-ubuntu-21.04 Github Actions
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-ubuntu-18.04-r-sanitizer Azure

@karldw
Copy link
Contributor Author

karldw commented Sep 3, 2021

@github-actions crossbow submit test-r-offline-minimal

@github-actions
Copy link

github-actions bot commented Sep 3, 2021

Revision: ec726d6

Submitted crossbow builds: ursacomputing/crossbow @ actions-818

Task Status
test-r-offline-minimal Azure

@jonkeane
Copy link
Member

jonkeane commented Sep 3, 2021

I'll give this one last read through before merging, but I think this is good to go. Thank you for all this work + taking the journey with us as we found the best way to accomplish this.

@karldw
Copy link
Contributor Author

karldw commented Sep 3, 2021

Thanks! I learned a bunch doing this.

I had a couple minor questions, following up on comments from @nealrichardson:

  1. Should I swap the argument order for create_package_with_all_dependencies?
  2. Should create_package_with_all_dependencies check ARROW_THIRDPARTY_DEPENDENCY_DIR?
  3. Before this is finalized, do you want to change any of the new names I've made up? (edit)

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this and it's looking good. I've made a few suggestions (mostly wording) and one question about a possible failure mode.

  1. Should I swap the argument order for create_package_with_all_dependencies?
    I commented on create_package_with_all_dependencies() argument order inline with my comments (I think it's fine as it is, but also fine the other way)
  1. Should create_package_with_all_dependencies check ARROW_THIRDPARTY_DEPENDENCY_DIR?
    I don't think it needs to — if one had dependencies already one could use the hands on approach. I also suspect that for most people creating a full new download will result in the best experience to avoid version mismatches and the like (yeah, it's a little bandwidth wasteful, but not too much)
  1. Before this is finalized, do you want to change any of the new names I've made up? (edit)
    I think the names are good personally. create_package_with_all_dependencies() is the main UX entry point, it's long but nice and descriptive. We could 🚲 🏠 a bit and change package to bundle but I'm not sure that's a user-experience enhancement.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful work, thank you for doing this!

@jonkeane jonkeane force-pushed the fix-12981 branch 2 times, most recently from e3e4f60 to 7a0bdd4 Compare September 7, 2021 14:49
@jonkeane jonkeane closed this in 9064fa0 Sep 7, 2021
@karldw
Copy link
Contributor Author

karldw commented Sep 7, 2021

Thank you both for all your help and patience getting this across the finish line!

@karldw karldw deleted the fix-12981 branch September 7, 2021 16:11
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
I took a stab at implementing the approach @nealrichardson laid out in [ARROW-12981](https://issues.apache.org/jira/browse/ARROW-12981?focusedCommentId=17400415#comment-17400415). Please let me know what you think, and if you'd like any changes!

I wrote some basic tests for the `download_optional_dependencies()` helper function, but it would be good to have more comprehensive install tests. These could be something like:

```sh
export LIBARROW_BINARY=false
export LIBARROW_BUILD=true
export LIBARROW_DOWNLOAD=false
export LIBARROW_MINIMAL=false

# Make sure offline, feature-light installation works
R -e "install.packages('arrow_x.y.z.p.tar.xz')"
R -e 'stopifnot(arrow::arrow_available(), isFALSE(arrow::arrow_info()$capabilities["parquet"]))'

# Download and install the thirdparty features
R -e "arrow::download_optional_dependencies('arrow-thirdparty')"
source arrow-thirdparty/DEFINE_ENV_VARS.sh
R -e "install.packages('arrow_x.y.z.p.tar.xz')
R -e 'stopifnot(arrow::arrow_available(), isTRUE(arrow::arrow_info()$capabilities["parquet"]))'
```

Closes apache#11001 from karldw/fix-12981

Lead-authored-by: karldw <karldw@users.noreply.github.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants