ARROW-6793: [R] Arrow C++ binary packaging for Linux #6068
Conversation
kou
left a comment
I've reviewed this, but I haven't tried it yet. Sorry.
nealrichardson
left a comment
Thanks a lot @kou! A few responses.
Force-pushed from 5b532d2 to 3e7353a
fsaintjacques
left a comment
The cpp/arrow-0.15.1.9999.zip file should not be checked in.
r/tools/linuxlibs.R
Why do you install/build cmake, bison, and flex from R instead of in build_arrow_static.sh?
(1) Convenience: for example, I know that download.file() will work, but I don't know if the system has wget.
(2) IMO cmake should handle everything but cmake (https://issues.apache.org/jira/browse/ARROW-7501) so I'd be happy to delete this code then.
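The portability concern in (1) can be sketched as follows. This is an illustrative shell fragment, not code from linuxlibs.R, and the URL and filenames are placeholders:

```shell
# Hypothetical illustration: wget/curl may be absent on a minimal system,
# but R's download.file() works wherever R is installed.
url="https://example.com/cmake.tar.gz"  # placeholder URL
if command -v wget >/dev/null 2>&1; then
  cmd="wget $url"
elif command -v curl >/dev/null 2>&1; then
  cmd="curl -LO $url"
else
  # download.file() needs no external downloader, only R itself
  cmd="Rscript -e 'download.file(\"$url\", \"cmake.tar.gz\")'"
fi
echo "would run: $cmd"
```

Doing the download from R sidesteps the question of which downloader the base image ships with.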
r/tools/linuxlibs.R
Does it require the distribution token in the path? I foresee a problem when invoking multiple docker images on the same host, e.g. on a developer machine.
It's arbitrary, but if I change it here, I have to change it in the configure script too, and that would mean I'd need to determine the distro/version in the configure script as well; personally I'd prefer to avoid rewriting that logic in bash/sed.
Can we wait and see whether what you're fearing actually becomes a problem? If it does, the package installation should fail pretty fast, and deleting the previously built dir is a quick fix.
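As a hypothetical illustration of the scheme under discussion (the directory layout here is made up; the real one lives in r/tools/linuxlibs.R and must match what r/configure expects), keying the cached build path by distro/version is what keeps builds from different docker images from colliding on one host:

```shell
# Hypothetical path scheme, NOT the actual layout from this patch.
DISTRO=ubuntu
OS_VERSION=18.04
PKG_VERSION=0.15.1.9999
dist_dir="libarrow/dist-${DISTRO}-${OS_VERSION}/arrow-${PKG_VERSION}"
echo "$dist_dir"
```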
nealrichardson
left a comment
Thanks @fsaintjacques, some responses. I'll delete that file that I accidentally checked in, thanks for catching that!
docker-compose.yml
docker is not meaningful in this context. We have an ubuntu-r build and another conda-r build; this should align with those too. How does it differ? Could it replace either one?
According to your comments this should work on any base image with R installed, so we should capture that information somehow. Since rhub seemingly provides the compatible base images, perhaps we should call the docker service rhub-r instead of r-docker and hardcode R_DOCKER_ORG to rhub.
I also find the three env variables a bit tedious; how about hardcoding R_DOCKER_ORG as I mentioned above and merging R_DOCKER_IMAGE and R_DOCKER_TAG into RHUB_IMAGE?
I would be ok with deleting the ubuntu-r job; technically it tests something slightly different (building the C++ lib and passing its location to the R installation with LIB_DIR/INCLUDE_DIR), but honestly these new jobs test more things that we aren't covering elsewhere (operating systems, static builds).
I don't want to hard-code rhub because I'm not using only R images from the rhub organization. I'm already using images from both rhub and rstudio, and there are other Docker Hub orgs out there that we could draw from if it turns out that we want/need to expand the build matrix.
I agree that the three env variables are tedious and would be happy to use a single R_DOCKER=rhub/ubuntu-gcc-release:latest var. But see L672 and 677: that would result in an invalid image name of apache/arrow-dev:r-rhub/ubuntu-gcc-release:latest. I considered a few alternatives, and this seemed the least bad. Plus, it's consistent with the other jobs, which have 2-3 env vars that parametrize them. The defaults are set in .env, just like they are for ARCH and UBUNTU et al.
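For illustration, here is roughly why the single-variable form breaks (variable names are approximate; see docker-compose.yml for the real ones). The source image reference legitimately contains "/" and ":", but those characters are not valid inside a docker tag, so a single combined variable can't serve both purposes:

```shell
# Splitting org/image/tag lets us construct both the source image
# reference and a valid derived tag for apache/arrow-dev.
R_ORG=rhub
R_IMAGE=ubuntu-gcc-release
R_TAG=latest

source_image="${R_ORG}/${R_IMAGE}:${R_TAG}"   # what we build FROM
derived_tag="r-${R_ORG}-${R_IMAGE}-${R_TAG}"  # a valid tag for apache/arrow-dev
echo "FROM ${source_image} -> apache/arrow-dev:${derived_tag}"
```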
Ok, how about simply naming it r instead of r-docker then? :)
`docker-compose build r` vs. `docker-compose build r-docker`
r/configure
Create a follow-up JIRA; we should probably have a JIRA for each TODO note.
Will do, I have a few other features to ticket for the future.
.github/workflows/r_linux.yml
Because of their number, I would not run all of these builds on each PR commit. Their feedback seems pretty useful, though, so I'd pick a single build (a relevant one, perhaps ubuntu-gcc-release) and configure it in workflows/r.yml. This way we'd have a build running on each PR commit, possibly catching valid issues.
The rest of the builds should run on either push to master or a cron schedule. @kou, opinion?
If we want to run the remaining builds as cron jobs, then we should place them in r_cron.yml; otherwise they should go in r.yml using the `if: github.event_name == 'push'` condition on the job.
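A minimal sketch of what that split might look like in r.yml (job names and steps are illustrative, not the actual workflow):

```yaml
on: [push, pull_request]
jobs:
  ubuntu-gcc-release:          # runs on every PR commit and on push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - run: docker-compose run r
  other-distros:               # push (i.e. master) only
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - run: docker-compose run r
```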
I can replace the ubuntu-r job in r.yml with one of these, sounds good.
IMO push to master sounds good, until we start seeing GHA backlogs and need to throttle it more. The trouble with running on master/cron is: how do we get alerted about failures? IIUC we aren't checking the existing cron GHA jobs for failures to include in our daily crossbow email.
If we're not running these on PR commit, then they should be wired in crossbow for on-demand use in a PR review, right? Do we have similar setup for the existing GHA cron jobs?
Anyone subscribed to the apache/arrow repository's notifications should receive build error notifications for the cron jobs. For example, I'm getting a couple per day for my fork.
That wiring is manual: we have crossbow jobs defined for most of the docker-compose builds, regardless of whether we run them as GHA jobs or not. That means we can trigger docker builds with comments, even the ones we regularly run on GHA.
To summarize, you're free to choose where to run these docker builds: either via crossbow nightlies (after configuring them in tasks.yml) for better notifications, or via GitHub Actions.
nealrichardson
left a comment
Thanks @kszucs !
Force-pushed from 2e6e1d6 to 77cf605
@ursabot crossbow submit -g r

@ursabot crossbow submit test-r-rhub-ubuntu-gcc-release test-r-rstudio-r-base-3.6-bionic test-r-rstudio-r-base-3.6-centos6 test-r-rstudio-r-base-3.6-opensuse15 test-r-rstudio-r-base-3.6-opensuse42

AMD64 Conda Crossbow Submit (#85664) builder failed. Revision: 0b1bc93f0f92ca9d8eb1c80b5f470c785b0ff61b

@ursabot crossbow submit test-r-rhub-ubuntu-gcc-release test-r-rstudio-r-base-3.6-bionic test-r-rstudio-r-base-3.6-centos6 test-r-rstudio-r-base-3.6-opensuse15 test-r-rstudio-r-base-3.6-opensuse42

AMD64 Conda Crossbow Submit (#85673) builder succeeded. Revision: 3a0abf03c07473a84123e4805157348d24e80220 Submitted crossbow builds: ursa-labs/crossbow @ ursabot-423
This reverts commit 57afe6f96b2441a15620f27e69107c3092a43325.
This reverts commit 8c2fafcd3d9bace465f964b228fa7ece9f20f4da.
Force-pushed from a2d7372 to 7d32bc1
nealrichardson
left a comment
Thanks again @kou, I've made the changes and rebased, PTAL
I've rebased and removed this change.
Done
+1

😅 thanks all!
With this patch, `install.packages("arrow")` on Linux should now result in a functioning R package. On installation, the R `configure` script will, if Arrow C++ is not found locally, attempt to download a prebuilt static C++ library that corresponds to the local OS and the R package version; if that is not found, it will download the C++ source (or look for it in a local git checkout) and attempt to compile it. In this latter case, installation will be slow, but it should work: the script in `r/inst/build_arrow_static.sh` makes a bundled, static build. Any additional build-time dependencies (cmake, plus flex and bison for thrift) will be downloaded and built if necessary. If all of this fails, then the current "arrow without arrow" package is built.

The `r` docker-compose service this patch adds is set up so that we can test against any Docker image on Docker Hub that contains R. The GitHub Actions workflow and crossbow nightly tasks added here test building the C++ from source on 7 distro/versions. This testing uncovered a few sharp corners in the bundled static cmake build, which have been noted/ticketed/fixed.

I summarized the installation workflow and the ways to control and debug it in a vignette: `r/vignettes/install.Rmd`. That document also points to the code/scripts that do the work.

To support this workflow and provide nightly builds that will prove that it works, I am separately setting up some CI at https://github.com/ursa-labs/arrow-r-nightly, where we already build nightly macOS and Windows binary packages. This CI publishes the artifacts that `configure` looks for:

- prebuilt binaries: https://dl.bintray.com/ursa-labs/arrow-r/libarrow/bin/$DISTRO-$OS_VERSION/arrow-$PKG_VERSION.zip
- C++ source bundle: https://dl.bintray.com/ursa-labs/arrow-r/libarrow/src/arrow-$PKG_VERSION.zip

If the source bundle is not found for $PKG_VERSION, `configure` will then look to download the official Apache Arrow source release. This "fallback" will enable CRAN releases to work without the Ursa bintray, while having it look at bintray first will allow us to patch the source if necessary after the official Apache release, which has been necessary for past R releases (and is permitted since R packages are not officially voted on).

To be clear, the changes in this patch do not require the existence of that external build infrastructure. Without the nightly/binary builds, this patch still allows `R CMD INSTALL` from inside the git repository to Just Work, and future CRAN releases should also download the official Apache Arrow source release and build the C++ from source. So at least that gets our users a simple (albeit slow) installation experience, and one that works out of the box from a git checkout. The binary builds will improve the installation experience when they exist, and this patch includes the hooks to find and use those prebuilt libraries.

Note that the intent is that this extra C++ installation/compilation will not happen on CRAN itself, only when users install the package themselves. We can relax that later once we have more confidence that the binaries we build will work widely, but restricting it on CRAN does not affect the experience of users who install the package.
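The installation fallback chain described above can be sketched as follows. This is pseudologic with stubbed checks, not the actual configure script:

```shell
# Stubbed checks standing in for the real detection logic in r/configure
# and r/tools/linuxlibs.R:
have_local_arrow()     { false; }  # e.g. a local Arrow C++ install is found
have_prebuilt_binary() { false; }  # binary for this OS/version downloads OK
have_cpp_source()      { true;  }  # source bundle or local checkout found

if have_local_arrow; then
  result="link against local Arrow C++"
elif have_prebuilt_binary; then
  result="use downloaded prebuilt static library"
elif have_cpp_source; then
  result="compile bundled static build via r/inst/build_arrow_static.sh"
else
  result="fall back to the 'arrow without arrow' package"
fi
echo "$result"
```

Each step is tried in order until one succeeds, so the package installs in the best way the environment allows.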
Aside: the original idea (discussed on Jira) of building a single "manylinux" Arrow C++ library to use with the R package proved not to work.