Conversation

@nealrichardson nealrichardson commented Dec 19, 2019

With this patch, install.packages("arrow") on Linux should now result in a functioning R package. On installation, the R configure script will, if Arrow C++ is not found locally, attempt to download a prebuilt static C++ library that corresponds to the local OS and the R package version; if none is found, it will download the C++ source (or look for it in a local git checkout) and attempt to compile it. In this latter case, installation will be slow, but it should work: the script in r/inst/build_arrow_static.sh makes a bundled, static build. Any additional build-time dependencies (cmake, and flex and bison for thrift) will be downloaded and built if necessary. If all of this fails, then the current "arrow without arrow" package is built.
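
The fallback chain above can be sketched as a small shell function. This is purely illustrative: the function name, arguments, and labels are hypothetical, not the actual r/configure code.

```shell
#!/bin/sh
# Hypothetical sketch of the configure-time fallback chain described above.
# Function name, arguments, and output labels are illustrative.
choose_install_method() {
  # $1: "yes" if Arrow C++ is already found locally
  # $2: "yes" if a prebuilt static binary exists for this OS and package version
  # $3: "yes" if the C++ source can be downloaded or found in a git checkout
  if [ "$1" = "yes" ]; then
    echo "use-local-libarrow"
  elif [ "$2" = "yes" ]; then
    echo "download-prebuilt-binary"
  elif [ "$3" = "yes" ]; then
    echo "build-from-source"     # slow path: r/inst/build_arrow_static.sh
  else
    echo "arrow-without-arrow"   # last resort: package builds without features
  fi
}

choose_install_method no yes no   # prints "download-prebuilt-binary"
```

Each branch is tried in order, so users with a local Arrow C++ never pay the download or compile cost.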

The r docker-compose service this patch adds is set up such that we can test against any Docker image containing R on Docker Hub. The GitHub Actions workflow and crossbow nightly tasks added here test the C++ source building on 7 distro/versions. This testing uncovered a few sharp corners in the bundled static cmake build, which have been noted/ticketed/fixed.

I summarized the installation workflow and the ways to control and debug it in a vignette: r/vignettes/install.Rmd. That document also points to the code/scripts that do the work.

To support this workflow and provide nightly builds that will prove that it works, I am separately setting up some CI at https://github.com/ursa-labs/arrow-r-nightly, where we already build nightly macOS and Windows binary packages. This CI

  • Builds static binaries for some set of distribution-versions (e.g. "ubuntu-18.04", "centos-7", etc.) and hosts them at https://dl.bintray.com/ursa-labs/arrow-r/libarrow/bin/$DISTRO-$OS_VERSION/arrow-$PKG_VERSION.zip.
  • Hosts C++ source snapshots corresponding to the R package versions (both CRAN releases and nightlies) at https://dl.bintray.com/ursa-labs/arrow-r/libarrow/src/arrow-$PKG_VERSION.zip. If not found for $PKG_VERSION, configure will then look to download the official Apache Arrow source release. This "fallback" will enable CRAN releases to work without the Ursa bintray, while having it first look to bintray will allow us to patch the source if necessary after the official Apache release, which has been necessary for past R releases (and permitted since R packages are not officially voted on).
  • Tests that those work.
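
For concreteness, the two download locations in the list above can be assembled like this. The variable values below are examples only; the real values are detected at install time.

```shell
#!/bin/sh
# Illustrative assembly of the bintray URLs listed above; values are examples.
DISTRO="ubuntu"
OS_VERSION="18.04"
PKG_VERSION="0.15.1.9000"

BIN_URL="https://dl.bintray.com/ursa-labs/arrow-r/libarrow/bin/${DISTRO}-${OS_VERSION}/arrow-${PKG_VERSION}.zip"
SRC_URL="https://dl.bintray.com/ursa-labs/arrow-r/libarrow/src/arrow-${PKG_VERSION}.zip"

echo "$BIN_URL"
echo "$SRC_URL"
```

Note that only the binary URL depends on the distro and version; the source snapshot is keyed by the R package version alone.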

To be clear, the changes in this patch do not require the existence of that external build infrastructure. Without the nightly/binary builds, this patch still allows R CMD INSTALL from inside the git repository to Just Work, and future CRAN releases should also download the official Apache Arrow source release and build the C++ from source. So at least that gets our users a simple (albeit slow) installation experience, and one that works out of the box from a git checkout. The binary builds will improve the installation experience when they exist, and this patch includes the hooks to find and use those prebuilt libraries.

Note that the intent is that this extra C++ installation/compilation will not happen on CRAN itself, only when users install the package themselves. We can relax that later once we have more confidence that the binaries we build will work widely, but restricting it on CRAN does not affect the experience of users who install the package.

Aside: the original idea (discussed on Jira) of building a single "manylinux" Arrow C++ library to use with the R package proved not to work.

@kou kou self-requested a review December 20, 2019 22:44

@kou kou left a comment

I've reviewed this, but I haven't tried it yet. Sorry.


@nealrichardson nealrichardson left a comment

Thanks a lot @kou! A few responses.


@fsaintjacques fsaintjacques left a comment

The cpp/arrow-0.15.1.9999.zip file should not be checked in.

Contributor

Why do you install/build cmake, bison, and flex from R instead of in build_arrow_static.sh?

Member Author

(1) Convenience: for example, I know that download.file() will work, but I don't know if the system has wget.
(2) IMO cmake should handle everything but cmake (https://issues.apache.org/jira/browse/ARROW-7501) so I'd be happy to delete this code then.

Contributor

Does it require the distribution token in the path? I foresee a problem with invoking multiple docker images on the same host, e.g. on a developer machine.

Member Author

It's arbitrary, but if I change it here, I have to change it in the configure script too, and that would mean I'd need to determine the distro/version in the configure script too, and personally I'd prefer to avoid rewriting that logic in bash/sed.

Can we wait to see if what you're fearing is a problem? If it does happen, the package installation should fail pretty fast, and deleting the previously built dir is a quick fix.


@nealrichardson nealrichardson left a comment

Thanks @fsaintjacques, some responses. I'll delete that file that I accidentally checked in, thanks for catching that!

Member

docker is not meaningful in this context. We have an ubuntu-r build and another conda-r build; this should align with those too. How does it differ? Could it replace either one?

According to your comments, this should work on any base image with R installed, so we should capture that information somehow. It seems rhub provides the compatible base images, so perhaps we should name the docker service rhub-r instead of r-docker and hardcode R_DOCKER_ORG to rhub.

I also find the three env variables a bit tedious; how about hardcoding R_DOCKER_ORG as I mentioned above and merging R_DOCKER_IMAGE and R_DOCKER_TAG into RHUB_IMAGE?

Member Author

I would be ok with deleting the ubuntu-r job; technically that tests something slightly different (building the C++ lib and passing its location to R installation with LIB_DIR/INCLUDE_DIR), but honestly these new jobs test more things that we aren't covering elsewhere (operating systems, static builds).

I don't want to hard-code rhub because I'm not using only R images from the rhub organization. I'm already using images from both rhub and rstudio, and there are other Docker Hub orgs out there that we could draw from, if it turns out that we want/need to expand the build matrix.

I agree that the three env variables are tedious and would be happy to use a single R_DOCKER=rhub/ubuntu-gcc-release:latest var. But see L672 and 677: that would result in an invalid image name of apache/arrow-dev:r-rhub/ubuntu-gcc-release:latest. I considered a few alternatives and this seemed to be the least bad. Plus, it's consistent with the other jobs, which have 2-3 env vars that parametrize them. The defaults are set in .env, just like they are for ARCH and UBUNTU et al.
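
To illustrate the naming constraint: Docker image tags may not contain '/' or ':'. The sketch below is hedged; the derived DEV_TAG format is hypothetical and not copied from the actual docker-compose.yml.

```shell
#!/bin/sh
# Why the three variables stay separate: Docker tags may not contain
# '/' or ':'. The DEV_TAG format here is a hypothetical sketch.
R_DOCKER_ORG="rhub"
R_DOCKER_IMAGE="ubuntu-gcc-release"
R_DOCKER_TAG="latest"

# Base image the service tests against:
BASE="${R_DOCKER_ORG}/${R_DOCKER_IMAGE}:${R_DOCKER_TAG}"

# Keeping the pieces separate lets a derived image name stay valid,
# with '/' and ':' confined to the repository part:
DEV_TAG="apache/arrow-dev:r-${R_DOCKER_ORG}-${R_DOCKER_IMAGE}-${R_DOCKER_TAG}"

echo "$BASE"      # rhub/ubuntu-gcc-release:latest
echo "$DEV_TAG"   # apache/arrow-dev:r-rhub-ubuntu-gcc-release-latest
```

A single merged R_DOCKER=rhub/ubuntu-gcc-release:latest variable would instead splice '/' and ':' straight into the tag, producing an invalid name.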

Member

Ok, how about simply naming it r then, instead of r-docker? :)

docker-compose build r vs docker-compose build r-docker

r/configure Outdated
Member

Create a follow-up jira; we should probably have a jira for each TODO note.

Member Author

Will do, I have a few other features to ticket for the future.

Member

Because of the number of builds, I would not run all of these on each PR commit, although their feedback seems pretty useful. So I'd pick a single build (a relevant one, perhaps ubuntu-gcc-release) and configure it in workflows/r.yml. This way we'd have a build running on each PR commit, possibly catching valid issues.

The rest of the builds should run either on push to master or on a cron schedule. @kou, opinions?
If we want to run the remaining builds as cron jobs, then we should place them in r_cron.yml; otherwise they should go in r.yml using the if: github.event_name == 'push' condition for the job.
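
The split suggested here might look roughly like this in a workflow file (a hypothetical sketch, not the actual r.yml; job names and the service name are illustrative):

```yaml
# Hypothetical sketch: one representative build on every PR commit,
# the full matrix only on pushes (e.g. to master).
on: [push, pull_request]
jobs:
  ubuntu-gcc-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - run: docker-compose build r-docker && docker-compose run r-docker
  full-matrix:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - run: docker-compose build r-docker && docker-compose run r-docker
```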

Member Author

I can replace the ubuntu-r job in r.yml with one of these, sounds good.

IMO push to master sounds good, until we start seeing GHA backups and need to throttle it more. The trouble with running on master/cron is: how do we get alerted about failures? IIUC we aren't checking the existing cron GHA jobs for failures to include in our daily crossbow email.

If we're not running these on PR commit, then they should be wired in crossbow for on-demand use in a PR review, right? Do we have similar setup for the existing GHA cron jobs?


@kszucs kszucs Jan 8, 2020

Anyone subscribed to the apache/arrow repository's notifications should receive build error notifications for the cron jobs. For example, I'm getting a couple per day for my fork.

That wiring is manual; we have crossbow jobs defined for most of the docker-compose builds regardless of whether we run them as GHA jobs or not. This means that we can trigger docker builds with comments, even the ones we regularly run on GHA.

To summarize, you're free to choose where to run these docker builds: either via crossbow nightlies (after configuring them in tasks.yml) for better notifications, or via GHA.


@nealrichardson nealrichardson left a comment

Thanks @kszucs !


@nealrichardson nealrichardson force-pushed the r-manylinux branch 2 times, most recently from 2e6e1d6 to 77cf605, on January 9, 2020 at 22:00
@nealrichardson

@ursabot crossbow submit -g r


ursabot commented Jan 9, 2020

Invalid value for "--group" / "-g": invalid choice: r. (choose from docker, integration, cpp-python, conda, wheel, linux, gandiva, nightly)

@nealrichardson

@ursabot crossbow submit test-r-rhub-ubuntu-gcc-release test-r-rstudio-r-base-3.6-bionic test-r-rstudio-r-base-3.6-centos6 test-r-rstudio-r-base-3.6-opensuse15 test-r-rstudio-r-base-3.6-opensuse42


ursabot commented Jan 9, 2020

AMD64 Conda Crossbow Submit (#85664) builder failed.

Revision: 0b1bc93f0f92ca9d8eb1c80b5f470c785b0ff61b

Crossbow: 'python crossbow.py ...' step's stderr:

Traceback (most recent call last):
  File "crossbow.py", line 1305, in <module>
    crossbow(obj={}, auto_envvar_prefix='CROSSBOW')
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "crossbow.py", line 1165, in submit
    queue.put(job, prefix=job_prefix)
  File "crossbow.py", line 600, in put
    files = task.render_files(job=job, arrow=job.target, queue=self)
  File "crossbow.py", line 697, in render_files
    rendered = template.render(task=self, **params)
  File "/opt/conda/lib/python3.7/site-packages/jinja2/asyncsupport.py", line 76, in render
    return original_render(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/opt/conda/lib/python3.7/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/conda/lib/python3.7/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "<template>", line 61, in top-level template code
jinja2.exceptions.UndefinedError: 'wheel_tag' is undefined

@nealrichardson

@ursabot crossbow submit test-r-rhub-ubuntu-gcc-release test-r-rstudio-r-base-3.6-bionic test-r-rstudio-r-base-3.6-centos6 test-r-rstudio-r-base-3.6-opensuse15 test-r-rstudio-r-base-3.6-opensuse42


ursabot commented Jan 9, 2020

AMD64 Conda Crossbow Submit (#85673) builder has succeeded.

Revision: 3a0abf03c07473a84123e4805157348d24e80220

Submitted crossbow builds: ursa-labs/crossbow @ ursabot-423

Task Status
test-r-rhub-ubuntu-gcc-release Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos6 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure


@nealrichardson nealrichardson left a comment

Thanks again @kou, I've made the changes and rebased, PTAL

Member Author

I've rebased and removed this change.

Member Author

Done


kou commented Jan 11, 2020

+1

@nealrichardson

😅 thanks all!
