[AIRFLOW-2193] Add ROperator for using R #3115
The ROperator allows tasks to be specified using the R programming language for statistical computing. Tasks can be specified either as R operations or using a source file. Optionally, the last line of output can be pushed to an Xcom. ROperator requires that R be installed. By default, it uses the Rscript interpreter, but littler has also been tested. If Rscript is not in PATH, the full path can be specified as argument to `rscript_bin`. Data can be passed to R using either the environment or through templating.
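A minimal sketch of the Rscript-based approach described above, assuming the operator shells out to the interpreter much like BashOperator does. The helper names `build_r_command` and `run_r` and the `is_file` flag are hypothetical and are not taken from the PR; only `rscript_bin` and the environment-passing idea come from the description.

```python
import os
import subprocess


def build_r_command(r_command, rscript_bin="Rscript", is_file=False):
    """Return the argv list for running an R expression or an R source file.

    `Rscript -e EXPR` evaluates an expression; `Rscript FILE` runs a script.
    """
    if is_file:
        return [rscript_bin, r_command]
    return [rscript_bin, "-e", r_command]


def run_r(r_command, env=None, rscript_bin="Rscript", is_file=False):
    """Run R and return the last line of stdout (the value one might push to XCom)."""
    # Merge the extra variables into a copy of the current environment so the
    # R code can read them with Sys.getenv().
    full_env = dict(os.environ)
    full_env.update(env or {})
    result = subprocess.run(
        build_r_command(r_command, rscript_bin, is_file),
        env=full_env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if R exits with a non-zero status
    )
    lines = result.stdout.splitlines()
    return lines[-1] if lines else None
```

Calling `run_r('cat(Sys.getenv("GREETING"))', env={"GREETING": "hello"})` would require R to be installed; `build_r_command` on its own does not.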
|
Looks like the tests are failing because R is not installed (missing Rscript command), which was expected. I'm happy to take suggestions as to how to deal with this. The Operator is heavily inspired by BashOperator, and I noticed that it doesn't have tests. |
|
@briandconnelly thank you for your contribution! I recommend using a Python library for R (e.g. rpy2) instead of using Rscript to execute R commands; that way the dependency can be managed in the setup script (I believe rpy2 is also compatible with Python's pandas library, which makes it easy to expand into other R operators in the future if needed). I'm not a huge fan of operators having a hard dependency on shell scripts (which could simply be achieved with a BashOperator, and it's rather fragile since there is no version control). |
|
@jgao54 I agree. I had originally used rpy2, but discovered that people often have issues compiling it. Still, it makes more sense in this context. I'll update that version and the PR. Thanks for the helpful feedback! |
|
@briandconnelly still working on this? Happy to help get it over the line if you're tied up elsewhere. |
|
@benjamingregory feel free to take a crack at it! I kept running into issues with getting rpy2 set up easily, and it's kind of been on the back burner since then. |
|
Sounds good @briandconnelly. I'll see what I can do :) |
|
@benjamingregory just did a quick test where I replaced the |
|
Awesome @briandconnelly! I'm kinda swamped the next couple of days but I could test it this weekend and see if I can find a way to break it :) |
|
Hey! Any updates here @benjamingregory and @briandconnelly? |
|
@ricardo-bion I've tested the rpy2-based operator a few times with success, but haven't had much time lately to test it further or make test cases for it. |
|
Sorry for the radio silence on my end. Wasn't able to test way back in and fell to the wasteland of my to-do unfortunately. Completely booked for the next two weeks so anyone else feel free to jump in if available for testing. |
|
Hi, just curious about this issue. Is anyone testing on it right now? |
|
@jensenity not at the moment, but the |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
Did this move on or die on the vine? I'm seeing quite a bit of chatter in the R community about Airflow and folks wishing they had tighter integration. |
|
@CerebralMastication it did. I don't do much R, but does this PR look sensible to you? Would it do what you wanted of it? |
|
I can think of some use cases where I would totally use this. I have others that I'll probably leave in a bash operator. But having a little tighter R integration could be really interesting. |
|
@CerebralMastication I've intended to get this moving forward again for a while now, but haven't had the time. The code itself works just fine. I've had DAGs using it for quite a while. Getting some good tests in there has been a challenge. |
|
Are there any chances of merging this pull request in near future? |
|
@OmerJog The error occurs because our CI environment has no R package installed. @matkalinowski It will be merged once CI passes and a committer has reviewed it.
ashb
left a comment
Could we stub/mock robjects.r.source (and maybe more?) so that we don't need a working R environment installed on CI?
def __init__(
    self,
    r_command,
    env={},
Mutable defaults should be avoided in Python
-    env={},
+    env=None,
and then in the function
self.env = env or {}
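To illustrate why the mutable default matters, here is a small standalone sketch. The function names are hypothetical and unrelated to the operator's actual code.

```python
def add_var_shared(key, value, env={}):
    # The default dict is created once, when the function is defined,
    # so every call that omits `env` mutates the same object.
    env[key] = value
    return env


def add_var_safe(key, value, env=None):
    # A fresh dict is created on each call when `env` is omitted.
    env = env or {}
    env[key] = value
    return env


first = add_var_shared("A", "1")
second = add_var_shared("B", "2")
# `second` also contains "A", because it is the very same dict as `first`.

safe_first = add_var_safe("A", "1")
safe_second = add_var_safe("B", "2")
# `safe_first` and `safe_second` are independent dicts.
```

Note that `env or {}` also replaces an empty dict passed in by the caller with a new one; `{} if env is None else env` avoids that if the caller's object needs to be preserved.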
|
@ashb That may be a good idea, but I'm not sure. I tried to change the Airflow CI Docker image several times some months ago, but failed. |
I have created a pull request on the image repository; I hope it helps. |
|
As I mentioned in the other PR - I really want to avoid running R in the Airflow unit tests - we don't want to test R or the bindings (as that is the job of the other projects test suites), just that we call it as expected. |
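One way this could look, as a sketch only: stand-in modules are placed in `sys.modules` so that neither rpy2 nor R needs to be installed, and the test checks only that `robjects.r.source` is called as expected. The function `run_r_source` is a hypothetical stand-in for the operator's execute step, not the PR's code.

```python
import sys
import types
from unittest import mock


def run_r_source(path):
    """Hypothetical stand-in for the operator logic: source an R file via rpy2."""
    import rpy2.robjects as robjects  # imported lazily so it can be stubbed

    return robjects.r.source(path)


def test_run_r_source_is_called_without_r():
    # Build fake `rpy2` and `rpy2.robjects` modules; `r` is a MagicMock, so
    # any attribute access or call on it is recorded instead of reaching R.
    fake_robjects = types.ModuleType("rpy2.robjects")
    fake_robjects.r = mock.MagicMock(name="r")
    fake_robjects.r.source.return_value = "sourced"
    fake_rpy2 = types.ModuleType("rpy2")
    fake_rpy2.robjects = fake_robjects

    fakes = {"rpy2": fake_rpy2, "rpy2.robjects": fake_robjects}
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, fakes):
        result = run_r_source("/tmp/script.R")

    assert result == "sourced"
    fake_robjects.r.source.assert_called_once_with("/tmp/script.R")
```

The test verifies the call into the binding rather than R's behaviour, which matches the intent of not exercising R or rpy2 themselves in the Airflow unit tests.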
|
any update on this? |
|
@ldfreight The tests need changing as I requested is all. (For example we don't run spark from the spark operator tests either) |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
so... is this the end of my hopes and dreams of seeing the ROperator in Airflow? 😢 |
|
Hmm. @gnatamania Unless you find someone who can pick this up @briandconnelly -> maybe you still want to continue that? Or maybe you can pick it up yourself @gnatamania ? I can reopen it if needed any time :) |
|
Yeah, I don't plan on pursuing this further. Building and maintaining the test framework for it is complicated, and I just don't have the time for it. But! @gnatamania, I'd say you have three options:
I've used all three options and actually prefer the latter two. |
|
I plan to build an R operator for Airflow, but it has been delayed by the Chinese Spring Festival and COVID-19 in China. |
|
Worth mentioning https://github.com/ropensci/drake/ as an alternative for people who are building data pipelines in R. Not the best option in a mixed-language environment, but if your whole pipeline is in R it's a great option. |
|
Hi, is this still ongoing? @dpmccabe is right about drake, but drake does not offer scheduling at all. Having a proper R operator in Airflow would simplify quite a lot of things. BR |
|
@edgBR I have no plans to resume working on this. Using either |
|
Do we have an ROperator with Airflow >2 version ? @briandconnelly Any reason why this ROperator was closed and not committed to Airflow? Can it be used for local airflow running RScripts? |
|
do we have any update on Roperator with airflow>2 version ? |
|
Any reason this has died on the vine? |
|
This seems to be an infrastructure problem with the Python test suite? |
|
Do we have any update on the R-operator? I have a use for this operator and hope that it can be moved forward. |
|
Generally, if no one contributes it, it will not happen. Airflow is created by 2700 contributors, and the most certain way to get something that you care about is to a) contribute or b) pay someone to contribute; otherwise you generally have to wait for someone to c) voluntarily contribute it. This is how open source projects and contributions work. So @KarthikRajashekaran @mkaja @ssefick @dwells-capstone - since you have an interest in this you are the prime candidates to make it faster. If there is no one else who will do it - you are the driving force there. |
|
@potiuk I want to pick up this effort on the ROperator as I need to develop one. Anyway, what is the process for restarting work on a closed issue like this one? To be clear, I will be undertaking this work anyway, so I want to make sure my efforts contribute to the Airflow software that I use daily. |
|
Just start a PR. There is no need to have an issue opened for a PR to be reviewed/merged. |
Make sure you have checked all steps below.
JIRA
Description
The ROperator allows tasks to be specified using the R programming
language for statistical computing. Tasks can be specified either
as R operations or using a source file. Optionally, the last line of
output can be pushed to an Xcom.
ROperator requires rpy2 (and R)
Data can be passed to R using either the environment or through
templating.
Tests
Several tests are specified in
tests/contrib/operators/test_r_operator.py, which cover most of the exposed features
Commits
My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
Passes
git diff upstream/master -u -- "*.py" | flake8 --diff