Skip to content

Conversation

@briandconnelly
Copy link

@briandconnelly briandconnelly commented Mar 8, 2018

Make sure you have checked all steps below.

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:

The ROperator allows tasks to be specified using the R programming
language for statistical computing. Tasks can be specified either
as R operations or using a source file. Optionally, the last line of
output can be pushed to an Xcom.

ROperator requires rpy2 (and R)

Data can be passed to R using either the environment or through
templating.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Several tests are specified in tests/contrib/operators/test_r_operator.py, which I cover most of the exposed features

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":

    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

The ROperator allows tasks to be specified using the R programming
language for statistical computing. Tasks can be specified either
as R operations or using a source file. Optionally, the last line of
output can be pushed to an Xcom.

ROperator requires that R be installed. By default, it uses the
Rscript interpreter, but littler has also been tested. If Rscript
is not in PATH, the full path can be specified as argument to
`rscript_bin`.

Data can be passed to R using either the environment or through
templating.
@briandconnelly
Copy link
Author

Looks like the tests are failing because R is not installed (missing Rscript command), which was expected.

I'm happy to take suggestions as to how to deal with this. The Operator is heavily inspired by BashOperator, and I noticed that it doesn't have tests.

@jgao54
Copy link

jgao54 commented Mar 10, 2018

@briandconnelly thank you for your contribution!

I recommend using a python library for R (i.e. rpy2) to avoid using RScript to execute R commands, that way the dependency can be managed in the setup script (I believe rpy2 is also compatible with python's pandas library, which makes it easy to expand into other R operators in the future if needed).

I'm not a huge fan of having operators to have hard dependency on shell scripts (which could simply be achieved with a BashOperator, and it's rather fragile since there are no version control)

@briandconnelly
Copy link
Author

@jgao54 I agree. I had originally used rpy2, but discovered that people often have issues compiling it.

Still, it makes more sense in this context. I'll update that version and the PR. Thanks for the helpful feedback!

@benjamingregory
Copy link

@briandconnelly still working on this? Happy to help get it over the line if you're tied up elsewhere.

@briandconnelly
Copy link
Author

@benjamingregory feel free to take a crack at it! I kept running into issues with getting rpy2 set up easily, and it's kind of been on the back burner since then.

@benjamingregory
Copy link

Sounds good @briandconnelly. I'll see what I can do :)

@briandconnelly
Copy link
Author

briandconnelly commented May 8, 2018

@benjamingregory just did a quick test where I replaced the Popen part in the version in the PR with robjects.r.source(fname) (from rpy2). It seems to work fine with basic commands and files, so it might be an easy fix.

Brian Connelly added 3 commits May 8, 2018 11:16
The ROperator allows tasks to be specified using the R programming
language for statistical computing. Tasks can be specified either
as R operations or using a source file. Optionally, the last line of
output can be pushed to an Xcom.

ROperator requires that R be installed. By default, it uses the
Rscript interpreter, but littler has also been tested. If Rscript
is not in PATH, the full path can be specified as argument to
`rscript_bin`.

Data can be passed to R using either the environment or through
templating.
@benjamingregory
Copy link

Awesome @briandconnelly! I'm kinda swamped the next couple of days but I could test it this weekend and see if I can find a way to break it :)

@ricardo-bion
Copy link

Hey! Any updates here @benjamingregory and @briandconnelly?

@briandconnelly
Copy link
Author

@ricardo-bion I've tested the rpy2-based operator a few times with success, but haven't had much time lately to test it further or make test cases for it.

@benjamingregory
Copy link

Sorry for the radio silence on my end. Wasn't able to test way back in and fell to the wasteland of my to-do unfortunately.

Completely booked for the next two weeks so anyone else feel free to jump in if available for testing.

@jensenity
Copy link

Hi, just curious about this issue. Is anyone testing on it right now?

@briandconnelly
Copy link
Author

@jensenity not at the moment, but the ROperator that's in the branch should be good for testing if you'd like to give it a try. Please let me know how it does if you test it out.

@stale
Copy link

stale bot commented Dec 10, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 10, 2018
@stale stale bot closed this Dec 18, 2018
@CerebralMastication
Copy link

Did this move on or die on the vine? I'm seeing quite a bit of chatter in the R community about Airflow and folks wishing they had tighter integration.

@ashb
Copy link
Member

ashb commented Feb 12, 2019

@CerebralMastication it did. I don't do much R, but does this PR look sensible to you? Would it do what you wanted of it?

@ashb ashb reopened this Feb 12, 2019
@stale stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 12, 2019
@CerebralMastication
Copy link

I can think of some use cases where I would totally use this. I have others that I'll probably leave in a bash operator. But having a little tighter R integration could be really interesting.

@briandconnelly
Copy link
Author

@CerebralMastication I've intended to get this moving forward again for a while now, but haven't had the time.

The code itself works just fine. I've had DAGs using it for quite a while. Getting some good tests in there has been a challenge.

@matkalinowski
Copy link

Are there any chances of merging this pull request in near future?

@zhongjiajie
Copy link
Member

@OmerJog Error because our CI env have no R package install. @matkalinowski Will be merge when the CI pass and review by committer

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we stub/mock robjects.r.source (and maybe more?) so that we don't need an working R environment installed on CI?

def __init__(
self,
r_command,
env={},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable defaults should be avoided in Python

Suggested change
env={},
env=None,

and then in the function

    self.env = env or {}

@zhongjiajie
Copy link
Member

@ashb That maybe a good idea, but I don't know. And I try to change Airflow-ci docker image several times months before but failed.

@matkalinowski
Copy link

@OmerJog Error because our CI env have no R package install. @matkalinowski Will be merge when the CI pass and review by committer

I have created pull request on image repository, hope it helps.

@ashb
Copy link
Member

ashb commented Jun 12, 2019

As I mentioned in the other PR - I really want to avoid running R in the Airflow unit tests - we don't want to test R or the bindings (as that is the job of the other projects test suites), just that we call it as expected.

@ldfreight
Copy link

ldfreight commented Oct 1, 2019

any update on this?
We are also interested in this feature.
Why is integrating R different from any other operator?

@ashb
Copy link
Member

ashb commented Oct 1, 2019

@ldfreight The tests need changing as I requested is all. (For example we don't run spark from the spark operator tests either)

@stale
Copy link

stale bot commented Nov 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 20, 2019
@stale stale bot closed this Nov 27, 2019
@gnatamania
Copy link

so... is this the end of my hopes and dreams of seeing the ROperator in Airflow? 😢

@potiuk
Copy link
Member

potiuk commented Feb 12, 2020

Hmm. @gnatamania Unless you find someone who can pick this up @briandconnelly -> maybe you still want to continue that? Or maybe you can pick it up yourself @gnatamania ? I can reopen it if needed any time :)

@briandconnelly
Copy link
Author

Yeah, I don't plan on pursuing this further. Building and maintaining the test framework for it is complicated, and I just don't have the time for it. But! @gnatamania, I'd say you have three options:

  • If you have rpy2 working, make a copy of r_operator.py from this PR and drop it into your projects
  • Use BashOperator and call Rscript to run your R code
  • Run R in Docker containers (see rocker) and use the DockerOperator

I've used all three options and actually prefer the latter two.

@zhongjiajie
Copy link
Member

I have a plan to make r operator in Airflow, but delay due to Chinese Spring Festival and China COVID-19

@dpmccabe
Copy link

Worth mentioning https://github.com/ropensci/drake/ as an alternative for people who are building data pipelines in R. Not the best option in a mixed-language environment, but if your whole pipeline is in R it's a great option.

@edgBR
Copy link

edgBR commented Jun 23, 2020

Hi,

Is this still ongoing? @dpmccabe is right about drake but drake does not offer schedulling at all.

Having a proper R operator in airflow would simplify quite a lot of things.

BR
/E

@briandconnelly
Copy link
Author

@edgBR I have no plans to resume working on this. Using either BashOperator to run R code via Rscript or DockerOperator to run R code in a container works well.

@KarthikRajashekaran
Copy link

KarthikRajashekaran commented Dec 20, 2022

Do we have an ROperator with Airflow >2 version ? @briandconnelly Any reason why this ROperator was closed and not committed to Airflow? Can it be used for local airflow running RScripts?

@mkaja
Copy link

mkaja commented Feb 16, 2023

do we have any update on Roperator with airflow>2 version ?

@ssefick
Copy link

ssefick commented May 31, 2023

Any reason this has died on the vine?

@ssefick
Copy link

ssefick commented May 31, 2023

This seems to be an infrastructure problem with the python test suit?

@dwells-capstone
Copy link

Do we have any update on the R-operator? I have a use for this operator and hope that it can be moved forward.

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

Generaly if no-one contributes it, it will not happen. Airflow is created by 2700 contributors and the most certain way to get something that you care about is to a) contribute b) pay someone to contribute otherwise you generally have to wait for someone to c) voluntarily contribute it.

This is how Open Source project and contributions work.

So @KarthikRajashekaran @mkaja @ssefick @dwells-capstone - since you have an interest in this you are the prime candidates to make it faster. If there is no-one else who will do it - you are the driving force there.

@paid-geek
Copy link

@potiuk I want to pick up this effort on the ROperator as I need to develop one. Anyway, what is the process for restarting work on a closed issue like this one? To be clear, I will be undertaking this work anyway, so I want to make sure my efforts contribute to the Airflow software that I use daily.

@potiuk
Copy link
Member

potiuk commented Apr 20, 2024

Just start PR. There is no need to have issue opened for PR to be reviewed/merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.