Fix python examples tests not running in Dataflow #23546
|
Run Python Examples_Direct |
|
Run Python Examples_Dataflow |
|
@tvalentyn could you help me to approve running workflows to test my fixes? |
|
Run Python Examples_Flink |
|
Run Python Examples_Spark |
Codecov Report
@@            Coverage Diff             @@
##           master   #23546      +/-   ##
==========================================
- Coverage   73.46%   73.09%   -0.37%
==========================================
  Files         718      729      +11
  Lines       95884    98231    +2347
==========================================
+ Hits        70438    71799    +1361
- Misses      24135    25121     +986
  Partials     1311     1311
|
Run Python Examples_Spark |
|
@tvalentyn I've fixed some of the examples so they can run in Dataflow, but a couple of them are still failing: https://ci-beam.apache.org/job/beam_PostCommit_Python_Examples_Dataflow_PR/20/ I'm not sure why the assertions produce different results there. Do you have any insight into fixing them easily, or do you think it is better to sickbay those tests for the Dataflow suite and file a new issue? |
|
Filing issues and sickbaying sounds good |
|
thank you |
|
Run Python Examples_Dataflow |
|
Assigning reviewers. If you would like to opt out of this review, comment R: @TheNeuralBit for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
TheNeuralBit
left a comment
Thank you! I have a few suggestions
try:
  from apache_beam.io.gcp import gcsio
except ImportError:
  gcsio = None
Does this actually protect us? It looks like this would just change the error to a more confusing one ("None has no attribute GcsIO") at the point where gcsio is used. I think just letting the ImportError raise would be preferable.
Alternatively we could add a skipIf(gcsio is None), but that might lead to us unintentionally skipping it indefinitely.
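A minimal sketch of the two options discussed in this comment, for illustration only. The helper `describe_failure` and the test class `CreateFileTest` are hypothetical names that do not come from the PR; only the guarded import is taken from the diff. The sketch shows the less helpful error that surfaces when the guarded module is `None`, and what the `skipIf` alternative might look like.

```python
import unittest

# Same guarded import as in the diff: when the GCP extras are missing,
# gcsio silently becomes None instead of raising ImportError.
try:
  from apache_beam.io.gcp import gcsio
except ImportError:
  gcsio = None


def describe_failure(module):
  """Hypothetical helper: returns the error message raised when a
  module that was replaced by None is used as if it were imported."""
  try:
    module.GcsIO()
  except AttributeError as e:
    return str(e)
  return None


class CreateFileTest(unittest.TestCase):  # hypothetical test class
  # The skipIf alternative: the test is reported as skipped rather than
  # failing, which is also why it can go unnoticed indefinitely.
  @unittest.skipIf(gcsio is None, 'GCP dependencies are not installed')
  def test_gcsio_available(self):
    self.assertTrue(hasattr(gcsio, 'GcsIO'))


# The error no longer mentions the missing dependency at all.
print(describe_failure(None))
```

With the guard in place, the failure points at `NoneType` rather than at the missing package, which is the confusion the comment describes.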
logging.info('Creating file: %s', path)
gcs = gcsio.GcsIO()
with gcs.open(path, 'w') as f:
  f.write(str.encode(contents, 'utf-8'))
nit: I think it would be better if these utilities used the Filesystems API, see here for an example:
beam/sdks/python/apache_beam/examples/dataframe/taxiride_it_test.py
Lines 93 to 100 in 45cc085
def read_csv(path):
  with FileSystems.open(path) as fp:
    return pd.read_csv(fp)

result = pd.concat(
    read_csv(metadata.path) for metadata in FileSystems.match(
        [f'{self.output_path}*'])[0].metadata_list)
result = result.sort_values('Borough').reset_index(drop=True)
It would also be good to extract these out into testing.utils rather than copying them.
Thanks, @TheNeuralBit, I applied some of your suggestions
|
Run Python Examples_Dataflow |
|
Run Python Examples_Direct |
|
Reminder, please take a look at this pr: @TheNeuralBit |
|
Run Python PreCommit |
|
Run Python 3.8 PostCommit |
|
Thanks, @TheNeuralBit! |
* Fix tests for examples not running in Dataflow
* Remove unused test
* Add todos to enable test for Dataflow
* Refactor utilities functions to create and read files
* Fix lint errors
* Fix lint errors and skip tests that require gcsio and is not available
* Refactor read file function and remove gcsio dependency
Resolves #22983
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username). addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead. CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.