[BEAM-9247] Integrate GCP Vision API functionality #10959
Conversation
Has there been a discussion of whether this (rather specialized) kind of transform belongs in Beam vs. as a separate library built on top of Beam? Also, is the synchronous/batching pattern something that would be generally useful and could be pulled out, or is it really integral to these transforms and contexts?
Most of the discussion took place in PR #10764, and the over-arching Jira ticket is here: https://issues.apache.org/jira/browse/BEAM-9145. I agree that these transforms are slightly different, as they call an external API and let it do most of the heavy-lifting processing. The batching pattern implemented here is quite specific to these transforms; I'm not sure how applicable it would be to pull it out.
There was a discussion on the dev list some time ago. The general conclusion was that it's fine to put Cloud AI transforms directly into Beam. Also, we already have similar transforms implemented: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/ml/gcp
I wonder whether it makes sense to support async (offline) annotation from Beam's perspective. Let's suppose that we don't return anything. On the other hand, the main advantage of sync (online) annotation is that it returns results relatively quickly for further processing.
From my perspective, there should be no problem in sending a request containing up to 5 files to be annotated, then waiting for the result and sending another request. |
I agree that those sorts of tasks are more suited to e.g. Airflow, and that was something that crossed my mind too. If you think that batches of <=5 items are useful, then perhaps we should go with sync (online) batch annotation rather than async. But I guess the question then is how much more useful (or more efficient) a batch of 5 image annotations is compared to 5 separate annotations.
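The trade-off discussed above (one request carrying up to 5 images vs. 5 separate requests) can be sketched with a plain-Python batching helper. This is an illustration only: the 5-file limit comes from the discussion above, and the helper is not part of the actual transform.

```python
from typing import Iterable, Iterator, List

MAX_FILES_PER_REQUEST = 5  # assumed per-request limit from the discussion above


def batched(uris: Iterable[str],
            batch_size: int = MAX_FILES_PER_REQUEST) -> Iterator[List[str]]:
  """Group image URIs into batches of at most `batch_size` elements."""
  batch: List[str] = []
  for uri in uris:
    batch.append(uri)
    if len(batch) == batch_size:
      yield batch
      batch = []
  if batch:  # emit the final, possibly smaller, batch
    yield batch


# One synchronous API call per batch instead of one per image:
uris = [f"gs://bucket/img_{i}.jpg" for i in range(12)]
batches = list(batched(uris))
# 12 images -> 3 requests (5 + 5 + 2) rather than 12 single-image requests.
```

Whether one 5-image request is actually cheaper than 5 single-image requests is the open question from the comment above; the batching itself is straightforward.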
I don't see any obstacles. Also take a look at |
|
Cheers for the tip about |
|
Cool, nice to hear that! One more thing that came to my mind. |
|
Overall direction LGTM. One comment on the setup.py
Good guess! Indeed, looking at the source code, I'll need to keep the min_batch_size and max_batch_size parameters for testing purposes, in order to get a valid count of the API calls; otherwise the number of calls depends on the BatchElements transform. But I'll make sure to document these two parameters as such.
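The testing concern above can be illustrated with a mock client: once the batch size is pinned (min_batch_size == max_batch_size), the number of API calls for a given input size is deterministic and can be asserted. The client and `batch_annotate` method here are hypothetical stand-ins, not the transform's actual internals.

```python
from unittest import mock

BATCH_SIZE = 5  # pinning the batch size makes the call count deterministic


def annotate_all(client, uris, batch_size=BATCH_SIZE):
  """Send one batch_annotate call per fixed-size slice of URIs."""
  for start in range(0, len(uris), batch_size):
    client.batch_annotate(uris[start:start + batch_size])


client = mock.Mock()
annotate_all(client, [f"gs://bucket/{i}.jpg" for i in range(12)])
# With a pinned batch size of 5, 12 inputs always yield exactly 3 calls.
assert client.batch_annotate.call_count == 3
```

Without pinning, a dynamic batcher (like BatchElements) may choose varying batch sizes, so the expected call count in a test would be unstable.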
@aaltay I noticed an issue in I also updated |
```python
class VisionTest(unittest.TestCase):
  def setUp(self):
    self._mock_client = mock.Mock()
    self.m2 = mock.Mock()
```
This mock seems to be unused
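Unused mocks like the one flagged above are easy to confirm: `Mock.assert_not_called()` and the `called` attribute make it explicit whether a fixture was ever exercised, so a leftover mock can be deleted with confidence. A minimal sketch:

```python
from unittest import mock

mock_client = mock.Mock()
m2 = mock.Mock()  # never exercised below; a candidate for removal

# Only the real fixture gets used by the "test body":
mock_client.annotate_image("gs://bucket/cat.jpg")

assert mock_client.annotate_image.called  # the real fixture was used
m2.assert_not_called()                    # nothing touched m2 -> safe to remove
```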
```python
query_result = result.metrics().query(read_filter)
if query_result['counters']:
  read_counter = query_result['counters'][0]
  self.assertTrue(read_counter.committed == expected_counter)
```
It's safer to use the result property rather than committed, because when committed is empty, attempted is used instead. See the code:
beam/sdks/python/apache_beam/metrics/execution.py
Lines 129 to 134 in f9f3159
```python
@property
def result(self):
  """Short-hand for falling back to attempted metrics if it seems that
  committed was not populated (e.g. due to not being supported on a given
  runner"""
  return self.committed if self.committed else self.attempted
```
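The fallback behaviour quoted above can be reproduced in isolation. The class below is a standalone re-implementation for illustration only, not the Beam MetricResult class itself; it just mirrors the committed-then-attempted logic of the quoted property.

```python
class MetricResultSketch:
  """Minimal stand-in mirroring the committed/attempted fallback quoted above."""

  def __init__(self, committed, attempted):
    self.committed = committed
    self.attempted = attempted

  @property
  def result(self):
    # Fall back to attempted metrics when committed was not populated
    # (e.g. the runner does not support committed metrics).
    return self.committed if self.committed else self.attempted


# On a runner that populates committed metrics, result uses them:
assert MetricResultSketch(committed=10, attempted=12).result == 10
# On a runner that does not, result transparently falls back:
assert MetricResultSketch(committed=None, attempted=12).result == 12
```

This is why asserting on `read_counter.result` is more portable across runners than asserting on `read_counter.committed` directly.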
LGTM. I've left two small comments, but consider them non-blockers. Thanks for merging those PTransforms I mentioned and for documenting the batch_size parameters.
Thanks for the thorough review! |
retest this please |
This PR refers to https://issues.apache.org/jira/browse/BEAM-9247: [Python] PTransform that integrates Cloud Vision functionality.
Small comments:
The synchronous annotation is very similar to the videointelligence implementation.
I opted to only support async (offline) batch annotation:
I'm still working on a few kinks and issues with the batch annotation, as well as making sure all tests run properly, so consider this a draft PR for now. Update: the PR is now finalised.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the chosen reviewer(s) in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.