
Conversation

@EDjur (Contributor) commented Feb 25, 2020

This PR refers to https://issues.apache.org/jira/browse/BEAM-9247: [Python] PTransform that integrates Cloud Vision functionality.

Small comments:
The synchronous annotation is very similar to the videointelligence implementation.

I opted to only support async (offline) batch annotation:

  • Synchronous batch annotation only supports batching <=5 elements at a time, so is not very useful (https://cloud.google.com/vision/docs/batch).
  • Synchronous batch annotation also doesn't make a lot of sense for Beam, as the CPU/workers would just sit around and wait until a response is received from the Vision API, which seems inefficient.

I'm still working on a few kinks and issues with the batch annotation, as well as making sure all tests run properly, so consider this a draft PR for now. (Update: the PR is now finalised.)


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

(Jenkins build-status badge matrix for the Go, Java, Python and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza and Spark runners; badge images not reproduced here.)

Pre-Commit Tests Status (on master branch)

(Jenkins build-status badge matrix for the Java, Python, Go and Website pre-commit jobs, non-portable and portable; badge images not reproduced here.)

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@EDjur (Contributor, Author) commented Feb 25, 2020

R: @aaltay @kamilwu

@robertwb (Contributor) commented:

Has there been a discussion of whether this (rather specialized) kind of transform belongs in beam vs. as a separate library built on top of beam?

Also, is the synchronous/batching pattern something that would be generally useful, and could be pulled out, or is it really integral to these transforms and contexts?

@EDjur (Contributor, Author) commented Feb 25, 2020

Most of the discussion happened in PR #10764, and the overarching Jira ticket is https://issues.apache.org/jira/browse/BEAM-9145

I agree that these transforms are slightly different, as they call an external API and let it do most of the heavy-lifting processing.

The batching pattern implemented here is quite specific to these transforms; I'm not sure how useful it would be if pulled out.

@kamilwu (Contributor) commented Feb 25, 2020

Has there been a discussion of whether this (rather specialized) kind of transform belongs in beam vs. as a separate library built on top of beam?

There was a discussion on the dev list some time ago: here. The general conclusion was that it's fine to put Cloud AI transforms directly into Beam. Also, we already have similar transforms implemented: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/ml/gcp

@kamilwu (Contributor) commented Feb 25, 2020

I wonder whether it makes sense to support async (offline) annotation from Beam's perspective. Let's suppose that we don't return anything and AsyncBatchAnnotateImage essentially becomes a sink. In that case, Beam, as a data processing framework, doesn't provide much value. If all the transform does is send a request, there is no point in executing it on multiple Dataflow or Flink workers; we'd rather use a task orchestration tool, like Apache Airflow.

On the other hand, the main advantage of sync (online) annotation is that it returns results relatively fast for further processing.

@kamilwu (Contributor) commented Feb 25, 2020

Synchronous batch annotation only supports batching <=5 elements at a time, so is not very useful

From my perspective, there should be no problem in sending a request containing up to 5 files to be annotated, then waiting for the result and sending another request.

@EDjur (Contributor, Author) commented Feb 25, 2020

I wonder whether it makes sense to support async (offline) annotation from Beam's perspective. Let's suppose that we don't return anything and AsyncBatchAnnotateImage essentially becomes a sink. In that case, Beam, as a data processing framework, doesn't provide much value. If all the transform does is send a request, there is no point in executing it on multiple Dataflow or Flink workers; we'd rather use a task orchestration tool, like Apache Airflow.

On the other hand, the main advantage of sync (online) annotation is that it returns results relatively fast for further processing.

I agree that those sorts of tasks are better suited to e.g. Airflow; that crossed my mind too.

If you think that batches of <=5 items are useful, then perhaps we should go with sync (online) batch annotation rather than async. But I guess the question then is how much more useful (or efficient) a batch of 5 image annotations is compared to 5 separate annotations.

@kamilwu (Contributor) commented Feb 25, 2020

If you think that batches of <=5 items are useful, then perhaps we should go with sync (online) batch annotation rather than async.

I don't see any obstacles.

Also take a look at the built-in BatchElements PTransform (found in sdks/python/apache_beam/transforms/util.py), which could be useful in our case. This transform consumes a PCollection of elements of type T and produces a PCollection of type List[T]. The maximum size of each list can be configured with the max_batch_size parameter. This would speed things up a bit: it's much better to send one request with 5 files than 5 separate requests.
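
For illustration, a minimal sketch (not code from this PR) of how BatchElements could group image URIs into lists of up to 5 elements before one batched Vision API call per list; the AnnotateImagesFn name and pipeline shape are assumptions:

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

class AnnotateImagesFn(beam.DoFn):
  """Hypothetical DoFn: one batched Vision API request per list of URIs."""

  def process(self, image_uri_batch):
    # A real implementation would call something like
    # client.batch_annotate_images(...) here and yield each response,
    # so downstream steps see individual results rather than lists.
    for uri in image_uri_batch:
      yield {'uri': uri, 'annotations': None}

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create(['gs://bucket/img1.jpg', 'gs://bucket/img2.jpg'])
      | BatchElements(min_batch_size=1, max_batch_size=5)
      | beam.ParDo(AnnotateImagesFn()))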

@EDjur (Contributor, Author) commented Feb 25, 2020

Cheers for the tip about BatchElements. Using that reduces code duplication quite a bit as we can offload the creation of the AnnotateImageRequest to an earlier step in the PTransform, leaving us with only one DoFn for the two Batch transforms.
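
For example, that earlier step might build the request with a simple map function along these lines (a sketch assuming the google-cloud-vision 0.x client and a label-detection feature; the actual PR may construct requests differently):

from google.cloud import vision

def to_annotate_image_request(image_uri):
  # Build one AnnotateImageRequest per GCS image URI, before batching.
  return vision.types.AnnotateImageRequest(
      image=vision.types.Image(
          source=vision.types.ImageSource(image_uri=image_uri)),
      features=[vision.types.Feature(
          type=vision.enums.Feature.Type.LABEL_DETECTION)])

# e.g. ... | beam.Map(to_annotate_image_request) | BatchElements(...) | ...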

@kamilwu (Contributor) commented Feb 25, 2020

Cool, nice to hear that!

One more thing that came to my mind.
Would it be possible to merge the AnnotateImage and BatchAnnotateImage transforms? I think the user has little interest in configuring min_batch_size and max_batch_size because, in the end, responses are flattened (the output PCollection is of type AnnotateImageResponse, not List[AnnotateImageResponse]). Also, I expect that batch_annotate_images with len(requests) == 1 works the same way as annotate_images. But maybe I'm wrong.

@aaltay (Member) commented Feb 25, 2020

Overall direction LGTM. One comment on the setup.py:

  1. Do you need a lower limit?
  2. <0.43.0? If we don't want to automatically pick up the next version.
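
For context, the kind of dependency pin being discussed would look roughly like this in setup.py; the lower bound shown is illustrative and the exact constraint the PR settled on may differ:

REQUIRED_VISION_PACKAGE = [
    # Lower bound: oldest client version actually tested against.
    # Upper bound: avoid automatically picking up the next, potentially
    # breaking, release.
    'google-cloud-vision>=0.38.0,<0.43.0',
]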

@EDjur (Contributor, Author) commented Feb 25, 2020

Overall direction LGTM. One comment on the setup.py:

  1. Do you need a lower limit?
  2. <0.43.0? If we don't want to automatically pick up the next version.

1. Yes, sorry, I'm meaning to test out a few different versions below 0.42.0 before finalising the PR, but I wanted to make sure that the PR was heading in the correct general direction before doing that.

2. Good catch, that's a typo, thanks!

@EDjur (Contributor, Author) commented Feb 26, 2020

Cool, nice to hear that!

One more thing that came to my mind.
Would it be possible to merge the AnnotateImage and BatchAnnotateImage transforms? I think the user has little interest in configuring min_batch_size and max_batch_size because, in the end, responses are flattened (the output PCollection is of type AnnotateImageResponse, not List[AnnotateImageResponse]). Also, I expect that batch_annotate_images with len(requests) == 1 works the same way as annotate_images. But maybe I'm wrong.

Good guess! Indeed, looking at the source code for annotate_image, it is essentially just a wrapper around batch_annotate_images, like so: r = self.batch_annotate_images([request], retry=retry, timeout=timeout).

I'll need to keep the min_batch_size and max_batch_size parameters for testing purposes, in order to get a valid count of the API calls; otherwise the number of calls depends on how BatchElements decides to batch. I'll make sure to document these two parameters as such.
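
To make the shape concrete, a rough sketch (not the PR's actual code; names and signatures here are assumptions) of a merged transform that batches requests, issues one batch_annotate_images call per batch, and flattens the responses, with the batch-size parameters exposed mainly so tests can pin the number of API calls:

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

class _BatchAnnotateImagesFn(beam.DoFn):
  def __init__(self, client_factory):
    # client_factory is injected so tests can pass a mock client.
    self._client_factory = client_factory
    self._client = None

  def start_bundle(self):
    self._client = self._client_factory()

  def process(self, request_batch):
    response = self._client.batch_annotate_images(request_batch)
    # Flatten: yield one AnnotateImageResponse per input request.
    for r in response.responses:
      yield r

class AnnotateImage(beam.PTransform):
  def __init__(self, client_factory, min_batch_size=1, max_batch_size=5):
    self._client_factory = client_factory
    self._min_batch_size = min_batch_size
    self._max_batch_size = max_batch_size

  def expand(self, pcoll):
    return (
        pcoll
        | BatchElements(
            min_batch_size=self._min_batch_size,
            max_batch_size=self._max_batch_size)
        | beam.ParDo(_BatchAnnotateImagesFn(self._client_factory)))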

@EDjur (Contributor, Author) commented Feb 26, 2020

@aaltay I noticed an issue in setup.py for the videointelligence library: it was missing a comma. I added that fix in this PR if that's okay.

I also updated CHANGES.md to specify that the videointelligence features are for the Python SDK.

class VisionTest(unittest.TestCase):
  def setUp(self):
    self._mock_client = mock.Mock()
    self.m2 = mock.Mock()
Review comment (Contributor):

This mock seems to be unused

query_result = result.metrics().query(read_filter)
if query_result['counters']:
  read_counter = query_result['counters'][0]
  self.assertTrue(read_counter.committed == expected_counter)
Review comment (Contributor):

It's safer to use the result property rather than committed, because when committed is empty, attempted is used instead. See the code:

@property
def result(self):
  """Short-hand for falling back to attempted metrics if it seems that
  committed was not populated (e.g. due to not being supported on a given
  runner"""
  return self.committed if self.committed else self.attempted
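
Applied to the snippet under review, the assertion could then read, for example:

query_result = result.metrics().query(read_filter)
if query_result['counters']:
  read_counter = query_result['counters'][0]
  self.assertEqual(read_counter.result, expected_counter)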

@kamilwu (Contributor) commented Feb 26, 2020

LGTM. I've left two small comments, but consider them non-blockers.

Thanks for merging those PTransforms I mentioned and for documenting the batch_size parameters.

@EDjur (Contributor, Author) commented Feb 26, 2020

Thanks for the thorough review!

@mwalenia (Member) commented:

retest this please

@mwalenia merged commit cf3ca68 into apache:master on Feb 26, 2020
