[BEAM-53] Wire PubsubUnbounded{Source,Sink} into PubsubIO #346
Conversation
```java
context.addStep(transform, "ParallelRead");
context.addInput(PropertyNames.FORMAT, "pubsub");
if (transform.getTopic() != null) {
  context.addInput(PropertyNames.PUBSUB_TOPIC, transform.getTopic().asV1Beta1Path());
```
is this no longer supported? is this a regression of some sort?
Only the Dataflow service has a notion of 'clean up when the job is done', and so only it can delete the random subscriptions created when PubsubIO.Read is given only a topic. So PubsubUnboundedSource requires a subscription, and PubsubIO.Read requires it to be specified.
But then this translation approach won't work, since the topic has not been captured by the PubsubUnboundedSource. Dang, missed that, thanks for noticing. Perhaps I can look at the parent PubsubIO.Read PTransform?
I think that we should leave in the ability to create a subscription on demand, but record an ugly log message that there will be a leftover subscription, and proceed. We have finalizers/fault tolerance on the pipeline roadmap, but it's not here yet.
I feel like there's a JIRA issue for this general problem, but I cannot find it.
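Something like the following sketch, perhaps. The `SubscriptionCreator` interface stands in for whatever client the source actually uses to create subscriptions; none of these names are the real PubsubIO internals.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the log-and-proceed idea: create a one-off subscription when only
// a topic was given, and warn loudly that nothing will ever clean it up.
class LeakySubscriptions {
  private static final Logger LOG = LoggerFactory.getLogger(LeakySubscriptions.class);

  // Placeholder for the actual Pubsub client call.
  interface SubscriptionCreator {
    String createRandomSubscription(String project, String topic, int ackDeadlineSeconds);
  }

  static String subscriptionFor(SubscriptionCreator client, String project, String topic) {
    String subscription = client.createRandomSubscription(project, topic, 60);
    // The "ugly log message": this subscription outlives the pipeline.
    LOG.warn(
        "Created subscription {} to topic {}. This subscription will NOT be deleted when the "
            + "pipeline terminates; delete it manually to avoid charges.",
        subscription, topic);
    return subscription;
  }
}
```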
Ugly, but I now capture both the subscription and topic, and defer the check that we're given a subscription to PubsubUnboundedSource.apply.
I really don't like it.
The batch direct runner creates the subscription on one worker when the pipeline begins data processing, and cleans it up when done.
The streaming Google Cloud Dataflow runner creates the subscription on dax when the job is set up, and cleans it up when the job is terminated.
The streaming Java-only runner would create it when the pipeline graph is constructed, and would never clean it up.
The lack of cleanup, plus the differences in when creation happens, all feels wrong...
... but I'll do it anyway.
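For reference, a minimal sketch of the capture-both-and-defer shape being discussed, using the 2016-era `apply()` override. The class layout and helper are assumptions, not the actual PubsubIO code:

```java
import javax.annotation.Nullable;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// Sketch: keep both the topic and the (possibly absent) subscription on the
// composite transform, so the Dataflow translator can still read the topic
// while the subscription check is deferred to apply().
class Read extends PTransform<PBegin, PCollection<String>> {
  @Nullable private final String topic;
  @Nullable private final String subscription;

  Read(@Nullable String topic, @Nullable String subscription) {
    this.topic = topic;
    this.subscription = subscription;
  }

  @Nullable String getTopic() {
    return topic;  // read by the Dataflow translator for the native override
  }

  @Override
  public PCollection<String> apply(PBegin input) {
    // Deferred check (or on-demand creation, with the warning shown earlier):
    // construction never fails, so a runner that overrides this transform
    // never needs the subscription at graph-construction time.
    if (subscription == null) {
      throw new IllegalStateException("No subscription given for topic " + topic);
    }
    return readFrom(input, subscription);
  }

  // Stand-in for wiring up the actual PubsubUnboundedSource.
  private PCollection<String> readFrom(PBegin input, String subscription) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```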
Keep pinging me or @peihe as you get tests going and if you have trouble with the translator.
No luck running against apache_beam-on-google runner (NoClassDefFound exceptions).
Ok I'm back after getting the Google Cloud Dataflow runner rewrites working.
```java
// In streaming mode must use either the custom Pubsub unbounded source/sink or
// defer to Windmill's built-in implementation.
builder.put(PubsubIO.Read.Bound.PubsubBoundedReader.class, UnsupportedIO.class);
builder.put(PubsubIO.Write.Bound.PubsubBoundedWriter.class, UnsupportedIO.class);
```
These are both DoFns, not PTransforms, so I do not think this will have any effect.
@kennknowles any suggestions?
I added support for this in UnsupportedIO below.
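For context, a sketch of what DoFn-aware rejection could look like. The `ParDo.Bound` accessor and the set name are assumptions about that era's SDK, not the actual change:

```java
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.PTransform;

class UnsupportedIOCheck {
  // Sketch: when deciding whether a transform is unsupported, also consider
  // the class of the DoFn wrapped by a ParDo, since the override map may now
  // be keyed by DoFn classes (e.g. PubsubIO.Read.Bound.PubsubBoundedReader).
  static void rejectIfUnsupported(
      PTransform<?, ?> transform, Set<Class<?>> unsupportedClasses, boolean streaming) {
    Class<?> key = transform.getClass();
    if (transform instanceof ParDo.Bound) {
      key = ((ParDo.Bound<?, ?>) transform).getFn().getClass();
    }
    if (unsupportedClasses.contains(key)) {
      throw new UnsupportedOperationException(String.format(
          "The Dataflow runner does not support %s in %s mode",
          key.getSimpleName(), streaming ? "streaming" : "batch"));
    }
  }
}
```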
ack
PTAL
Confirmed working with beam-on-dataflow + beam-worker for both java-only and internal pubsub sources/sinks.
| ? "streaming" : "batch"; | ||
| String name = | ||
| transform == null ? approximateSimpleName(doFn.getClass()) : | ||
| approximatePTransformName(transform.getClass()); |
Fix wrapping here? The `:` should be at the start of the line.
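i.e., presumably something like:

```java
String name = transform == null
    ? approximateSimpleName(doFn.getClass())
    : approximatePTransformName(transform.getClass());
```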
LGTM. I can fix the `:` and then merge.
Thanks!
This also refines the handling of record ids in the sink to be random-but-reused-on-failure, using the same trick as we do for the BigQuery sink.
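A rough sketch of that idea (illustrative names and the 2016-era DoFn style, not the actual sink code): assign the random id in a step of its own before the publish, so that a retry of the publish step replays the same, already-checkpointed ids and duplicates can be dropped by id.

```java
import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch: tag each outgoing message with a random id in a separate stage.
// The runner checkpoints this stage's output, so if the downstream publish
// stage fails and retries, it sees the same ids rather than fresh ones.
class TagWithRecordId extends DoFn<String, KV<String, String>> {
  @Override
  public void processElement(ProcessContext c) {
    c.output(KV.of(UUID.randomUUID().toString(), c.element()));
  }
}
```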
Still need to re-do the load tests I did a few weeks back with the actual change.
Note that the last time I tested, the DataflowPipelineTranslator did not kick in and replace the new transforms with the correct native transforms. I need to dig deeper.
R: @dhalperi