[BEAM-8088] Track PCollection boundedness in python sdk #9426
Just to clarify, this is to have an unbounded PCollection during the expansion of the WriteToPubSub write?
Correct.
@Override
public PDone expand(PCollection<T> input) {
  if (getTopicProvider() == null) {
    throw new IllegalStateException("need to set the topic of a PubsubIO.Write transform");
  }
  switch (input.isBounded()) {
    case BOUNDED:
      input.apply(
          ParDo.of(
              new PubsubBoundedWriter(
                  MoreObjects.firstNonNull(getMaxBatchSize(), MAX_PUBLISH_BATCH_SIZE),
                  MoreObjects.firstNonNull(
                      getMaxBatchBytesSize(), MAX_PUBLISH_BATCH_BYTE_SIZE_DEFAULT))));
      return PDone.in(input.getPipeline());
    case UNBOUNDED:
      return input
          .apply(MapElements.via(getFormatFn()))
          .apply(
              new PubsubUnboundedSink(
                  Optional.ofNullable(getPubsubClientFactory()).orElse(FACTORY),
                  NestedValueProvider.of(getTopicProvider(), new TopicPathTranslator()),
                  getTimestampAttribute(),
                  getIdAttribute(),
                  100 /* numShards */,
                  MoreObjects.firstNonNull(
                      getMaxBatchSize(), PubsubUnboundedSink.DEFAULT_PUBLISH_BATCH_SIZE),
                  MoreObjects.firstNonNull(
                      getMaxBatchBytesSize(),
                      PubsubUnboundedSink.DEFAULT_PUBLISH_BATCH_BYTES)));
  }
  throw new RuntimeException(); // cases are exhaustive.
}
mxm left a comment
That seems fair. The boundedness is part of the proto and thus should also be available from within Python.
Should we add a test for this in pipeline_test.py?
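As a rough illustration of the kind of check such a test could make, here is a minimal sketch that does not import Beam. The names `IsBounded`, `to_wire`, and `from_wire` are hypothetical stand-ins, and the enum's numeric values are assumptions. It is not the test that was added to pipeline_test.py.

```python
import enum


class IsBounded(enum.Enum):
    """Toy enum. Member names follow the Java switch; numeric values are assumed."""
    UNBOUNDED = 1
    BOUNDED = 2


def to_wire(is_bounded: bool) -> IsBounded:
    """Encode a Python-side boolean flag as the toy enum (hypothetical helper)."""
    return IsBounded.BOUNDED if is_bounded else IsBounded.UNBOUNDED


def from_wire(value: IsBounded) -> bool:
    """Decode the toy enum back into a Python-side boolean flag (hypothetical helper)."""
    return value is IsBounded.BOUNDED


def test_boundedness_round_trip() -> None:
    # The flag must survive encoding and decoding unchanged for both values.
    for flag in (True, False):
        assert from_wire(to_wire(flag)) is flag


test_boundedness_round_trip()
```

The point of the round trip is that an unbounded PCollection must not silently come back as bounded after serialization, which is the symptom this pull request reports.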
Force-pushed from 5cedbea to 2e39266.
Done!
Thanks!
As far as I can tell, Python does not care about the boundedness of PCollections, even in streaming mode, but external transforms do. In my ongoing effort to get the PubsubIO external transforms working, I discovered that I could not generate an unbounded write using the expansion service from Python: it always came back as a bounded write.
My pipeline looks like this:
The PCollections returned from the external Read are unbounded, as expected, but Python is responsible for creating the intermediate PCollection, which is always bounded, and thus the external Write generated by Java is always bounded.
If I'm on the right track here, I'll write some tests. I've manually tested it enough to confirm that it gets my external transform tests working.
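The behavior described above can be sketched with a toy model that does not import Beam. Everything here is hypothetical: `ToyPCollection`, `apply_without_tracking`, `apply_with_tracking`, and `choose_write` are illustrative names, and the rule "an output is bounded only if every input is bounded" is an assumption about how tracking could work, not necessarily what this pull request implements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ToyPCollection:
    """Hypothetical stand-in for a PCollection; not the Beam class."""
    name: str
    is_bounded: bool = True


def apply_without_tracking(name: str, inputs: Iterable[ToyPCollection]) -> ToyPCollection:
    """Models the reported behavior: the output is always marked bounded."""
    return ToyPCollection(name)


def apply_with_tracking(name: str, inputs: Iterable[ToyPCollection]) -> ToyPCollection:
    """Assumed rule: the output is bounded only if every input is bounded."""
    return ToyPCollection(name, is_bounded=all(p.is_bounded for p in inputs))


def choose_write(pcoll: ToyPCollection) -> str:
    """Mimics the switch on input.isBounded() in the Java expand() quoted earlier."""
    return "bounded write" if pcoll.is_bounded else "unbounded sink"


# The external Read yields an unbounded PCollection.
read = ToyPCollection("external_read", is_bounded=False)

# An intermediate Python transform sits between the Read and the external Write.
lost = apply_without_tracking("map", [read])
kept = apply_with_tracking("map", [read])

print(choose_write(lost))  # bounded write
print(choose_write(kept))  # unbounded sink
```

Without tracking, the intermediate collection hides the unbounded source, so the Java side picks the bounded writer. With tracking, the flag reaches the external Write and the unbounded sink is selected.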