[BEAM-3207] Create a standard location to enumerate and document URNs. #4310

robertwb · 2017-12-21T22:01:21Z

URNs are listed in a markdown file in the pipeline definitions module.
This file is used to auto-generate URN constants for the Python SDK and
validate URN constants in the Java SDK (though eventually it'd be good
to auto-generate them in this case as well).

The format of these common URNs has been normalized to

org.apache.beam:type:name:vN[.M]

SDK-specific URNs are left as they are.
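
As a rough illustration of the normalized format, the check below is a minimal sketch. The regular expression is one assumed reading of `org.apache.beam:type:name:vN[.M]` (lowercase, underscore-separated words for `type` and `name`), and `is_common_urn` is a hypothetical helper, not something this PR adds.

```python
import re

# Assumed interpretation of "org.apache.beam:type:name:vN[.M]":
# a fixed namespace, a type, a name, and a major version with an
# optional minor version.
COMMON_URN = re.compile(
    r"^org\.apache\.beam:"
    r"(?P<type>[a-z][a-z0-9_]*):"
    r"(?P<name>[a-z][a-z0-9_]*):"
    r"v(?P<major>\d+)(?:\.(?P<minor>\d+))?$"
)


def is_common_urn(urn):
    """Return True if urn has the normalized common-URN shape."""
    return COMMON_URN.match(urn) is not None
```

SDK-specific URNs would simply fail this check and be left alone.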

Further fleshing out the definitions and specifications of all these URNs,
as well as making sure they are used ubiquitously, can be deferred to
a later PR now that there is a central location to work from.

robertwb · 2017-12-21T23:30:06Z

Jenkins retest this please. (ERROR: beam8 is offline)

robertwb · 2017-12-22T07:07:53Z

Jenkins retest this please

robertwb · 2018-01-02T20:14:18Z

Jenkins retest this please

kennknowles

This is a good move. I have some comments on the approach just to clarify the direction to take it from here. In particular, we definitely want long-form descriptions of what the transforms do. I'd also like to separate primitive pieces from reserved URNs that are not primitives.

The markdown already serves well enough, but have you thought about how to publish it automatically on the site, and/or include snippets in a section of the contributor guide?

kennknowles · 2018-01-02T22:21:42Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Side input access
+
+### org.apache.beam:sideinput:iterable:v1


nit: in other places words are underscore-separated

kennknowles · 2018-01-02T22:21:43Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+# Apache Beam URNs
+
+This file serves as a central place to enumerate and document the various


Maybe YAML would allow easier parsing and machine-readable association of metadata and commentary with the URNs? This file already suggests a comment and payload field for general description and a machine-readable spec for the payload.

kennknowles · 2018-01-02T22:21:43Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Core Transforms
+
+### org.apache.beam:transform:pardo:v1


Technically, a URN has the form urn:namespace:namespace-specific-section

So, we drop the urn prefix since it is redundant from context, and I'm not sure we really need org.apache.beam as the namespace. It isn't an "authority" section as in other URIs. We can just use beam, as we already do in most of the constants. Formally, a namespace gets registered with IANA, but I don't know that we need to do that. TBH even "beam" is probably redundant when we are putting them into URN fields in a Beam proto.

Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?

Currently we're just requiring exact matches. I'm not sure what a "bugfix" version means for URNs. I suppose minor upgrades could mean adding formerly optional fields to proto-like payloads, but that could be checked for directly as well.

It is pretty common for protocols to use the minor version, but I don't know of any that use the bugfix version.

There's a polarity issue in the higher-order case that seems to also apply here - adding a field to an interface is backwards compatible for consumers, but incompatible for implementers of the interface. But adding an optional field is both backward and forward compatible so you don't need a minor version bump.
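
To make the two matching policies in this thread concrete, here is a minimal sketch. `exact_match` mirrors the "exact matches" behaviour described above. `minor_compatible` is a hypothetical alternative that nothing in this PR implements; it assumes that a missing minor version means `.0` and that a newer minor version can serve requests for an older one.

```python
def split_urn(urn):
    """Split 'prefix:vN[.M]' into (prefix, major, minor).

    Raises ValueError if the URN does not end in a ':vN[.M]' version.
    Assumption: a missing minor version is treated as 0.
    """
    prefix, _, version = urn.rpartition(":v")
    major, _, minor = version.partition(".")
    return prefix, int(major), int(minor or 0)


def exact_match(requested, supported):
    # Policy described in the thread: the strings must be identical.
    return requested == supported


def minor_compatible(requested, supported):
    # Hypothetical policy: same URN and major version, and the supported
    # minor version is at least the requested one.
    r_prefix, r_major, r_minor = split_urn(requested)
    s_prefix, s_major, s_minor = split_urn(supported)
    return (r_prefix, r_major) == (s_prefix, s_major) and r_minor <= s_minor
```

Under exact matching, `v1` and `v1.0` are different URNs. Under the hypothetical minor-version policy they are interchangeable, which is one reason the choice needs to be made once and documented.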

kennknowles · 2018-01-02T22:21:43Z

.gitignore

 sdks/python/NOTICE
 sdks/python/README.md
 sdks/python/apache_beam/portability/api/*pb2*.*
+sdks/python/apache_beam/portability/common_urns.py


Is it necessary to have a generated file? Can't you just reflectively and lazily generate the module/class/constants? (whichever is easiest)
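
For reference, the lazy alternative raised here could look roughly like the sketch below. It assumes each URN appears as a level-three heading in the markdown file, as in the diff context above. `LazyUrns`, the `load_markdown` hook, and the `TYPE_NAME` constant naming are all hypothetical.

```python
import re

# Assumption: every URN is the sole content of a "### " heading line.
_HEADING = re.compile(r"^###[ \t]+(\S+)[ \t]*$", re.MULTILINE)


class LazyUrns(object):
    """Resolves URN constants on first attribute access, not at build time."""

    def __init__(self, load_markdown):
        # load_markdown: zero-argument callable returning the .md text.
        self._load_markdown = load_markdown
        self._by_name = None

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._by_name is None:
            self._by_name = {}
            for urn in _HEADING.findall(self._load_markdown()):
                # Hypothetical naming: drop namespace and version,
                # e.g. ...:transform:pardo:v1 becomes TRANSFORM_PARDO.
                key = "_".join(urn.split(":")[1:-1]).upper()
                self._by_name[key] = urn
        try:
            return self._by_name[name]
        except KeyError:
            raise AttributeError(name)
```

This avoids a generated file, but the markdown still has to be available wherever the package is installed.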

robertwb

Thanks for taking a look. I think it would certainly make sense to put a link to this doc from the site.

We could separate primitive transform URNs from non-primitive ones, but I think it makes the most sense to just document which ones all SDKs are expected to support rather than have two locations to look at.

robertwb · 2018-01-02T22:48:31Z

.gitignore

 sdks/python/NOTICE
 sdks/python/README.md
 sdks/python/apache_beam/portability/api/*pb2*.*
+sdks/python/apache_beam/portability/common_urns.py


I thought about that, but in that case we'd have to copy/distribute the .md file for PyPI anyway (as most users won't be running this from within the GitHub tree), so I figured this approach is easier. (It also has advantages for IDEs, and is similar to what we're doing for proto files and to what we will want to do for Java.)
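
A build-time generator along these lines might be sketched as follows. This is not the code in the PR: the assumption that each URN is a level-three heading comes from the diff context in this thread, while `extract_urns`, `constant_name`, `render_module`, and the `TYPE_NAME` naming scheme are hypothetical.

```python
import re

# Assumption: every URN is the sole content of a "### " heading line.
HEADING = re.compile(r"^###[ \t]+(\S+)[ \t]*$")


def extract_urns(markdown_text):
    """Return the URNs found in level-three headings, in file order."""
    urns = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            urns.append(match.group(1))
    return urns


def constant_name(urn):
    # Hypothetical naming: drop the namespace and the version,
    # e.g. org.apache.beam:side_input:iterable:v1 -> SIDE_INPUT_ITERABLE.
    return "_".join(urn.split(":")[1:-1]).upper()


def render_module(urns):
    """Render the source text of a Python module of URN constants."""
    lines = ["# Generated file; do not edit."]
    for urn in urns:
        lines.append("%s = %r" % (constant_name(urn), urn))
    return "\n".join(lines) + "\n"
```

A build step would write the rendered text to the generated module path, which is why that path is added to `.gitignore`.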

robertwb · 2018-01-02T22:50:56Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+# Apache Beam URNs
+
+This file serves as a central place to enumerate and document the various


Even if we made the comment and payload fields machine-readable, there's not much automated we could do with them. The greater need is to unify and document these URNs, which is why I chose markdown (for easy human production and consumption).

robertwb · 2018-01-02T22:56:25Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Core Transforms
+
+### org.apache.beam:transform:pardo:v1


It was interesting how many different conventions were already being used. I went with org.apache.beam because it was the least ambiguous and agreed with the ones produced by harness code (e.g. the gRPC read/writes).

I don't have any strong opinions here, other than consistency. (We should probably at least have a beam prefix, reserved for things in the beam repo, to make it easy to avoid accidental conflicts.) But I'd like to find/replace them only one more time (which was another reason I went for "org.apache.beam:..." :).

robertwb · 2018-01-03T00:16:32Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Side input access
+
+### org.apache.beam:sideinput:iterable:v1


lukecwik · 2018-01-03T18:11:08Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Side input access
+
+### org.apache.beam:side_input:iterable:v1


This one shouldn't exist; it should be removed.

I was looking at what it would take to remove this from the Python SDK, and given that the existing batch code doesn't support indexable side inputs I think we should offer this option as well. (This seems a better option than generating different graphs when trying to run portably.)

How about we drop it for now and just use it inside the Python SDK till there is a need. I don't want this to make its way out of the Python SDK till there is a strong need for iterable side inputs over the Fn API.

lukecwik · 2018-01-03T18:14:03Z

model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md

+
+## Core Transforms
+
+### org.apache.beam:transform:pardo:v1


Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?

robertwb · 2018-01-05T20:30:18Z

Updated per discussion. PTAL.

robertwb · 2018-01-05T21:50:14Z

Jenkins: retest this please. (Backing channel 'beam8' is disconnected)

robertwb · 2018-01-16T21:44:19Z

Jenkins: retest this please.

robertwb · 2018-01-18T18:45:39Z

Rebased again due to merge conflicts. It would be good to get this in.

kennknowles · 2018-01-19T06:06:09Z

Yea, many apologies for the delay. Review load is very high lately (a good thing).

kennknowles · 2018-01-19T06:08:17Z

It does look like the errors may be caused by things introduced since you wrote this. I'm jumping to that conclusion because I saw mention of URNs in the error messages...

robertwb · 2018-02-02T01:40:14Z

Jenkins: retest this please.

robertwb · 2018-02-02T16:20:12Z

I have resolved the issues with the Dataflow runner, opting to leave those URNs currently hard coded in the workers as they are in this PR (which is already big and painful enough as it is) and defer the worker dance to a future one.

PTAL

kennknowles · 2018-02-03T22:43:32Z

I'm wary about setup.py because "actually programming" in builds is one of the worst sources of unpredictable maintenance burdens, but let's just get this in and iterate because it is definitely useful immediately. The gradle failure is BEAM-3605.

robertwb · 2018-02-05T18:13:30Z

Thanks!

charlesccychen · 2018-02-06T01:52:34Z

sdks/python/apache_beam/transforms/core.py

    self._check_pcollection(pcoll)
    return pvalue.PCollection(pcoll.pipeline)

-  def to_runner_api_parameter(self, context):


Why was this removed? This change conflicts with the changes in #4529.

robertwb assigned kennknowles Dec 21, 2017

robertwb requested a review from kennknowles December 21, 2017 23:58

kennknowles reviewed Jan 2, 2018

robertwb commented Jan 3, 2018

lukecwik reviewed Jan 3, 2018

youngoli mentioned this pull request Jan 16, 2018

[BEAM-3126] Creating flatten operation in Python SDK Harness #4408

Merged

6 tasks

robertwb force-pushed the urns branch from 32406e5 to dde3c90 Compare January 18, 2018 18:45

kennknowles assigned robertwb and unassigned kennknowles Jan 19, 2018

robertwb force-pushed the urns branch 2 times, most recently from d221a25 to f04d2e5 Compare January 31, 2018 21:48

robertwb force-pushed the urns branch 4 times, most recently from 00c247a to b9aa857 Compare February 1, 2018 22:18

Revert URNs that are currently hard-coded in the Dataflow worker.

04c399c

robertwb force-pushed the urns branch from b9aa857 to 04c399c Compare February 2, 2018 00:20

kennknowles merged commit 42ac62a into apache:master Feb 3, 2018

charlesccychen reviewed Feb 6, 2018
