-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-3207] Create a standard location to enumerate and document URNs. #4310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Jenkins retest this please. (ERROR: beam8 is offline) |
|
Jenkins retest this please |
1 similar comment
|
Jenkins retest this please |
kennknowles
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good move. I have some comments on the approach just to clarify the direction to take it from here. In particular, we definitely want long-form descriptions of what the transforms do. I'd also like to separate primitive pieces from reserved URNs that are not primitives.
The markdown already serves OK but have you thought about how to automatically publish on the site and/or snippets in a section in the contributor guide?
|
|
||
| ## Side input access | ||
|
|
||
| ### org.apache.beam:sideinput:iterable:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in other places words are underscore-separated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
|
|
||
| # Apache Beam URNs | ||
|
|
||
| This file serves as a central place to enumerate and document the various |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe YAML would allow easier parsing and machine-readable association of metadata and commentary with the URNs? This file already suggests a comment and payload field for general description and a machine-readable spec for the payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we made the comment and payload fields machine-readable, there's not much automated we could do with them. The greater need is to unify and document these urns, which is why I chose markdown (for easy human production and consumption).
|
|
||
| ## Core Transforms | ||
|
|
||
| ### org.apache.beam:transform:pardo:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, a URN has the form urn: namespace : namespace-specific section
So, we drop the urn since it is redundant from context and I'm not sure we really need org.apache.beam as the namespace. It isn't an "authority" section as in other URIs. We can just use beam as we already have in most of the constants. The actual process is just that you get it registered with IANA here but I don't know that we need to. TBH even "beam" is probably redundant when we are putting them into URN fields in a Beam proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was interesting how many different conventions were already being used. I went with org.apache.beam because it was the least ambiguous and agreed with the ones produced by harness code (e.g. the GRPC read/writes).
I don't have any strong opinions here, other than consistency. (We should probably at least have a beam prefix, reserved for things in the beam repo, to make it easy to avoid accidental conflicts.) But I'd like to find/replace them only one more time (which was another reason I went for "org.apache.beam:..." :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we're just requiring exact matches. I'm not sure what a "bugfix" version means for URNs, I suppose minor upgrades could mean adding formerly optional fields to proto-like payloads, but that could be checked for directly as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is pretty common for protocols to use the minor version, but I don't know of any that use the bugfix version.
There's a polarity issue in the higher-order case that seems to also apply here - adding a field to an interface is backwards compatible for consumers, but incompatible for implementers of the interface. But adding an optional field is both backward and forward compatible so you don't need a minor version bump.
| sdks/python/NOTICE | ||
| sdks/python/README.md | ||
| sdks/python/apache_beam/portability/api/*pb2*.* | ||
| sdks/python/apache_beam/portability/common_urns.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have a generated file? Can't you just reflectively and lazily generate the module/class/constants? (whichever is easiest)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but in that case we'd have to copy/distribute the .md file for pypi anyways (as most users won't be running this from within the github tree) so I figured this approach is easier. (Also has advantages for IDEs, and is similar to what we're doing for proto files and will want to do for Java.)
robertwb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look. I think it would certainly make sense to put a link to this doc from the site.
We could separate primitive transforms URNs from non-primitive ones, but I think it's make the most sense to just document which ones are expected for all SDKs to support rather than have two locations to look at.
| sdks/python/NOTICE | ||
| sdks/python/README.md | ||
| sdks/python/apache_beam/portability/api/*pb2*.* | ||
| sdks/python/apache_beam/portability/common_urns.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but in that case we'd have to copy/distribute the .md file for pypi anyways (as most users won't be running this from within the github tree) so I figured this approach is easier. (Also has advantages for IDEs, and is similar to what we're doing for proto files and will want to do for Java.)
|
|
||
| # Apache Beam URNs | ||
|
|
||
| This file serves as a central place to enumerate and document the various |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we made the comment and payload fields machine-readable, there's not much automated we could do with them. The greater need is to unify and document these urns, which is why I chose markdown (for easy human production and consumption).
|
|
||
| ## Core Transforms | ||
|
|
||
| ### org.apache.beam:transform:pardo:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was interesting how many different conventions were already being used. I went with org.apache.beam because it was the least ambiguous and agreed with the ones produced by harness code (e.g. the GRPC read/writes).
I don't have any strong opinions here, other than consistency. (We should probably at least have a beam prefix, reserved for things in the beam repo, to make it easy to avoid accidental conflicts.) But I'd like to find/replace them only one more time (which was another reason I went for "org.apache.beam:..." :).
|
|
||
| ## Side input access | ||
|
|
||
| ### org.apache.beam:sideinput:iterable:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
|
|
||
| ## Side input access | ||
|
|
||
| ### org.apache.beam:side_input:iterable:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one shouldn't exist/be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at what it would take to remove this from the Python SDK, and given that the existing batch code doesn't support indexable side inputs I think we should offer this option as well. (This seems a better option than generating different graphs when trying to run portably.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we drop it for now and just use it inside the Python SDK till there is a need. I don't want this to make its way out of the Python SDK till there is a strong need for iterable side inputs over the Fn API.
|
|
||
| ## Core Transforms | ||
|
|
||
| ### org.apache.beam:transform:pardo:v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?
|
Updated per discussion. PTAL. |
|
Jenkins: retest this please. (Backing channel 'beam8' is disconnected) |
|
Jenkins: retest this please. |
|
Rebased again due to merge conflicts. It would be good to get this in. |
|
Yea, many apologies for the delay. Review load is very high lately (a good thing). |
|
It does look like the errors may be caused by things introduced since you wrote this. I'm jumping to that conclusion because I saw mention of URNs in the error messages... |
d221a25 to
f04d2e5
Compare
URNs are listed in a markdown file in the pipeline definitions module.
This file is used to auto-generate URN constants for the Python SDK and
validate URN constants in the Java SDK (though eventually it'd be good
to auto-generate them in this case as well).
The format of these common URNs has been normalized to
org.apache.beam:type:name:vN[.M]
SDK-specific URNs are left as they are.
Further fleshing out the definitions and specifications of all these URNS,
as well as making sure they are used ubiquitiously, can be deferred to
a later PR now that there is a central location to work from.
00c247a to
b9aa857
Compare
|
Jenkins: retest this please. |
|
I have resolved the issues with the Dataflow runner, opting to leave those URNs currently hard coded in the workers as they are in this PR (which is already big and painful enough as it is) and defer the worker dance to a future one. PTAL |
|
I'm wary about |
|
Thanks! |
| self._check_pcollection(pcoll) | ||
| return pvalue.PCollection(pcoll.pipeline) | ||
|
|
||
| def to_runner_api_parameter(self, context): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this removed? This change conflicts with the changes in #4529.
URNs are listed in a markdown file in the pipeline definitions module.
This file is used to auto-generate URN constants for the Python SDK and
validate URN constants in the Java SDK (though eventually it'd be good
to auto-generate them in this case as well).
The format of these common URNs has been normalized to
SDK-specific URNs are left as they are.
Further fleshing out the definitions and specifications of all these URNS,
as well as making sure they are used ubiquitiously, can be deferred to
a later PR now that there is a central location to work from.
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue.mvn clean verifyto make sure basic checks pass. A more thorough check will be performed on your pull request automatically.