Skip to content

Conversation

@robertwb
Copy link
Contributor

@robertwb robertwb commented Dec 21, 2017

URNs are listed in a markdown file in the pipeline definitions module.
This file is used to auto-generate URN constants for the Python SDK and
validate URN constants in the Java SDK (though eventually it'd be good
to auto-generate them in this case as well).

The format of these common URNs has been normalized to

org.apache.beam:type:name:vN[.M]

SDK-specific URNs are left as they are.

Further fleshing out the definitions and specifications of all these URNS,
as well as making sure they are used ubiquitiously, can be deferred to
a later PR now that there is a central location to work from.

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@robertwb
Copy link
Contributor Author

Jenkins retest this please. (ERROR: beam8 is offline)

@robertwb robertwb requested a review from kennknowles December 21, 2017 23:58
@robertwb
Copy link
Contributor Author

Jenkins retest this please

1 similar comment
@robertwb
Copy link
Contributor Author

robertwb commented Jan 2, 2018

Jenkins retest this please

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good move. I have some comments on the approach just to clarify the direction to take it from here. In particular, we definitely want long-form descriptions of what the transforms do. I'd also like to separate primitive pieces from reserved URNs that are not primitives.

The markdown already serves OK but have you thought about how to automatically publish on the site and/or snippets in a section in the contributor guide?


## Side input access

### org.apache.beam:sideinput:iterable:v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in other places words are underscore-separated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.


# Apache Beam URNs

This file serves as a central place to enumerate and document the various
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe YAML would allow easier parsing and machine-readable association of metadata and commentary with the URNs? This file already suggests a comment and payload field for general description and a machine-readable spec for the payload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we made the comment and payload fields machine-readable, there's not much automated we could do with them. The greater need is to unify and document these urns, which is why I chose markdown (for easy human production and consumption).


## Core Transforms

### org.apache.beam:transform:pardo:v1
Copy link
Member

@kennknowles kennknowles Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, a URN has the form urn: namespace : namespace-specific section

So, we drop the urn since it is redundant from context and I'm not sure we really need org.apache.beam as the namespace. It isn't an "authority" section as in other URIs. We can just use beam as we already have in most of the constants. The actual process is just that you get it registered with IANA here but I don't know that we need to. TBH even "beam" is probably redundant when we are putting them into URN fields in a Beam proto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was interesting how many different conventions were already being used. I went with org.apache.beam because it was the least ambiguous and agreed with the ones produced by harness code (e.g. the GRPC read/writes).

I don't have any strong opinions here, other than consistency. (We should probably at least have a beam prefix, reserved for things in the beam repo, to make it easy to avoid accidental conflicts.) But I'd like to find/replace them only one more time (which was another reason I went for "org.apache.beam:..." :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we're just requiring exact matches. I'm not sure what a "bugfix" version means for URNs, I suppose minor upgrades could mean adding formerly optional fields to proto-like payloads, but that could be checked for directly as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is pretty common for protocols to use the minor version, but I don't know of any that use the bugfix version.

There's a polarity issue in the higher-order case that seems to also apply here - adding a field to an interface is backwards compatible for consumers, but incompatible for implementers of the interface. But adding an optional field is both backward and forward compatible so you don't need a minor version bump.

sdks/python/NOTICE
sdks/python/README.md
sdks/python/apache_beam/portability/api/*pb2*.*
sdks/python/apache_beam/portability/common_urns.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have a generated file? Can't you just reflectively and lazily generate the module/class/constants? (whichever is easiest)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, but in that case we'd have to copy/distribute the .md file for pypi anyways (as most users won't be running this from within the github tree) so I figured this approach is easier. (Also has advantages for IDEs, and is similar to what we're doing for proto files and will want to do for Java.)

Copy link
Contributor Author

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. I think it would certainly make sense to put a link to this doc from the site.

We could separate primitive transforms URNs from non-primitive ones, but I think it's make the most sense to just document which ones are expected for all SDKs to support rather than have two locations to look at.

sdks/python/NOTICE
sdks/python/README.md
sdks/python/apache_beam/portability/api/*pb2*.*
sdks/python/apache_beam/portability/common_urns.py
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, but in that case we'd have to copy/distribute the .md file for pypi anyways (as most users won't be running this from within the github tree) so I figured this approach is easier. (Also has advantages for IDEs, and is similar to what we're doing for proto files and will want to do for Java.)


# Apache Beam URNs

This file serves as a central place to enumerate and document the various
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we made the comment and payload fields machine-readable, there's not much automated we could do with them. The greater need is to unify and document these urns, which is why I chose markdown (for easy human production and consumption).


## Core Transforms

### org.apache.beam:transform:pardo:v1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was interesting how many different conventions were already being used. I went with org.apache.beam because it was the least ambiguous and agreed with the ones produced by harness code (e.g. the GRPC read/writes).

I don't have any strong opinions here, other than consistency. (We should probably at least have a beam prefix, reserved for things in the beam repo, to make it easy to avoid accidental conflicts.) But I'd like to find/replace them only one more time (which was another reason I went for "org.apache.beam:..." :).


## Side input access

### org.apache.beam:sideinput:iterable:v1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.


## Side input access

### org.apache.beam:side_input:iterable:v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one shouldn't exist/be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at what it would take to remove this from the Python SDK, and given that the existing batch code doesn't support indexable side inputs I think we should offer this option as well. (This seems a better option than generating different graphs when trying to run portably.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we drop it for now and just use it inside the Python SDK till there is a need. I don't want this to make its way out of the Python SDK till there is a strong need for iterable side inputs over the Fn API.


## Core Transforms

### org.apache.beam:transform:pardo:v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we follow semantic versioning for the version numbers similar to how we are releasing the SDK versions (e.g. org.apache.beam:transform:pardo:1.0.0)?

@robertwb
Copy link
Contributor Author

robertwb commented Jan 5, 2018

Updated per discussion. PTAL.

@robertwb
Copy link
Contributor Author

robertwb commented Jan 5, 2018

Jenkins: retest this please. (Backing channel 'beam8' is disconnected)

@robertwb
Copy link
Contributor Author

Jenkins: retest this please.

@robertwb
Copy link
Contributor Author

Rebased again due to merge conflicts. It would be good to get this in.

@kennknowles
Copy link
Member

Yea, many apologies for the delay. Review load is very high lately (a good thing).

@kennknowles
Copy link
Member

It does look like the errors may be caused by things introduced since you wrote this. I'm jumping to that conclusion because I saw mention of URNs in the error messages...

@kennknowles kennknowles assigned robertwb and unassigned kennknowles Jan 19, 2018
@robertwb robertwb force-pushed the urns branch 2 times, most recently from d221a25 to f04d2e5 Compare January 31, 2018 21:48
URNs are listed in a markdown file in the pipeline definitions module.
This file is used to auto-generate URN constants for the Python SDK and
validate URN constants in the Java SDK (though eventually it'd be good
to auto-generate them in this case as well).

The format of these common URNs has been normalized to

    org.apache.beam:type:name:vN[.M]

SDK-specific URNs are left as they are.

Further fleshing out the definitions and specifications of all these URNS,
as well as making sure they are used ubiquitiously, can be deferred to
a later PR now that there is a central location to work from.
@robertwb robertwb force-pushed the urns branch 4 times, most recently from 00c247a to b9aa857 Compare February 1, 2018 22:18
@robertwb
Copy link
Contributor Author

robertwb commented Feb 2, 2018

Jenkins: retest this please.

@robertwb
Copy link
Contributor Author

robertwb commented Feb 2, 2018

I have resolved the issues with the Dataflow runner, opting to leave those URNs currently hard coded in the workers as they are in this PR (which is already big and painful enough as it is) and defer the worker dance to a future one.

PTAL

@kennknowles
Copy link
Member

I'm wary about setup.py because "actually programming" in builds is one of the worst sources of unpredictable maintenance burdens, but let's just get this in and iterate because it is definitely useful immediately. The gradle failure is BEAM-3605.

@kennknowles kennknowles merged commit 42ac62a into apache:master Feb 3, 2018
@robertwb
Copy link
Contributor Author

robertwb commented Feb 5, 2018

Thanks!

self._check_pcollection(pcoll)
return pvalue.PCollection(pcoll.pipeline)

def to_runner_api_parameter(self, context):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? This change conflicts with the changes in #4529.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants