Contributor

@jklukas jklukas commented Nov 1, 2018

In the Java SDK, the FileSystems.match facilities are aimed primarily at listing file names and collect very limited additional metadata from the filesystem (sizeBytes and isReadSeekEfficient). This PR adds a new lastModified field to that list.

This could be a basis for a future improvement to FileIO.match(...).continuously(...) where we could let the user opt to poll not just for new file names, but also for existing file names if their content has been updated.

In the near term, the addition of lastModified to Metadata will allow users to implement their own polling logic on top of FileSystems.match to detect and download new files from any of the supported filesystems.
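
As a rough illustration of the polling logic described above, the sketch below compares two listings keyed by file name and reports the names that are new or whose last-modified time has advanced. It is only a sketch: `ModifiedFilePoller` and `newOrUpdated` are hypothetical names, and plain maps stand in for the metadata a match call would return.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Beam SDK. */
public class ModifiedFilePoller {

  /**
   * Returns the names in {@code current} that are absent from {@code previous}
   * or whose last-modified timestamp is strictly greater than before.
   * Both maps go from file name to last-modified time in epoch milliseconds.
   */
  public static List<String> newOrUpdated(
      Map<String, Long> previous, Map<String, Long> current) {
    List<String> changed = new ArrayList<>();
    for (Map.Entry<String, Long> entry : current.entrySet()) {
      Long before = previous.get(entry.getKey());
      if (before == null || entry.getValue() > before) {
        changed.add(entry.getKey());
      }
    }
    return changed;
  }

  public static void main(String[] args) {
    Map<String, Long> firstPoll = new LinkedHashMap<>();
    firstPoll.put("a.csv", 1000L);
    firstPoll.put("b.csv", 1000L);

    Map<String, Long> secondPoll = new LinkedHashMap<>();
    secondPoll.put("a.csv", 1000L); // unchanged
    secondPoll.put("b.csv", 2500L); // rewritten since the first poll
    secondPoll.put("c.csv", 2000L); // newly created

    // Prints [b.csv, c.csv]
    System.out.println(newOrUpdated(firstPoll, secondPoll));
  }
}
```

A real poller would also need to decide how to treat the "unknown" timestamp value, since not every filesystem may report one.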


Contributor Author

jklukas commented Nov 1, 2018

I'm expecting some tests to fail initially. Seeing what fails will let me know what tests need to be updated.

@jklukas jklukas force-pushed the filesystem-meta-last-modified branch 3 times, most recently from 1c58a88 to 5610fbe Compare November 2, 2018 12:34
Contributor Author

jklukas commented Nov 2, 2018

retest this please

Contributor Author

jklukas commented Nov 2, 2018

retest this please

@iemejia iemejia self-requested a review November 5, 2018 14:33
Member

iemejia commented Nov 5, 2018

Just a comment: the PR looks great, but I'm waiting for the discussion on @dev to see what the recommended path is for the MetadataCoder update.

Contributor Author

jklukas commented Nov 5, 2018

Thanks for reviewing, @iemejia. I'm very interested to see what comes out of the discussion on evolving coders.

Member

iemejia commented Nov 6, 2018

Yes, hopefully it won't take long, but it will probably require a rebase (assuming that we will version the coders and save this somewhere).

@jklukas jklukas force-pushed the filesystem-meta-last-modified branch 3 times, most recently from 60702bd to 6d1c746 Compare December 26, 2018 22:13
Contributor Author

jklukas commented Dec 27, 2018

@iemejia - The mailing list discussion looks to have reached a conclusion that there's no short-term solution for coder versioning, so we need to be conservative about compatibility.

I've added a new commit here that returns MetadataCoder to compatibility with the existing format, providing a default value for lastModifiedMillis. We introduce a MetadataCoderV2 that encodes lastModifiedMillis, but is strictly opt-in; it's not used directly by the SDK at all.

Can you take another look and see if this looks good to merge?

Member

iemejia commented Jan 8, 2019

Sure @jklukas, sorry for the delay. I will take a look ASAP (just back from the Christmas holidays).

Member

iemejia commented Jan 9, 2019

Oops, just realized I reviewed the wrong one, hehe. Never mind, I will start to check this one now. Sorry.

Contributor Author

jklukas commented Jan 9, 2019

Oops, just realized I reviewed the wrong one, hehe. Never mind, I will start to check this one now. Sorry.

😆

I really appreciate the reviews, @iemejia, and I'm glad you were able to take time off over the holidays.

Member

iemejia commented Jan 11, 2019

Sorry for the delay @jklukas, I was a bit busy with other stuff plus finishing other reviews. Your PR is next in line.

@jklukas jklukas force-pushed the filesystem-meta-last-modified branch from 6d1c746 to 8e9177a Compare January 16, 2019 17:30
Member

@iemejia iemejia left a comment

Found some minor issues. Please fix and we are almost done.

Can you please add some tests (at least one) to ensure the correct behavior of coding/decoding a Metadata object with MetadataCoderV2? See CoderProperties.structuralValueDecodeEncodeEqual(CODER, value); for reference.

      .setIsReadSeekEfficient(isReadSeekEfficient)
      .setSizeBytes(sizeBytes)
-     .build();
+     .setLastModifiedMillis(UNKNOWN_LAST_MODIFIED_MILLIS);
Member

I think this line breaks backwards compatibility (or is at least inconsistent with the encoder), we should remove this and do it only in the V2 version.

Contributor Author

The idea here is that lastModifiedMillis must have some default value. I've chosen here to use -1 so that lastModifiedMillis doesn't have to be nullable, though we can discuss if null makes more sense here.

AutoValue requires that values be set for all members. To define a default value, the AutoValue docs suggest a pattern of calling setters before returning the builder. That's what's going on here.
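
The default-in-builder pattern mentioned above can be sketched without AutoValue as follows. `FileInfo` is a hypothetical stand-in for the generated value class, and the hand-written builder only approximates what AutoValue would generate. The point is that the static `builder()` factory presets the optional field, so callers that never set it still get a buildable object.

```java
/** Hypothetical value class illustrating a builder with a preset default. */
public final class FileInfo {
  /** Default used when the timestamp is unknown (an assumption for this sketch). */
  public static final long UNKNOWN_LAST_MODIFIED_MILLIS = -1L;

  private final long sizeBytes;
  private final long lastModifiedMillis;

  private FileInfo(long sizeBytes, long lastModifiedMillis) {
    this.sizeBytes = sizeBytes;
    this.lastModifiedMillis = lastModifiedMillis;
  }

  public long sizeBytes() {
    return sizeBytes;
  }

  public long lastModifiedMillis() {
    return lastModifiedMillis;
  }

  /** Presets the default before handing the builder to the caller. */
  public static Builder builder() {
    return new Builder().setLastModifiedMillis(UNKNOWN_LAST_MODIFIED_MILLIS);
  }

  public static final class Builder {
    // Boxed so that null can mean "never set", mimicking a required-property check.
    private Long sizeBytes;
    private Long lastModifiedMillis;

    public Builder setSizeBytes(long value) {
      this.sizeBytes = value;
      return this;
    }

    public Builder setLastModifiedMillis(long value) {
      this.lastModifiedMillis = value;
      return this;
    }

    public FileInfo build() {
      if (sizeBytes == null || lastModifiedMillis == null) {
        throw new IllegalStateException("Missing required properties");
      }
      return new FileInfo(sizeBytes, lastModifiedMillis);
    }
  }
}
```

With this shape, `FileInfo.builder().setSizeBytes(10L).build()` succeeds and reports the default timestamp, while a caller that knows the real value can still override it.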

I don't see that there's a backwards incompatibility here. MetadataCoder still only writes and reads the three existing values (resource id, int, and long). When encoding, it throws away lastModifiedMillis and when decoding, it provides the default -1 value.

Do you have thoughts on whether null would be preferred as the default vs. -1?

Contributor Author

Of note, Java's File.lastModified documents that 0 is used if the file does not exist or an I/O error occurs. We could follow that and use 0.
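
For reference, the documented `File.lastModified` behavior can be checked with a few lines. This is a minimal sketch; `LastModifiedDefault` is a hypothetical class name and the path is assumed not to exist.

```java
import java.io.File;

/** Hypothetical demo class; shows the 0L result for a missing file. */
public class LastModifiedDefault {

  /** Returns the last-modified time in epoch millis, or 0L if the file is missing. */
  public static long lastModifiedOf(String path) {
    return new File(path).lastModified();
  }

  public static void main(String[] args) {
    // Assumes no file with this generated name exists in the working directory.
    String missing = "no-such-file-" + System.nanoTime() + ".tmp";
    System.out.println(lastModifiedOf(missing));
  }
}
```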

Member

What if we use zero as the default? In that case we won't need that and we will still be OK with AutoValue. If we set up any default, I think it is better to do it at the core object level, in Metadata.create(), rather than in other places. Also in the current case if you put this only in decode then encode won't put the right value. Better to leave this class as it was to avoid the risk of breaking stuff.

Contributor Author

What if we use zero as the default? In that case we won't need that and we will still be OK with AutoValue

To be clear, 0 is only used as a default for File.lastModified as used in LocalFileSystem. Other filesystems do not have this same concept of 0 as default. So any way we go, we will have to explicitly set some default value for the field.

I agree with you that it's better to set the default within Metadata rather than only in the coder, so I will make that change. And will also change the default to be 0 to be consistent with File.lastModified behavior.

Also in the current case if you put this only in decode then encode won't put the right value

Exactly. encode does not encode the lastModifiedMillis value; it's thrown away. And decode provides a default. So encode and decode remain matched in the number and types of encoded values, and these are compatible with previous beam versions.
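
The compatibility argument can be sketched as below. This is not Beam's wire format: `CoderCompatSketch`, `Meta`, and the fixed-width layout are hypothetical simplifications, and the default of 0 is assumed from the discussion above. The V1 pair writes and reads only the three original fields, while the V2 pair appends the timestamp.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified stand-ins for the two coders. */
public class CoderCompatSketch {
  public static final long DEFAULT_LAST_MODIFIED_MILLIS = 0L;

  public static final class Meta {
    public final String resourceId;
    public final boolean readSeekEfficient;
    public final long sizeBytes;
    public final long lastModifiedMillis;

    public Meta(String resourceId, boolean readSeekEfficient,
        long sizeBytes, long lastModifiedMillis) {
      this.resourceId = resourceId;
      this.readSeekEfficient = readSeekEfficient;
      this.sizeBytes = sizeBytes;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  /** V1: writes only the three original fields; lastModifiedMillis is dropped. */
  public static byte[] encodeV1(Meta m) {
    byte[] id = m.resourceId.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(4 + id.length + 4 + 8);
    buf.putInt(id.length).put(id);
    buf.putInt(m.readSeekEfficient ? 1 : 0).putLong(m.sizeBytes);
    return buf.array();
  }

  /** V1: reads the three original fields and fills in the default timestamp. */
  public static Meta decodeV1(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    byte[] id = new byte[buf.getInt()];
    buf.get(id);
    boolean efficient = buf.getInt() == 1;
    long size = buf.getLong();
    return new Meta(new String(id, StandardCharsets.UTF_8), efficient, size,
        DEFAULT_LAST_MODIFIED_MILLIS);
  }

  /** V2: the V1 bytes followed by lastModifiedMillis. */
  public static byte[] encodeV2(Meta m) {
    byte[] v1 = encodeV1(m);
    return ByteBuffer.allocate(v1.length + 8)
        .put(v1).putLong(m.lastModifiedMillis).array();
  }

  /** V2: reads the V1 prefix, then the trailing timestamp. */
  public static Meta decodeV2(byte[] data) {
    Meta base = decodeV1(data); // only consumes the V1 prefix
    long lastModified = ByteBuffer.wrap(data, data.length - 8, 8).getLong();
    return new Meta(base.resourceId, base.readSeekEfficient,
        base.sizeBytes, lastModified);
  }

  public static void main(String[] args) {
    Meta original = new Meta("/tmp/file.txt", true, 42L, 1547700000000L);
    // V1 round trip loses the timestamp: prints 0
    System.out.println(decodeV1(encodeV1(original)).lastModifiedMillis);
    // V2 round trip keeps it: prints 1547700000000
    System.out.println(decodeV2(encodeV2(original)).lastModifiedMillis);
  }
}
```

Because the V1 encoder's output does not depend on the timestamp at all, its bytes are the same whether or not the new field exists, which is the property the thread relies on.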

 * MetadataCoderV2} for retaining timestamp information.
 */
public class MetadataCoder extends AtomicCoder<Metadata> {
  public static final long UNKNOWN_LAST_MODIFIED_MILLIS = -1L;
Member

@iemejia iemejia Jan 17, 2019

Remove; see comment above.

Contributor Author

jklukas commented Jan 17, 2019

Fixed the typos, converted both MetadataCoder and MetadataCoderV2 to singletons, and added tests.

I responded to the comment on UNKNOWN_LAST_MODIFIED_MILLIS. The tests exercise the new code and validate that the behavior is what I intended (MetadataCoder discards lastModifiedMillis when encoding and provides a default value on decoding). I believe that proves backwards compatibility, but do let me know if there's some additional nuance I'm missing there.

@jklukas jklukas force-pushed the filesystem-meta-last-modified branch from 0c5cb55 to f4118ef Compare January 17, 2019 16:43
Contributor Author

jklukas commented Jan 17, 2019

Pushed a new commit that uses 0 for the default, moves the default into Metadata, and documents the behavior.

@jklukas jklukas force-pushed the filesystem-meta-last-modified branch from 919b0b7 to 4ae354b Compare January 17, 2019 17:38
@jklukas jklukas force-pushed the filesystem-meta-last-modified branch from 4ae354b to 6bc7794 Compare January 17, 2019 19:22
Member

@iemejia iemejia left a comment

LGTM. Fantastic work @jklukas! Thanks for fixing this omission!

iemejia added a commit that referenced this pull request Jan 18, 2019
Member

iemejia commented Jan 18, 2019

Merged manually to squash the commits.
