
Overhaul of SegmentPullers to add consistency and retries#1132

Merged
fjy merged 1 commit into apache:master from metamx:uriSegmentLoaders
Mar 30, 2015

Conversation

@drcrallen
Contributor

The goal of this is to make the segment pullers a little more general purpose. Specifically, this uses URIs when possible. URIs have a standard, well-defined format (unlike the free-form loadSpec used previously). This also unifies a lot of the way segments are pulled, so that any datasource should be able to support a zip or gz request in a more standard manner.

General workflow goes like this:

  1. LoadSpec makes sure the correct Puller is called with the correct parameters. This offloads the "standard" interface from the puller to the LoadSpec. The only thing segment information was being used for in the puller was to get the loadSpec, so I simply moved where the guaranteed interface resides since we have an actual LoadSpec now.
  2. The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files for example), and when to retry.
  3. CompressionUtils does most of the heavy lifting when it can
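
A minimal sketch of that workflow (hypothetical names, not the actual Druid classes):

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Step 2: the Puller knows how to make an InputStream and when a failure is retryable.
interface Puller {
  InputStream openStream() throws IOException;
  boolean shouldRetry(Throwable t);
}

// Step 1: the LoadSpec guarantees the interface and calls the right Puller
// with the right parameters.
class ExampleLoadSpec {
  private final Puller puller;

  ExampleLoadSpec(Puller puller) {
    this.puller = puller;
  }

  // Step 3: a CompressionUtils-style helper would do the unpacking here;
  // this sketch just copies the bytes into the target directory.
  File load(File outDir, String fileName) throws IOException {
    File out = new File(outDir, fileName);
    try (InputStream in = puller.openStream()) {
      Files.copy(in, out.toPath());
    }
    return out;
  }
}
```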

I also want to try this on some S3 segments just to make sure it didn't break anything unexpected, but I would like feedback on the approach first. (Update: tested, seems to be fine.)

This MIGHT be subject to http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144. Added unit tests, and adding more to double-check. (Update: fixed with a workaround in java-util.)

Contributor Author


Was this double-adding segments?

Contributor Author


Looked deeper, I think it is a bit redundant but doesn't double-account for segments.

This was referenced Feb 18, 2015
@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 3 times, most recently from 8e3016d to b1cafbc on February 20, 2015 22:08
@drcrallen drcrallen force-pushed the uriSegmentLoaders branch from b1cafbc to 4912d3d on March 2, 2015 22:15
@drcrallen
Contributor Author

@fjy : Does this work better for your polymorphic desires?

@drcrallen
Contributor Author

I might be missing a binding somewhere, I'll do more testing later this week, but that won't change the overall approach.

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch from 4912d3d to d1369bc on March 2, 2015 22:25
@xvrl
Member

xvrl commented Mar 3, 2015

can we provide a little bit more context in the description as to what we are trying to achieve and why this is needed? This may not be obvious to an outsider.

@drcrallen
Contributor Author

@xvrl : updated, please let me know if that is more clear.

@xvrl xvrl added the Discuss label Mar 3, 2015
@cheddar
Contributor

cheddar commented Mar 3, 2015

The reason for the separation as it currently exists is to give implementations the ability to define

  1. How they want to store data on their given underlying storage layer
  2. The best way to get at the data that is stored in the underlying storage layer

The reason for the jankiness in the current implementation is that it is mid-migration. I agree that it needs to be improved, but I think that the implementation changes done here lead to less flexibility and a more difficult implementation.

Specifically, if we build an API around URIs, then we are essentially forcing everyone to come up with a mechanism of storing all of the data they might need/want to store inside of a URI, where working with just JSON objects and Java fields can be a lot simpler.

IIRC, the specific problem that needed to be resolved in order to really do this section of code "right" was the DI layer. Now that we have Guice injection integrated with Jackson, we should be able to do a really clean API that gives control over to the person integrating with whatever storage they are integrating with.

Specifically, I'd propose that the LoadSpec API should be

```java
public interface LoadSpec {
  public void get(File outDir);
}
```

That is, you pass it a directory and it stores the files in there. For our current implementations, the implementation of this loadSpec would essentially get the metadata as well as the relevant "Puller" injected and use those. For future implementations (or things that are amenable to URI/URL semantics), there can be an implementation of this interface which specializes to utilizing URIs instead.

This

  1. Meets the original requirements of giving implementors full control over how they are storing things
  2. Works around issues like: what if you want to retry a failure but don't have a Throwable? With the API change made in the druid-api library, I now have to create a brand-new Throwable, count on the outer logic to catch it and pass it off to my Predicate so that I can choose to retry. That's leaking a lot in the abstraction.
  3. Completely eliminates the DataSegment argument from the loading interface and generifies it to something that should be usable for other storage types as well.
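
For illustration, the proposed interface with a URI-specialized implementation layered on top might look like this (names are hypothetical):

```java
import java.io.File;
import java.net.URI;

interface LoadSpec {
  void get(File outDir); // pass a directory; the implementation stores the files there
}

// One possible concrete implementation that specializes to URI semantics.
class UriLoadSpec implements LoadSpec {
  private final URI uri;

  UriLoadSpec(URI uri) {
    this.uri = uri;
  }

  // The scheme would select which injected puller to use (s3, hdfs, file, ...).
  String scheme() {
    return uri.getScheme();
  }

  @Override
  public void get(File outDir) {
    // look up the puller for scheme() and stream the segment into outDir
  }
}
```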

What do you think?

@drcrallen
Contributor Author

One reason I've been pushing back against the "just give LoadSpec a directory" approach is that at its core it ends up being "let's not use a standard resource identification schema; let's make our own non-standard schema", which is something I'd like to avoid (edit: unless there's a good reason a standard resource identification scheme won't work).

There are two key pieces of data encoded in the loadSpec aside from the individual data identifier: 1. the filesystem, and 2. the data container (gz, zip, etc.).

I understand the desire to have a simple interface, but with this approach I foresee 3 options:

  1. Each possible way of storing data has its own LoadSpec for each underlying file system ({zip, gz, directory, tar.gz, whatever_other_format} × {hdfs, s3, file, cassandra})
  2. Each filesystem LoadSpec handles its own file loading in completely inconsistent ways (what we have now) with regard to retries and which formats are supported.
  3. Each possible filesystem supplies a handle to the data container, and the OmniSegmentLoader figures out how to handle that container properly (what I have proposed)

Obviously I'm partial to 3. I'm OK with going route 1 above, even though I think it's unclean compared to 3, but I am very much against 2.

@cheddar
Contributor

cheddar commented Mar 3, 2015

So, I don't see an argument against 2 aside from you being very against it. Why is it bad to allow implementations to control the "best" way to store the data and subsequently retrieve it?

Also, why is it bad to allow, for example, HDFS to automatically retry things in its client while s3 doesn't implement that for us so we have to implement it ourselves? If you have something that does retries on its own already and you implement retries on top of it, you are now multiplying out the retries.

On the question of using a URI, your argument is essentially that a URI is a standard thing, so therefore we should take all of the things in the JSON fields and map them into a URI and then write code to map out of that URI back into JSON fields. Fwiw, JSON is also a standard and we already have stuff setup to serialize into and out of it. The fact that the switch to URIs is the equivalent of just adding a new marshalling layer to the code is very clear from looking at the Cassandra implementation (one big glaring thing is the fact that you are choosing not to implement the URI-based method...).

Note, I'm not against using URIs for things that already leverage URIs and I'm also not against having something that is intelligent about various storage formats at the end of those URIs. I am against making the assumption that everything must fit that mold forever and always. If you adopt my interface, the Cassandra stuff can implement that and you can have all of your logic and figuring out of things hidden inside of another concrete implementation of LoadSpec that leverages URIs. That's perfectly fine. If we adopt your changes, we do not have that flexibility.

@drcrallen
Contributor Author

The reason I'm against the second option above is the complete lack of consistency in how any data is handled. I am proposing URI as a standard marshaling schema for resource identification because it is a pretty universally used resource identification scheme. JSON is a standard for data transfer, but to my knowledge there is no standard for resource identification with regards to JSON.

Regarding why retries at the OmniSegmentLoader level: 1. It forces the developer to address retries as an inherent desire for the data. 2. Retry loops at the OmniSegmentLoader level are still controllable by the Puller via the retry predicate.

Regarding "Why is it bad to allow implementations to control the "best" way to store the data and subsequently retrieve it?": it goes back to consistency and at what level we want consistency to be enforced. I was under the impression that there should be consistency in the handling of data containers, and the filesystem should be the pluggable portion. From most of your arguments I'm gathering that you feel the only consistency should be that a pull to a directory can be called, and if any LoadSpec has to handle that in either wildly different or completely code-copied ways, then that's how it's going to be. Please correct me if that's wrong.
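
The retry arrangement being described — the loader drives the loop, but the puller stays in control through its retry predicate — could be sketched like so (a hypothetical helper, not the actual RetryUtils API):

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public final class RetrySketch {
  // Retry the task up to maxTries times, but only for failures the
  // puller-supplied predicate considers transient.
  public static <T> T retry(Callable<T> task, Predicate<Throwable> shouldRetry, int maxTries)
      throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        if (attempt >= maxTries || !shouldRetry.test(e)) {
          throw e; // not retryable, or out of attempts
        }
      }
    }
  }
}
```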

@cheddar
Contributor

cheddar commented Mar 4, 2015

@drcrallen Yes, I believe you are correct in capturing my thoughts on the matter. A little bit of extra information is that I was thinking that exposing things in reusable code snippets and allowing the implementations to use them would be beneficial (hence the "CompressionUtils") in lowering the burden of repeated implementations.

That said, I'm not against having an injection point that just implements an FS, allowing for the plugging that you are talking about. I would prefer that both exist (and the docs could generally point people to the FS one and we reserve conversations on the mailing list for telling people about the other one).

One big reason for preferring both is that the way you have changed things right now, the change cannot be made backwards compatible. I'm not sure if anyone is currently adding their own implementation as an extension (there was someone who implemented something on RackSpace that I'm not sure of the status of), but if someone is and they update their version of druid, all of a sudden their code won't work.

If we adjust the internal interfaces in the way I proposed, we can

  1. Have a backwards compatible implementation that uses the old interface
  2. Introduce your new interface (with a different name) and have an implementation that uses that
  3. Implement the Cassandra loader in terms of my proposed interface

This gives us the simplicity of just implementing an FS for the cases where that can be done easily, keeps an injection point for uses that don't lend themselves to URIs as well, and allows us to maintain backwards compatibility with old interface implementations, which can be deprecated and allow us a forward migration strategy.

I.e. hopefully it's a win for everyone, and it should clean up various parts of the code where you are checking for URIs and such (the URI implementation can just be pure URI; it doesn't have to be dirtied with legacy support).

@drcrallen
Contributor Author

@cheddar : I'm at mesos hackweek this week, so my time to modify this is limited, but I think unloading some of the brute work to a combination of FileUtils / CompressionUtils / RetryUtils is palatable to me. So the overall statement would be like this:

  1. LoadSpec would be a (hopefully thin) helper to make sure the right combination of FileUtils / CompressionUtils / RetryUtils and the correct Puller are called for the given load type.
  2. LoadSpec simply guarantees that a destination directory be passed in as a File type
  3. Any new file system for deep storage would need to make sure that a LoadSpec and the appropriate puller/pusher/killer was available
  4. There are no guarantees as to what types of file/data containers each filesystem can support. Support for containers exists in CompressionUtils.

Does that sound like a reasonable solution statement?

@cheddar
Contributor

cheddar commented Mar 5, 2015

Yes, that sounds reasonable to me. Also, if you want to preserve working through URIs, I think a LoadSpec implementation based on loading a URI can make sense.

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 2 times, most recently from 4a6efce to 06522fa on March 9, 2015 19:52
@drcrallen drcrallen changed the title from "Add URI handling to SegmentPullers" to "Overhaul of SegmentPullers to add consistency and retries" on Mar 9, 2015
@drcrallen
Contributor Author

@cheddar : Updated. Can you please take a look at the updated PR description and code and see if that fits in better with your workflow desires?

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 3 times, most recently from 49ca122 to 58c4df8 on March 9, 2015 21:43
Contributor Author


This was a weird error I was getting where sometimes beanInstance was being passed as String of the class rather than a Key of the class.

Member


This may not be that odd, the documentation for InjectableValues states that valueId can be an "Object that identifies value to inject; may be a simple name or more complex identifier object, whatever provider needs"

It might make sense to add a comment though as to why this check is needed.

Member


also, instanceof Key / String should be enough here, no?
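
As a sketch of the check under discussion (with a local stand-in for Guice's Key, since the real valueId may arrive as either a Key or a plain String):

```java
// Local stand-in for com.google.inject.Key, just to keep the sketch self-contained.
final class FakeKey {
  final String name;

  FakeKey(String name) {
    this.name = name;
  }
}

final class ValueIdSketch {
  // Jackson's InjectableValues docs say valueId "may be a simple name or more
  // complex identifier object", so guard on the runtime type before casting.
  static String describe(Object valueId) {
    if (valueId instanceof FakeKey) {
      return "key:" + ((FakeKey) valueId).name;
    } else if (valueId instanceof String) {
      return "string:" + valueId;
    }
    throw new IllegalArgumentException("unknown valueId type: " + valueId.getClass());
  }
}
```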

@drcrallen
Contributor Author

@cheddar : Added a basic datatype to the return of the method in LoadSpec

Also added basic sanity check to the SegmentLoader which checks the expected vs the written size of the segments and logs a warning.
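
That sanity check might look something like this (illustrative only; the real code would log through Druid's logger rather than stderr):

```java
final class SizeCheckSketch {
  // Compare the size the load spec reported against the bytes actually
  // written; warn (rather than fail) on mismatch.
  static boolean sizesMatch(long expectedBytes, long writtenBytes) {
    if (expectedBytes != writtenBytes) {
      System.err.printf("Expected [%d] bytes but wrote [%d] bytes%n", expectedBytes, writtenBytes);
      return false;
    }
    return true;
  }
}
```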

Contributor


why are these required?

Contributor Author


<scope>test</scope> for unit tests.

@cheddar
Contributor

cheddar commented Mar 26, 2015

Aside from a few "?"s (and me commenting on the commit instead of the PR... sorry), this looks generally in line with my comments.

@drcrallen
Contributor Author

@cheddar : I had some "magic" logic like that previously related to some other stuff, and there was concern raised by @gianm about it.

In this particular case the most concerning part is if it is gzip, would you want it to rename the file to the base name and output to the directory, or are you wanting to pass in a file you want the ungzip'd data to go to?

There is no obvious auto-magic solution in such a case without having flags which only pertain to the gzip case.

@drcrallen
Contributor Author

Regarding eliminating .gz code: it is used in some of the QTL stuff.

This PR is a baby step towards giving a more generic and standard interface between data. If I could pull in hadoop's abstract File handling without encountering Hadoop dependency hell I would.

Contributor


can we leave a comment here about why smarter retries are necessary and remove the TODO?

Contributor Author


Fixed

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 2 times, most recently from f1420de to df47a23 on March 27, 2015 00:23
@cheddar
Contributor

cheddar commented Mar 27, 2015

On the decompress thing. I think it can make sense to have the semantics be "give me the directory to decompress files into" and in the gzip case, it keeps the original file name without the .gz (similar to what the unix gunzip does). For zip, it just unzips into the directory.
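
Under those semantics, the gzip target name would be derived roughly like this (a hypothetical helper; the actual behavior would live in CompressionUtils):

```java
import java.io.File;

final class GzNamingSketch {
  // "Give me the directory to decompress into": foo.bin.gz -> <outDir>/foo.bin,
  // keeping the original file name without the .gz, like unix gunzip does.
  static File gunzipTarget(File gzFile, File outDir) {
    String name = gzFile.getName();
    if (name.endsWith(".gz")) {
      name = name.substring(0, name.length() - ".gz".length());
    }
    return new File(outDir, name);
  }
}
```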

Contributor


I'm not sure what this test is trying to verify exactly. It looks like it's trying to verify that Jackson can instantiate an instance. If that is truly the case, then I think there's a lot of extra stuff in the setup that doesn't really help.

At a minimum, I'd recommend just switching the setup to not bother with a Guice injector and just add the GuiceAnnotationIntrospector as well as an InjectableValues that checks to make sure that it gets the Key and then just returns null.

Also, when you register stuff with Jackson, you can just use a SimpleModule and call registerSubType().

Contributor Author


I'll make some changes but the test is to verify that the minimal components needed to inject are well defined. It's more of a "Do I really understand all the injection dependencies" rather than "can Jackson do what Jackson does"

Contributor Author


Fixed

@drcrallen
Contributor Author

@cheddar : Ping! Last I heard you were pretty good with this one overall but had some small outstanding comments on the PR. I think those have been addressed.

@cheddar
Contributor

cheddar commented Mar 30, 2015

Seems fine to me

* Requires druid-io/druid-api#37
* Requires metamx/java-util#22
* Moves the puller logic to use a more standard workflow going through java-util helpers instead of re-writing the handlers for each impl
  * General workflow goes like this: 1) LoadSpec makes sure the correct Puller is called with the correct parameters. 2) The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files for example), and when to retry. 3) CompressionUtils does most of the heavy lifting when it can
fjy added a commit that referenced this pull request Mar 30, 2015
Overhaul of SegmentPullers to add consistency and retries
@fjy fjy merged commit 1e38def into apache:master Mar 30, 2015
@fjy fjy deleted the uriSegmentLoaders branch March 30, 2015 20:04
@fjy fjy restored the uriSegmentLoaders branch March 30, 2015 20:15
@xvrl xvrl removed the Discuss label May 13, 2015