
Overhaul of SegmentPullers to add consistency and retries#1132

Merged
fjy merged 1 commit into apache:master from metamx:uriSegmentLoaders
Mar 30, 2015

Conversation

@drcrallen
Contributor

The goal of this is to make the segment pullers a little more general purpose. Specifically, this uses URIs when possible. URIs have a standard, well-defined format (unlike the free-form loadSpec used previously). This also unifies a lot of the way segments are pulled, so that any datasource should be able to support a zip or gz request in a more standard manner.

General workflow goes like this:

  1. LoadSpec makes sure the correct Puller is called with the correct parameters. This offloads the "standard" interface from the puller to the LoadSpec. The only thing segment information was being used for in the puller was to get the loadSpec, so I simply moved where the guaranteed interface resides since we have an actual LoadSpec now.
  2. The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files for example), and when to retry.
  3. CompressionUtils does most of the heavy lifting when it can
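
A minimal sketch of that workflow (hypothetical names, not the actual Druid classes):

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Step 2: the Puller knows how to make an InputStream and when a failure is retryable.
interface Puller {
  InputStream openStream() throws IOException;
  boolean shouldRetry(Throwable t);
}

// Step 1: the LoadSpec guarantees the interface and calls the right Puller
// with the right parameters.
class ExampleLoadSpec {
  private final Puller puller;

  ExampleLoadSpec(Puller puller) {
    this.puller = puller;
  }

  // Step 3: a CompressionUtils-style helper would do the unpacking here;
  // this sketch just copies the bytes into the target directory.
  File load(File outDir, String fileName) throws IOException {
    File out = new File(outDir, fileName);
    try (InputStream in = puller.openStream()) {
      Files.copy(in, out.toPath());
    }
    return out;
  }
}
```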

I also want to try this on some S3 segments just to make sure it didn't break anything unexpected, but I would like feedback on the approach first. (Update: tested, seems to be fine.)

This MIGHT be subject to http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144. Added unit tests, and adding more to double-check. (Update: fixed with a workaround in java-util.)

Contributor Author


Was this double-adding segments?

Contributor Author


Looked deeper, I think it is a bit redundant but doesn't double-account for segments.

This was referenced Feb 18, 2015
@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 3 times, most recently from 8e3016d to b1cafbc on February 20, 2015 22:08
@drcrallen drcrallen force-pushed the uriSegmentLoaders branch from b1cafbc to 4912d3d on March 2, 2015 22:15
@drcrallen
Contributor Author

@fjy : Does this work better for your polymorphic desires?

@drcrallen
Contributor Author

I might be missing a binding somewhere, I'll do more testing later this week, but that won't change the overall approach.

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch from 4912d3d to d1369bc on March 2, 2015 22:25
@xvrl
Member

xvrl commented Mar 3, 2015

can we provide a little bit more context in the description as to what we are trying to achieve and why this is needed? This may not be obvious to an outsider.

@drcrallen
Contributor Author

@xvrl : updated, please let me know if that is more clear.

@xvrl xvrl added the Discuss label Mar 3, 2015
@cheddar
Contributor

cheddar commented Mar 3, 2015

The reason for the separation as it currently exists is to give implementations the ability to define

  1. How they want to store data on their given underlying storage layer
  2. The best way to get at the data that is stored in the underlying storage layer

The reason for the jankiness in the current implementation is that it is mid-migration. I agree that it needs to be improved, but I think that the implementation changes done here lead to less flexibility and a more difficult implementation.

Specifically, if we build an API around URIs, then we are essentially forcing everyone to come up with a mechanism of storing all of the data they might need/want to store inside of a URI, where working with just JSON objects and Java fields can be a lot simpler.

IIRC, the specific problem that needed to be resolved in order to really do this section of code "right" was the DI layer. Now that we have Guice injection integrated with Jackson, we should be able to do a really clean API that gives control over to the person integrating with whatever storage they are integrating with.

Specifically, I'd propose that the LoadSpec API should be

```java
public interface LoadSpec {
  public void get(File outDir);
}
```

That is, you pass it a directory and it stores the files in there. For our current implementations, the implementation of this loadSpec would essentially get the metadata as well as the relevant "Puller" injected and use those. For future implementations (or things that are amenable to URI/URL semantics), there can be an implementation of this interface which specializes to utilizing URIs instead.

This

  1. Meets the original requirements of giving implementors full control over how they are storing things
  2. Works around issues like: what if you want to retry a failure but don't have a Throwable? With the API change made in the druid-api library, I now have to create a brand-new Throwable, count on the outer logic to catch it and pass it off to my Predicate so that I can choose to retry. That's leaking a lot in the abstraction.
  3. Completely eliminates the DataSegment argument from the loading interface and generifies it to something that should be usable for other storage types as well.
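
For illustration, the proposed interface with a URI-specialized implementation layered on top might look like this (names are hypothetical):

```java
import java.io.File;
import java.net.URI;

interface LoadSpec {
  void get(File outDir); // pass a directory; the implementation stores the files there
}

// One possible concrete implementation that specializes to URI semantics.
class UriLoadSpec implements LoadSpec {
  private final URI uri;

  UriLoadSpec(URI uri) {
    this.uri = uri;
  }

  // The scheme would select which injected puller to use (s3, hdfs, file, ...).
  String scheme() {
    return uri.getScheme();
  }

  @Override
  public void get(File outDir) {
    // look up the puller for scheme() and stream the segment into outDir
  }
}
```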

What do you think?

@drcrallen
Contributor Author

One reason I've been pushing back against the "just give LoadSpec a directory" approach is that at its core it ends up being "let's not use a standard resource identification schema; let's make our own non-standard schema", which is something I'd like to avoid (edit: unless there's a good reason a standard resource identification scheme won't work).

There are two key pieces of data encoded in the loadSpec aside from the individual data identifier: 1. the filesystem, and 2. the data container (gz, zip, etc.).

I understand the desire to have a simple interface, but with this approach I foresee 3 options:

  1. Each possible way of storing data has its own LoadSpec for each underlying file system ({zip, gz, directory, tar.gz, whatever_other_format} × {hdfs, s3, file, cassandra})
  2. Each filesystem LoadSpec handles its own file loading in completely inconsistent ways (what we have now) with regard to retries and which formats are supported.
  3. Each possible filesystem supplies a handle to the data container, and the OmniSegmentLoader figures out how to handle that container properly (what I have proposed)

Obviously I'm partial to 3. I'm OK with going route 1 above, even though I think it's unclean compared to 3, but I am very much against 2.

@cheddar
Contributor

cheddar commented Mar 3, 2015

So, I don't see an argument against 2 aside from you being very against it. Why is it bad to allow implementations to control the "best" way to store the data and subsequently retrieve it?

Also, why is it bad to allow, for example, HDFS to automatically retry things in its client while s3 doesn't implement that for us so we have to implement it ourselves? If you have something that does retries on its own already and you implement retries on top of it, you are now multiplying out the retries.

On the question of using a URI, your argument is essentially that a URI is a standard thing, so therefore we should take all of the things in the JSON fields and map them into a URI and then write code to map out of that URI back into JSON fields. Fwiw, JSON is also a standard and we already have stuff setup to serialize into and out of it. The fact that the switch to URIs is the equivalent of just adding a new marshalling layer to the code is very clear from looking at the Cassandra implementation (one big glaring thing is the fact that you are choosing not to implement the URI-based method...).

Note, I'm not against using URIs for things that already leverage URIs and I'm also not against having something that is intelligent about various storage formats at the end of those URIs. I am against making the assumption that everything must fit that mold forever and always. If you adopt my interface, the Cassandra stuff can implement that and you can have all of your logic and figuring out of things hidden inside of another concrete implementation of LoadSpec that leverages URIs. That's perfectly fine. If we adopt your changes, we do not have that flexibility.

@drcrallen
Contributor Author

The reason I'm against the second option above is the complete lack of consistency in how any data is handled. I am proposing URI as a standard marshaling schema for resource identification because it is a pretty universally used resource identification scheme. JSON is a standard for data transfer, but to my knowledge there is no standard for resource identification with regards to JSON.

Regarding why retries at the OmniSegmentLoader level: 1. It forces the developer to address retries as an inherent desire for the data. 2. Retry loops at the OmniSegmentLoader level are still controllable by the Puller via the retry predicate.

Regarding "Why is it bad to allow implementations to control the "best" way to store the data and subsequently retrieve it?": it goes back to consistency and at what level we want consistency to be enforced. I was under the impression that there should be consistency in the handling of data containers, and the filesystem should be the pluggable portion. From most of your arguments I'm gathering that you feel the only consistency should be that a pull to a directory can be called, and if any LoadSpec has to handle that in either wildly different or completely code-copied ways, then that's how it's going to be. Please correct me if that's wrong.
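
The retry arrangement being described — the loader drives the loop, but the puller stays in control through its retry predicate — could be sketched like so (a hypothetical helper, not the actual RetryUtils API):

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public final class RetrySketch {
  // Retry the task up to maxTries times, but only for failures the
  // puller-supplied predicate considers transient.
  public static <T> T retry(Callable<T> task, Predicate<Throwable> shouldRetry, int maxTries)
      throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        if (attempt >= maxTries || !shouldRetry.test(e)) {
          throw e; // not retryable, or out of attempts
        }
      }
    }
  }
}
```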

@cheddar
Contributor

cheddar commented Mar 4, 2015

@drcrallen Yes, I believe you are correct in capturing my thoughts on the matter. A little bit of extra information is that I was thinking that exposing things in reusable code snippets and allowing the implementations to use them would be beneficial (hence the "CompressionUtils") in lowering the burden of repeated implementations.

That said, I'm not against having an injection point that just implements an FS, allowing for the plugging that you are talking about. I would prefer that both exist (and the docs could generally point people to the FS one and we reserve conversations on the mailing list for telling people about the other one).

One big reason for preferring both is that the way you have changed things right now, the change cannot be made backwards compatible. I'm not sure if anyone is currently adding their own implementation as an extension (there was someone who implemented something on RackSpace that I'm not sure of the status of), but if someone is and they update their version of druid, all of a sudden their code won't work.

If we adjust the internal interfaces in the way I proposed, we can

  1. Have a backwards compatible implementation that uses the old interface
  2. Introduce your new interface (with a different name) and have an implementation that uses that
  3. Implement the Cassandra loader in terms of my proposed interface

This gives us the simplicity of just implementing an FS for the cases where that can be done easily, keeps an injection point for uses that don't lend themselves to URIs as well, and allows us to maintain backwards compatibility with old interface implementations, which can be deprecated and allow us a forward migration strategy.

I.e. hopefully it's a win for everyone, and it should clean up various parts of the code where you are checking for URIs and such (the URI implementation can just be pure URI; it doesn't have to be dirtied with legacy support).

@drcrallen
Contributor Author

@cheddar : I'm at mesos hackweek this week, so my time to modify this is limited, but I think unloading some of the brute work to a combination of FileUtils / CompressionUtils / RetryUtils is palatable to me. So the overall statement would be like this:

  1. LoadSpec would be a (hopefully thin) helper to make sure the right combination of FileUtils / CompressionUtils / RetryUtils and the correct Puller are called for the given load type.
  2. LoadSpec simply guarantees that a destination directory be passed in as a File type
  3. Any new file system for deep storage would need to make sure that a LoadSpec and the appropriate puller/pusher/killer was available
  4. There are no guarantees as to what types of file/data containers each filesystem can support. Support for containers exists in CompressionUtils.

Does that sound like a reasonable solution statement?

@cheddar
Contributor

cheddar commented Mar 5, 2015

Yes, that sounds reasonable to me. Also, if you want to preserve working through URIs, I think a LoadSpec implementation based on loading a URI can make sense.

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 2 times, most recently from 4a6efce to 06522fa on March 9, 2015 19:52
@drcrallen drcrallen changed the title from "Add URI handling to SegmentPullers" to "Overhaul of SegmentPullers to add consistency and retries" on Mar 9, 2015
@drcrallen
Contributor Author

@cheddar : Updated. Can you please take a look at the updated PR description and code and see if that fits in better with your workflow desires?

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 3 times, most recently from 49ca122 to 58c4df8 on March 9, 2015 21:43
Contributor Author


This was a weird error I was getting where sometimes beanInstance was being passed as String of the class rather than a Key of the class.

Member


This may not be that odd, the documentation for InjectableValues states that valueId can be an "Object that identifies value to inject; may be a simple name or more complex identifier object, whatever provider needs"

It might make sense to add a comment though as to why this check is needed.

Member


also, instanceof Key / String should be enough here, no?
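
As a sketch of the check under discussion (with a local stand-in for Guice's Key, since the real valueId may arrive as either a Key or a plain String):

```java
// Local stand-in for com.google.inject.Key, just to keep the sketch self-contained.
final class FakeKey {
  final String name;

  FakeKey(String name) {
    this.name = name;
  }
}

final class ValueIdSketch {
  // Jackson's InjectableValues docs say valueId "may be a simple name or more
  // complex identifier object", so guard on the runtime type before casting.
  static String describe(Object valueId) {
    if (valueId instanceof FakeKey) {
      return "key:" + ((FakeKey) valueId).name;
    } else if (valueId instanceof String) {
      return "string:" + valueId;
    }
    throw new IllegalArgumentException("unknown valueId type: " + valueId.getClass());
  }
}
```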

@drcrallen
Contributor Author

@cheddar : Added a basic datatype to the return of the method in LoadSpec

Also added basic sanity check to the SegmentLoader which checks the expected vs the written size of the segments and logs a warning.
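
That sanity check might look something like this (illustrative only; the real code would log through Druid's logger rather than stderr):

```java
final class SizeCheckSketch {
  // Compare the size the load spec reported against the bytes actually
  // written; warn (rather than fail) on mismatch.
  static boolean sizesMatch(long expectedBytes, long writtenBytes) {
    if (expectedBytes != writtenBytes) {
      System.err.printf("Expected [%d] bytes but wrote [%d] bytes%n", expectedBytes, writtenBytes);
      return false;
    }
    return true;
  }
}
```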

Contributor


why are these required?

Contributor Author


<scope>test</scope> for unit tests.

@cheddar
Contributor

cheddar commented Mar 26, 2015

Aside from a few "?"s (and me commenting on the commit instead of the PR... sorry), this looks generally in line with my comments.

@drcrallen
Contributor Author

@cheddar : I had some "magic" logic like that previously related to some other stuff, and there was concern raised by @gianm about it.

In this particular case the most concerning part is if it is gzip, would you want it to rename the file to the base name and output to the directory, or are you wanting to pass in a file you want the ungzip'd data to go to?

There is no obvious auto-magic solution in such a case without having flags which only pertain to the gzip case.

@drcrallen
Contributor Author

Regarding eliminating .gz code: it is used in some of the QTL stuff.

This PR is a baby step towards giving a more generic and standard interface between data. If I could pull in hadoop's abstract File handling without encountering Hadoop dependency hell I would.

Contributor


can we leave a comment here about why smarter retries are necessary and remove the TODO?

Contributor Author


Fixed

@drcrallen drcrallen force-pushed the uriSegmentLoaders branch 2 times, most recently from f1420de to df47a23 on March 27, 2015 00:23
@cheddar
Contributor

cheddar commented Mar 27, 2015

On the decompress thing. I think it can make sense to have the semantics be "give me the directory to decompress files into" and in the gzip case, it keeps the original file name without the .gz (similar to what the unix gunzip does). For zip, it just unzips into the directory.
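
Under those semantics, the gzip target name would be derived roughly like this (a hypothetical helper; the actual behavior would live in CompressionUtils):

```java
import java.io.File;

final class GzNamingSketch {
  // "Give me the directory to decompress into": foo.bin.gz -> <outDir>/foo.bin,
  // keeping the original file name without the .gz, like unix gunzip does.
  static File gunzipTarget(File gzFile, File outDir) {
    String name = gzFile.getName();
    if (name.endsWith(".gz")) {
      name = name.substring(0, name.length() - ".gz".length());
    }
    return new File(outDir, name);
  }
}
```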

Contributor


I'm not sure what this test is trying to verify exactly. It looks like it's trying to verify that Jackson can instantiate an instance. If that is truly the case, then I think there's a lot of extra stuff in the setup that doesn't really help.

At a minimum, I'd recommend just switching the setup to not bother with a Guice injector and just add the GuiceAnnotationIntrospector as well as an InjectableValues that checks to make sure that it gets the Key and then just returns null.

Also, when you register stuff with Jackson, you can just use a SimpleModule and call registerSubType().

Contributor Author


I'll make some changes but the test is to verify that the minimal components needed to inject are well defined. It's more of a "Do I really understand all the injection dependencies" rather than "can Jackson do what Jackson does"

Contributor Author


Fixed

@drcrallen
Contributor Author

@cheddar : Ping! Last I heard you were pretty good with this one overall but had some small outstanding comments on the PR. I think those have been addressed.

@cheddar
Contributor

cheddar commented Mar 30, 2015

Seems fine to me

* Requires druid-io/druid-api#37
* Requires metamx/java-util#22
* Moves the puller logic to use a more standard workflow going through java-util helpers instead of re-writing the handlers for each impl
  * General workflow goes like this: 1) LoadSpec makes sure the correct Puller is called with the correct parameters. 2) The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files for example), and when to retry. 3) CompressionUtils does most of the heavy lifting when it can
fjy added a commit that referenced this pull request Mar 30, 2015
Overhaul of SegmentPullers to add consistency and retries
@fjy fjy merged commit 1e38def into apache:master Mar 30, 2015
@fjy fjy deleted the uriSegmentLoaders branch March 30, 2015 20:04
@fjy fjy restored the uriSegmentLoaders branch March 30, 2015 20:15
@xvrl xvrl removed the Discuss label May 13, 2015