Overhaul of SegmentPullers to add consistency and retries#1132
fjy merged 1 commit into apache:master from
Conversation
Was this double-adding segments?
Looked deeper, I think it is a bit redundant but doesn't double-account for segments.
(force-pushed 8e3016d to b1cafbc)
(force-pushed b1cafbc to 4912d3d)
@fjy : Does this work better for your polymorphic desires?
I might be missing a binding somewhere; I'll do more testing later this week, but that won't change the overall approach.
(force-pushed 4912d3d to d1369bc)
Can we provide a little bit more context in the description as to what we are trying to achieve and why this is needed? This may not be obvious to an outsider.
@xvrl : Updated, please let me know if that is clearer.
The reason for the separation as it currently exists is to give implementations the ability to define

The reason for the jankiness in the current implementation is that it is mid-migration. I agree that it needs to be improved, but I think the implementation changes done here lead to less flexibility and a more difficult implementation. Specifically, if we build an API around URIs, then we are essentially forcing everyone to come up with a mechanism for storing all of the data they might need or want to store inside of a URI, where working with just JSON objects and Java fields can be a lot simpler.

IIRC, the specific problem that needed to be resolved in order to really do this section of code "right" was the DI layer. Now that we have Guice injection integrated with Jackson, we should be able to build a really clean API that gives control over to the person integrating with whatever storage they are integrating with. Specifically, I'd propose that the LoadSpec API should be

That is, you pass it a directory and it stores the files in there. For our current implementations, the implementation of this loadSpec would essentially get the metadata as well as the relevant "Puller" injected and use those. For future implementations (or things that are amenable to URI/URL semantics), there can be an implementation of this interface which specializes in utilizing URIs instead. This

What do you think?
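The interface body itself was elided from the comment above, so the following is only a hedged reconstruction of the "pass it a directory and it stores the files in there" proposal. The names `LoadSpecResult`, `loadSegment`, and the toy local-file implementation are assumptions for illustration, not the actual Druid API.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch of the directory-based LoadSpec API proposed above.
interface LoadSpec {
    // Pull the segment's files into destDir; report what was written.
    LoadSpecResult loadSegment(File destDir) throws IOException;
}

final class LoadSpecResult {
    private final long size; // total bytes written into the directory

    LoadSpecResult(long size) { this.size = size; }

    long getSize() { return size; }
}

// Toy local-filesystem implementation, purely for illustration:
// "pulls" a single source file by copying it into the target directory.
final class LocalFileLoadSpec implements LoadSpec {
    private final File source;

    LocalFileLoadSpec(File source) { this.source = source; }

    @Override
    public LoadSpecResult loadSegment(File destDir) throws IOException {
        File dest = new File(destDir, source.getName());
        Files.copy(source.toPath(), dest.toPath());
        return new LoadSpecResult(dest.length());
    }
}

public class LoadSpecDemo {
    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("segments").toFile();
        File src = File.createTempFile("segment", ".bin");
        Files.write(src.toPath(), new byte[]{1, 2, 3, 4});

        LoadSpecResult result = new LocalFileLoadSpec(src).loadSegment(dir);
        System.out.println(result.getSize()); // 4
    }
}
```

A real implementation would get its metadata and the relevant "Puller" injected instead of holding a `File` directly, as described in the comment.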
One reason I've been pushing back against the "just give LoadSpec a directory" approach is that at its core it ends up being "let's not use a standard resource identification schema, let's make our own non-standard schema", which is something I'd like to avoid (edit: unless there's a good reason a standard resource identification scheme won't work). There are two key pieces of data encoded in the loadSpec aside from the individual data identifier:

1. The filesystem
2. The data container (gz, zip, etc.)

I understand the desire to have a simple interface, but with this approach I foresee 3 options:

Obviously I'm partial to 3, and I'm OK with going route 1 above even though I think it's unclean compared to 3, but I am very against 2.
So, I don't see an argument against 2 aside from you being very against it. Why is it bad to allow implementations to control the "best" way to store the data and subsequently retrieve it? Also, why is it bad to allow, for example, HDFS to automatically retry things in its client, while S3 doesn't implement that for us so we have to implement it ourselves? If you have something that already does retries on its own and you implement retries on top of it, you are now multiplying out the retries.

On the question of using a URI, your argument is essentially that a URI is a standard thing, so therefore we should take all of the things in the JSON fields, map them into a URI, and then write code to map out of that URI back into JSON fields. FWIW, JSON is also a standard, and we already have things set up to serialize into and out of it. The fact that the switch to URIs is equivalent to just adding a new marshalling layer to the code is very clear from looking at the Cassandra implementation (one big glaring thing is the fact that you are choosing not to implement the URI-based method...).

Note, I'm not against using URIs for things that already leverage URIs, and I'm also not against having something that is intelligent about various storage formats at the end of those URIs. I am against making the assumption that everything must fit that mold forever and always. If you adopt my interface, the Cassandra stuff can implement it, and you can have all of your logic and figuring-out of things hidden inside another concrete implementation of LoadSpec that leverages URIs. That's perfectly fine. If we adopt your changes, we do not have that flexibility.
The reason I'm against the second option above is that there is a complete lack of consistency in how any data is handled. I am proposing URI as a standard marshalling schema for resource identification because it is a pretty universally used resource identification schema. JSON is a standard for data transfer, but to my knowledge there is no standard for resource identification with regards to JSON.

Regarding why retries live at the OmniSegmentLoader level:

1. It forces the developer to address retries as an inherent desire for the data.
2. Retry loops at the OmniSegmentLoader level are still controllable by the Puller via the retry predicate.

Regarding "Why is it bad to allow implementations to control the 'best' way to store the data and subsequently retrieve it?": it goes back to consistency and at what level we want consistency to be enforced. I was under the impression that there should be consistency in the handling of data containers, and that the filesystem should be the pluggable portion. From most of your arguments I'm gathering that you feel the only consistency should be that a pull to a directory can be called, and if any LoadSpec has to handle that in either wildly different or completely code-copied ways, then that's how it's going to be. Please correct me if that's wrong.
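The division of labor described here, a single retry loop at the loader level with each Puller contributing only a predicate saying which failures are worth retrying, can be sketched roughly as follows. The names (`pull`, `shouldRetry`, `MAX_TRIES`) are illustrative assumptions, not the actual Druid API.

```java
import java.io.IOException;
import java.util.function.Predicate;

public class RetryLoopSketch {
    interface Puller {
        void pull() throws IOException;

        // e.g. an S3 puller might retry on socket timeouts but not on 404s
        Predicate<Throwable> shouldRetry();
    }

    static final int MAX_TRIES = 3;

    // The loader owns the retry loop; the Puller only classifies failures.
    static void pullWithRetries(Puller puller) throws IOException {
        for (int attempt = 1; ; attempt++) {
            try {
                puller.pull();
                return;
            } catch (IOException e) {
                if (attempt >= MAX_TRIES || !puller.shouldRetry().test(e)) {
                    throw e; // give up: out of tries or non-retriable error
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Fails twice with a retriable error, then succeeds on the third try.
        Puller flaky = new Puller() {
            public void pull() throws IOException {
                if (++calls[0] < 3) throw new IOException("transient");
            }
            public Predicate<Throwable> shouldRetry() {
                return t -> "transient".equals(t.getMessage());
            }
        };
        pullWithRetries(flaky);
        System.out.println("attempts=" + calls[0]); // attempts=3
    }
}
```

This also illustrates the earlier point about multiplied retries: if `pull()` itself retried internally, the total attempt count would be the product of the two loops.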
@drcrallen Yes, I believe you are correct in capturing my thoughts on the matter. A little bit of extra information: I was thinking that exposing things as reusable code snippets and allowing the implementations to use them would be beneficial (hence the "CompressionUtils") in lowering the burden of repeated implementations. That said, I'm not against having an injection point that just implements an FS, allowing for the plugging that you are talking about. I would prefer that both exist (and the docs could generally point people to the FS one, and we could reserve conversations on the mailing list for telling people about the other one).

One big reason for preferring both is that, the way you have changed things right now, the change cannot be made backwards compatible. I'm not sure if anyone is currently adding their own implementation as an extension (there was someone who implemented something on Rackspace that I'm not sure of the status of), but if someone is and they update their version of Druid, all of a sudden their code won't work. If we adjust the internal interfaces in the way I proposed, we can

This gives us the simplicity of just implementing an FS for the cases where that can be done easily, keeps an injection point for uses that don't lend themselves to URIs, and allows us to maintain backwards compatibility with old interface implementations, which can be deprecated and give us a forward migration strategy. I.e., hopefully it's a win for everyone, and it should clean up various parts of the code where you are checking for URIs and such (the URI implementation can just be pure URI; it doesn't have to be dirtied with legacy support).
@cheddar : I'm at Mesos hack week this week, so my time to modify this is limited, but I think unloading some of the brute work to a combination of FileUtils / CompressionUtils / RetryUtils is palatable to me. So the overall statement would be something like this:

Does that sound like a reasonable solution statement?
Yes, that sounds reasonable to me. Also, if you want to preserve working through URIs, I think a LoadSpec implementation based on loading a URI can make sense. |
(force-pushed 4a6efce to 06522fa)
@cheddar : Updated. Can you please take a look at the updated PR description and code and see if that fits better with your workflow desires?
(force-pushed 49ca122 to 58c4df8)
This was a weird error I was getting where sometimes beanInstance was being passed as a String of the class rather than a Key of the class.
This may not be that odd; the documentation for InjectableValues states that valueId can be an "Object that identifies value to inject; may be a simple name or more complex identifier object, whatever provider needs".
It might make sense to add a comment, though, as to why this check is needed.
also, instanceof Key / String should be enough here, no?
@cheddar : Added a basic datatype to the return of the method in LoadSpec. Also added a basic sanity check to the SegmentLoader which checks the expected vs the written size of the segments and logs a warning.
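The sanity check described above (compare the size the load reported against what actually landed on disk, and warn rather than fail on a mismatch) could look roughly like this. The method names and the warn-only behavior framing are assumptions based on the comment, not the actual SegmentLoader code.

```java
import java.io.File;
import java.nio.file.Files;

public class SizeCheckSketch {
    // Sum of regular-file sizes directly inside the segment directory.
    static long writtenSize(File dir) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) total += f.length();
            }
        }
        return total;
    }

    // Returns true when sizes match; logs a warning (instead of throwing)
    // on mismatch, per the comment above.
    static boolean checkSize(long expected, File dir) {
        long actual = writtenSize(dir);
        if (actual != expected) {
            System.err.printf("Expected %d bytes but wrote %d bytes%n", expected, actual);
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("seg").toFile();
        Files.write(new File(dir, "part.bin").toPath(), new byte[10]);
        System.out.println(checkSize(10, dir)); // true
        System.out.println(checkSize(11, dir)); // false, warning logged
    }
}
```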
(force-pushed 60108e2 to 34ecfd1)
<scope>test</scope> for unit tests.
Aside from a few "?"s (and me commenting on the commit instead of the PR... sorry), this looks generally in line with my comments. |
@cheddar : I had some "magic" logic like that previously, related to some other stuff, and @gianm raised concerns about it. In this particular case, the most concerning part is: if it is gzip, would you want it to rename the file to the base name and output to the directory, or do you want to pass in a file that the un-gzipped data should go to? There is no obvious auto-magic solution in such a case without having flags which only pertain to the gzip case.
Regarding eliminating

This PR is a baby step towards giving a more generic and standard interface between data. If I could pull in Hadoop's abstract file handling without encountering Hadoop dependency hell, I would.
can we leave a comment here about why smarter retries are necessary and remove the TODO?
(force-pushed f1420de to df47a23)
On the decompress thing: I think it can make sense to have the semantics be "give me the directory to decompress files into", and in the gzip case, it keeps the original file name without the
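Those semantics could be sketched as follows. The tail of the comment above is cut off, so the specific detail that the ".gz" suffix is what gets stripped is my assumption, as are the helper names.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GunzipToDirSketch {
    // Decompress gzFile into destDir, keeping the original file name
    // minus its compression suffix (assumed here to be ".gz").
    static File gunzipToDir(File gzFile, File destDir) throws Exception {
        String name = gzFile.getName();
        if (name.endsWith(".gz")) {
            name = name.substring(0, name.length() - ".gz".length());
        }
        File out = new File(destDir, name);
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile.toPath()));
             FileOutputStream os = new FileOutputStream(out)) {
            in.transferTo(os); // Java 9+: stream the decompressed bytes out
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("out").toFile();
        File gz = new File(dir, "index.drd.gz");
        try (GZIPOutputStream os = new GZIPOutputStream(new FileOutputStream(gz))) {
            os.write("hello".getBytes("UTF-8"));
        }
        File result = gunzipToDir(gz, dir);
        System.out.println(result.getName());   // index.drd
        System.out.println(new String(Files.readAllBytes(result.toPath()), "UTF-8")); // hello
    }
}
```

This sidesteps the flag question raised earlier: the caller only ever names a directory, and the file name is derived, never passed in.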
I'm not sure what this test is trying to verify exactly. It looks like it's trying to verify that Jackson can instantiate an instance. If that is truly the case, then I think there's a lot of extra stuff in the setup that doesn't really help.
At a minimum, I'd recommend switching the setup to not bother with a Guice injector and just add the GuiceAnnotationIntrospector as well as an InjectableValues that checks to make sure it gets the Key and then just returns null.
Also, when you register stuff with Jackson, you can just use a SimpleModule and call registerSubtypes().
I'll make some changes, but the test is to verify that the minimal components needed to inject are well defined. It's more of a "do I really understand all the injection dependencies?" than a "can Jackson do what Jackson does?"
(force-pushed df47a23 to 5cfbb3c)
@cheddar : Ping! Last I heard you were pretty good with this one overall but had some small outstanding comments on the PR. I think those have been addressed.
Seems fine to me
* Requires druid-io/druid-api#37
* Requires metamx/java-util#22
* Moves the puller logic to use a more standard workflow going through java-util helpers instead of re-writing the handlers for each impl
* General workflow goes like this:
  1) LoadSpec makes sure the correct Puller is called with the correct parameters.
  2) The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files, for example), and when to retry.
  3) CompressionUtils does most of the heavy lifting when it can
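The three-step workflow in the commit message can be wired together roughly as below. The minimal interfaces here (`Puller` with `openStream`/`fileName`, a `store` helper standing in for CompressionUtils) are hypothetical simplifications of the real classes.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;

public class WorkflowSketch {
    interface Puller {
        InputStream openStream() throws Exception; // step 2: how to read
        String fileName();                         // step 2: what to call it
    }

    // Stand-in for CompressionUtils (step 3): shovels the stream into the
    // directory; the real helper would also handle zip/gz containers here.
    static File store(Puller puller, File destDir) throws Exception {
        File out = new File(destDir, puller.fileName());
        try (InputStream in = puller.openStream()) {
            Files.copy(in, out.toPath());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Step 1 (LoadSpec) would choose and parameterize this Puller from
        // the loadSpec JSON; here we fake one backed by an in-memory byte.
        Puller puller = new Puller() {
            public InputStream openStream() { return new ByteArrayInputStream(new byte[]{42}); }
            public String fileName() { return "segment.bin"; }
        };
        File dir = Files.createTempDirectory("pull").toFile();
        File out = store(puller, dir);
        System.out.println(out.getName() + ":" + out.length()); // segment.bin:1
    }
}
```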
(force-pushed 5cfbb3c to 6d407e8)
The goal of this is to make the segment pullers a little more general-use. Specifically, this uses URIs when possible; URIs have a standard, defined format (unlike the free-form loadSpec used previously). This also unifies a lot of the way segments are pulled, so that any datasource should be able to support a zip or gz request in a more standard manner.
General workflow goes like this:
Requires "Add URI Handlers and LoadSpec. Deprecate CompressionUtils" (druid-io/druid-api#37, merged).
Requires "Add Compression and File utilities for Zip and GZ handling" (metamx/java-util#22, merged).
I also want to try this on some S3 segments just to make sure it didn't break anything unexpected, but would like feedback on the approach first. Tested, seems to be fine.
This MIGHT be subject to http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144. Added unit tests; adding more to double-check. Fixed with a workaround in java-util.