Add Hadoop Converter Job and task #1351
Conversation
I'm not sure I understand why we need a new module here?
DerbyEmbeddedConnector uses org.apache.derby.jdbc.EmbeddedDriver instead of the client driver. Another alternative is to have a setting that allows the configuration of the driver class.
I managed to eliminate it and add a new test @Rule for anything that needs a metadata connector.
If I'm understanding this correctly, this looks like a way to run a conversion task over Hadoop MR. That is, logically, it is executing tasks using Hadoop MR (and ultimately YARN) as the task manager instead of using whatever the Indexing Service is using. I'm not against this, but I also kinda wonder if we shouldn't make a generic "run tasks on hadoop" job and then have the conversion task be a part of that?

@cheddar : Yes, that is a longer-term goal. Some of the aspects of this PR will be combined with the hadoop task once this has proven stable. A future goal would be to make a better task container that can run on the indexing service, yarn, or mesos.

I had discussed with @xvrl briefly about what to do regarding the metadata update. I'm in agreement with his point of view that only allowing the task to work as an indexing service task (and NOT as a standalone hadoop job) would greatly simplify things overall. @cheddar : is there any objection to simply having this as ONLY an indexing task? (EDIT: an indexing task which spawns a hadoop job, meaning it requires the indexing service to run on Hadoop)

The existing conversion task is also not standalone, so I don't see why this one would need to be. It would also greatly simplify this PR to remove all the metadata-related stuff, since that's unrelated to the task itself if we remove the standalone option.
can we reuse code from JobHelper here ?
No, not without modification. JobHelper relies exclusively on HadoopDruidIndexerConfig rather than on a more abstract interface.
I tried not to touch any existing hadoop codepaths until this is stable because hadoop is incredibly picky.
@drcrallen it should be easy to reuse the JobHelper code for this. We can simply change the JobHelper.setupClassPath method to pass the workingPath as opposed to the entire config.
I'll be happy to set it, but I tried to touch the existing hadoop stuff as little as possible.
If this PR is generally agreed upon and it proves pretty stable in our tests then I'll be happy to migrate existing hadoop stuff over to this PR's "framework" of doing things.
I would rather use the existing code unless there is a good reason to rewrite things. Sometimes little changes can make a big difference, and for things that can be re-used I would prefer we take the old code, unless there is a good reason to rewrite it or it would be difficult to refactor.
I'm just going to refactor it as part of this PR.
Splitting into its own PR
Fixed, with some more common stuff factored out.
Travis failed due to #1393; restarting.
can we just call this(null, null, null, ...) to make it clear the new constructor is the main one?
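The pattern being suggested can be sketched as follows. This is a hypothetical illustration of the delegating-constructor idiom, not the PR's actual task class; all class, field, and method names here are made up:

```java
// Hypothetical sketch of the review suggestion: convenience constructors
// delegate to a single primary constructor via this(null, ...), making it
// obvious which constructor is canonical and where defaults are applied.
class ConverterTask {
    private final String id;
    private final String dataSource;
    private final String hadoopCoordinates;

    // The primary constructor; every other constructor funnels through here.
    ConverterTask(String id, String dataSource, String hadoopCoordinates) {
        this.id = (id == null) ? "converter_" + dataSource : id;
        this.dataSource = dataSource;
        this.hadoopCoordinates = hadoopCoordinates;
    }

    // Convenience constructor: the explicit null arguments make it clear
    // that all defaulting happens in the primary constructor above.
    ConverterTask(String dataSource) {
        this(null, dataSource, null);
    }

    String getId() { return id; }
}
```

Calling `this(null, null, null, ...)` as the reviewer suggests keeps the defaulting logic in exactly one place.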
can we comment these as noop as well?
it is more conventional to implement "org.apache.hadoop.io.Writable" as well, so that you don't have to create and set up the serde separately. I think that will reduce some code (DataSegmentSplitSerializer) and the setupSerializers(..) method.
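For context, Hadoop's org.apache.hadoop.io.Writable contract is exactly two methods: write(DataOutput) and readFields(DataInput). The sketch below shows that contract using only JDK types, so the round trip can run without Hadoop on the classpath; the class and field names are illustrative, not the PR's actual split type:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of the Writable contract suggested in review. Implementing these
// two methods on the split type lets Hadoop serialize it directly, without
// a separately registered serde.
class SegmentSplitSketch {
    String segmentId;
    long size;

    // Corresponds to Writable#write(DataOutput): serialize fields in order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(segmentId);
        out.writeLong(size);
    }

    // Corresponds to Writable#readFields(DataInput): read fields back in
    // the same order they were written.
    void readFields(DataInput in) throws IOException {
        segmentId = in.readUTF();
        size = in.readLong();
    }
}
```

The symmetry of the two methods (same fields, same order) is what makes the round trip lossless.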
how much of this method is copypasta from guava?
Zero; it is actually because com.metamx.common.CompressionUtils uses ByteSource and ByteSink in the "useful" methods. Also, com.google.common.io.ByteStreams#copy(java.io.InputStream, java.io.OutputStream) is used quite a bit in there.
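For context on the utility being discussed: Guava's ByteStreams.copy is essentially an InputStream-to-OutputStream pump. A minimal sketch using only the JDK (InputStream#transferTo, available since Java 9, does the same job):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative only: what ByteStreams.copy(InputStream, OutputStream) does,
// expressed via the JDK 9+ equivalent InputStream#transferTo.
class CopySketch {
    // Copies all bytes from in to out; returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        return in.transferTo(out);
    }
}
```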
derbynet in the parent pom already depends on derby, so we can just use derbynet and remove the version here.
It was only needed when the Derby server rule was present; since that has vanished, this can too. Will fix.
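A sketch of the dependency declaration the reviewer is suggesting, assuming (as the comment states) that the parent pom manages the derbynet version; derby then arrives transitively:

```xml
<!-- Hypothetical sketch: declare derbynet without an explicit version
     (managed by the parent pom) and let it pull in derby transitively. -->
<dependency>
  <groupId>org.apache.derby</groupId>
  <artifactId>derbynet</artifactId>
</dependency>
```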
* Fixes apache#1363
* Add extra utils in JobHelper based on PR feedback
+1
Add Hadoop Converter Job and task
Conversations directly with FJ and others who have commented on this PR show that everyone is generally ok with this, so I went ahead and merged.
The following should merge first:
#1367 (done)
#1366 (done)
#1428 (done)
Then I'll rebase this one.