ability to build and load custom segment in realtime node#4448

Closed
kaijianding wants to merge 3 commits into apache:master from kaijianding:customSegment

@kaijianding
Contributor

This is a follow-up to #2965 and #3901.

  1. Introduce sinkFactory so that a custom sink can handle events, for example without using onHeapIncrementalIndex to hold them, while still using the storageAdapter interface at query time.
  2. Introduce customIndexMerger to build a custom segment on persist and merge.
  3. Load custom segments in bootstrapSinksFromDisk and after a successful persist.

@kaijianding kaijianding force-pushed the customSegment branch 2 times, most recently from 618ab1a to 7080bbb Compare June 22, 2017 19:16
@leventov
Member

@kaijianding you are doing a lot of PRs, could you also review other people's PRs? And then become a committer?

@kaijianding
Contributor Author

Sure, I will review others' PRs, and I'm glad to hear that I'm close to being a committer @leventov

@leventov
Member

@kaijianding I'm not sure about closeness and it's not for me to decide because I'm not a PMC member, but making reviews is essential because it's a painful aspect in the project that not enough reviews are done. When you are making a lot of PRs you demand reviews but not "contribute them back".

@KurtYoung KurtYoung self-assigned this Jun 23, 2017
@KurtYoung
Contributor

I will try to review this. Sorry for not having much time to do the review.

@gianm
Contributor

gianm commented Jun 23, 2017

cc @cheddar @pjain1 who may be able to help review as well due to previous work in the area.

@pjain1
Member

pjain1 commented Jun 23, 2017

I can give it a try next week

@kaijianding kaijianding force-pushed the customSegment branch 19 times, most recently from b27a368 to d31abb7 Compare June 27, 2017 06:01
@leventov leventov self-assigned this Jun 28, 2017
Contributor

What is the difference between loadSegment().asQueryableIndex() and just loadIndex()?
I noticed there are still plenty of usages of loadIndex(); should they also be changed?

Contributor Author

Every place that wants to load an index from a file should call loadSegment instead of loadIndex. Inside loadSegment, the customized segmentizerFactory maps the file to an index.
Most usages of loadIndex() are in tests and benchmarks, so I didn't modify them.
I will double check the remaining usages to make sure every call to loadIndex is replaced by loadSegment().

Contributor

@KurtYoung KurtYoung Jun 28, 2017

If loadSegment is the only correct way to load a segment, then I would suggest deleting the `loadIndex` method.

Contributor Author

I rewrote indexIO.loadIndex() to use segmentizerFactory to load the index from file, so there is no longer any need to modify all the places that call indexIO.loadIndex().
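
To illustrate the delegation described in this reply, here is a minimal sketch. It is not Druid's code: `QueryableIndexSketch`, `SegmentizerFactorySketch`, `IndexIOSketch`, and the rule of picking a factory from the directory-name prefix are all hypothetical simplifications. Only the shape `getSegmentizerFactory(inDir).loadIndex(inDir)` is taken from the diff quoted later in this review.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a loaded, queryable index.
interface QueryableIndexSketch {
  String describe();
}

// Hypothetical stand-in for a factory that maps a segment directory to an index.
interface SegmentizerFactorySketch {
  QueryableIndexSketch loadIndex(File inDir);
}

final class IndexIOSketch {
  private final Map<String, SegmentizerFactorySketch> factories = new HashMap<>();
  private final SegmentizerFactorySketch defaultFactory;

  IndexIOSketch(SegmentizerFactorySketch defaultFactory) {
    this.defaultFactory = defaultFactory;
  }

  void register(String type, SegmentizerFactorySketch factory) {
    factories.put(type, factory);
  }

  // Assumption for illustration only: the segment type is the directory-name
  // prefix before the first '-'. Unknown or missing types fall back to the default.
  SegmentizerFactorySketch getSegmentizerFactory(File inDir) {
    String name = inDir.getName();
    int dash = name.indexOf('-');
    String type = dash < 0 ? "" : name.substring(0, dash);
    return factories.getOrDefault(type, defaultFactory);
  }

  // Existing callers keep calling loadIndex(); the factory decides how to load.
  QueryableIndexSketch loadIndex(File inDir) {
    return getSegmentizerFactory(inDir).loadIndex(inDir);
  }
}
```

With this shape, call sites in tests and benchmarks stay unchanged, while a custom segment type only has to register its own factory.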

Contributor

Do we need this?

Contributor Author

yes, otherwise the test will fail.

  public AlertBuilder makeAlert(Throwable t, String message, Object... objects)
  {
    if (emitter == null) {
      final String errorMessage = String.format(
          "Emitter not initialized!  Cannot alert.  Please make sure to call %s.registerEmitter()", this.getClass()
      );

the emitter is registered in EmittingLogger.registerEmitter

Contributor

Can you double check this? I commented out this line and the test still passed. BTW, why didn't this test fail before you added this, and why would the test case enter the alert-making logic at all?

Contributor Author

After rewriting indexIO.loadIndex(), there is no need to modify this anymore.

Contributor

Do we need this?

Contributor Author

Yes, a customIndexMerger may need to use Metadata, as my customIndexMerger implementation does.

Contributor

Can we hold off on this change until we see a real customized index merger?

Contributor Author

ok

Contributor

Do we need this?

Contributor Author

My implementation subclasses SmooshedFileMapper, so the class needs to be extensible.

Contributor

Same as above.

Member

@kaijianding SmooshedFileMapper is not a part of public api, and could be changed anytime so that your subclass will break. Worse, the behaviour may break, however without compilation errors, so you won't notice it. So if you want something similar to SmooshedFileMapper in your custom SegmentizerFactory, I strongly suggest to copy SmooshedFileMapper entirely into your codebase and tailor it for your needs.

Contributor Author

I want to use V9IndexLoader.load(SmooshedFileMapper smooshedFiles, File inDir, ObjectMapper mapper), and this method requires a SmooshedFileMapper.
It's fine that SmooshedFileMapper can change behavior at any time. I extend SmooshedFileMapper in my extension and will have unit tests to ensure everything works properly.

Contributor

Is this related to your proposed change?

Contributor Author

yes, this is used to load persisted segment in realtime node

Contributor Author

change it to loadIndex()

Contributor

Looks like this is some kind of performance improvement?

Contributor Author

This is a cleanup of StringDimensionMergerV9.java. This part only needs dimValuesList, but the method requires more information (the adapters) than it actually uses.

Contributor

So this is unrelated to this PR, right? Can we separate this to another PR?

Contributor

@KurtYoung KurtYoung left a comment

Looks like this PR contains 3 different changes:

  1. make SegmentizerFactory work in some cases
  2. introduce sinkFactory and customIndexMerger
  3. some performance improvements

I think these 3 things are completely independent, can you split this PR into three?

@kaijianding
Contributor Author

The purpose of this PR is to provide the ability to build and load customized segments. SinkFactory and customizeIndexMerger are used to build the customized segment, and indexIo.loadSegment() loads it, so I think the 3 modifications are better kept in one PR. @KurtYoung

@KurtYoung
Contributor

Makes sense, except for the third point. Overall it looks good to me. @leventov Do you have any design concerns?

@kaijianding kaijianding force-pushed the customSegment branch 4 times, most recently from 7053cf0 to 4f82eb8 Compare June 28, 2017 18:47
@Deprecated
private final long handoffConditionTimeout;
private final boolean resetOffsetAutomatically;
private final SinkFactory sinkFactory;
Member

Please extract an umbrella abstraction like "SegmentStrategy" with createSinkFactory() and createIndexMerger() methods, and inject/serialize/deserialize only it, here and in other places.

Contributor Author

Sure.
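
A rough sketch of what the suggested umbrella abstraction could look like. Only the name "SegmentStrategy" and the two method names `createSinkFactory()` and `createIndexMerger()` come from the review comment. Everything else here (`SinkFactorySketch`, `IndexMergerSketch`, `DefaultSegmentStrategy`, `StrategyConfigSketch`, and the null fallback) is a hypothetical placeholder, not code from the patch.

```java
// Hypothetical, heavily simplified stand-ins for the real sink factory and index merger types.
interface SinkFactorySketch {
  String name();
}

interface IndexMergerSketch {
  String name();
}

// The umbrella abstraction: one injectable/serializable object instead of two.
interface SegmentStrategy {
  SinkFactorySketch createSinkFactory();

  IndexMergerSketch createIndexMerger();
}

// Assumed default that would preserve the existing behavior.
final class DefaultSegmentStrategy implements SegmentStrategy {
  @Override
  public SinkFactorySketch createSinkFactory() {
    return () -> "defaultSink";
  }

  @Override
  public IndexMergerSketch createIndexMerger() {
    return () -> "indexMergerV9";
  }
}

// A config object then carries a single strategy field rather than separate
// sinkFactory and customIndexMerger fields.
final class StrategyConfigSketch {
  private final SegmentStrategy segmentStrategy;

  StrategyConfigSketch(SegmentStrategy segmentStrategy) {
    // Fall back to the default when nothing was configured.
    this.segmentStrategy = segmentStrategy == null ? new DefaultSegmentStrategy() : segmentStrategy;
  }

  SegmentStrategy getSegmentStrategy() {
    return segmentStrategy;
  }
}
```

The point of the design is that equals(), hashCode(), and serialization code in each config class only has to deal with one extra field.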

}
return indexSpec != null ? indexSpec.equals(that.indexSpec) : that.indexSpec == null;

if (indexSpec != null ? !indexSpec.equals(that.indexSpec) : that.indexSpec != null) {
Member

Refactor to use Objects.equals()

Contributor Author

It's generated by IntelliJ IDEA, but sure, I will modify it.

Member

@leventov leventov Jun 30, 2017

You can generate the forms that I suggested with IntelliJ as well; you need to choose another option in the generation dialog. Once you choose it, it becomes the default from then on.
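
For reference, a small sketch of why the refactoring is safe: the null-checking ternary from the diff and `java.util.Objects.equals()` give the same answer for every null/non-null combination. The class and method names below are made up for illustration.

```java
import java.util.Objects;

final class NullSafeEqualsSketch {
  private NullSafeEqualsSketch() {}

  // The IDE-generated form quoted in the diff above.
  static boolean ternaryEquals(Object a, Object b) {
    return a != null ? a.equals(b) : b == null;
  }

  // The suggested replacement: true if both are null,
  // false if exactly one is null, otherwise a.equals(b).
  static boolean objectsEquals(Object a, Object b) {
    return Objects.equals(a, b);
  }
}
```

In an `equals()` method this turns each field comparison into a single `Objects.equals(indexSpec, that.indexSpec)` term that can be chained with `&&`.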

result = 31 * result + (reportParseExceptions ? 1 : 0);
result = 31 * result + (int) (handoffConditionTimeout ^ (handoffConditionTimeout >>> 32));
result = 31 * result + (resetOffsetAutomatically ? 1 : 0);
result = 31 * result + (sinkFactory != null ? sinkFactory.hashCode() : 0);
Member

Refactor to use Objects.hash()

Contributor Author

sure
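
A minimal sketch of the suggested `hashCode()` form, using the four fields visible in the diff above. `HashCodeSketch` is a hypothetical class, and `sinkFactory` is typed as `Object` here only to keep the example self-contained.

```java
import java.util.Objects;

final class HashCodeSketch {
  private final boolean reportParseExceptions;
  private final long handoffConditionTimeout;
  private final boolean resetOffsetAutomatically;
  private final Object sinkFactory; // stand-in for the real type; may be null

  HashCodeSketch(
      boolean reportParseExceptions,
      long handoffConditionTimeout,
      boolean resetOffsetAutomatically,
      Object sinkFactory
  ) {
    this.reportParseExceptions = reportParseExceptions;
    this.handoffConditionTimeout = handoffConditionTimeout;
    this.resetOffsetAutomatically = resetOffsetAutomatically;
    this.sinkFactory = sinkFactory;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    HashCodeSketch that = (HashCodeSketch) o;
    return reportParseExceptions == that.reportParseExceptions
           && handoffConditionTimeout == that.handoffConditionTimeout
           && resetOffsetAutomatically == that.resetOffsetAutomatically
           && Objects.equals(sinkFactory, that.sinkFactory);
  }

  @Override
  public int hashCode() {
    // Objects.hash() boxes primitives and treats null as 0, so no manual null check is needed.
    return Objects.hash(reportParseExceptions, handoffConditionTimeout, resetOffsetAutomatically, sinkFactory);
  }
}
```

Note that the numeric hash values differ from the hand-rolled version (for example, `Boolean.hashCode()` is 1231 or 1237 rather than 1 or 0), which only matters if hash values are persisted or compared across versions.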

// Set of spilled segments. Will be merged at the end.
final Set<File> spilled = Sets.newHashSet();

// IndexMerger implementation.
Member

Not useful comment

final Set<File> spilled = Sets.newHashSet();

// IndexMerger implementation.
final IndexMerger theIndexMerger = config.getCustomIndexMerger() != null
Member

Regarding null handling - doesn't Jackson handle it up front, when you specify defaultImpl=... on the IndexMerger interface (or SegmentStrategy for that matter, if you apply suggestion above)?

Contributor Author

Usually the default implementation is provided by the class context, like indexMergerV9 here. The TaskToolbox also has a default implementation, via toolbox.getIndexMergerV9().
So I think it's better to keep the old code here for obtaining the default implementation.

}

static class V9IndexLoader implements IndexLoader
public static class V9IndexLoader implements IndexLoader
Member

Why public?

Contributor Author

see comments below

{
return getSegmentizerFactory(inDir).loadIndex(inDir);
}

Member

Readers and users will never know the difference between loadIndex() and loadIndexDirectly(). Maybe move loadIndexDirectly() to MMappedQueryableSegmentizerFactory. Along with some other methods.

Contributor Author

@kaijianding kaijianding Jun 30, 2017

OK, but a lot of related code would have to be moved too.

@JsonSubTypes(value = {
@JsonSubTypes.Type(name = "v9", value = IndexMergerV9.class)
})
@ImplementedBy(IndexMergerV9.class)
Member

@ImplementedBy is likely not needed since Json defaultImpl is added.

Contributor Author

@ImplementedBy is a Guice thing. I'm not sure injection would still work properly after removing it in the non-JSON case.

setupEncodedValueWriter();
}

protected List<Indexed<String>> toDimValuesList(List<IndexableAdapter> adapters) throws IOException
Member

As far as I can understand you changed this class in order to be able to extend it in your code. See comment to SmooshedFileMapper, don't do this. Copy class entirely into your codebase.

@@ -174,7 +174,7 @@ public static final AggregationTestHelper createSelectQueryAggregationTestHelper
new InjectableValues.Std().addValue(
Member

Formatting

new InjectableValues.Std()
  .addValue(SelectQueryConfig.class, new SelectQueryConfig(true))
  .addValue(IndexIO.class, TestHelper.getTestIndexIO())

@gianm
Contributor

gianm commented Jun 30, 2017

Just a drive by comment on the public api stuff. I believe none of the segment-type extension points are public apis at this point (see http://druid.io/docs/latest/development/modules.html) and that's because we're not yet entirely sure what custom segment extensions should look like and what support they might need from core druid.

IMO, if @kaijianding is ok with this then it's fine for his custom extensions to use internal apis. It's part of the learning process of what should and shouldn't be public for these kinds of extensions. At some point in the future, we might have a better idea of what should be public, and then we can make those apis public.

@kaijianding
Contributor Author

I don't build the custom segment from scratch; instead I wrap the existing code, such as MetaData, Smoosh, the V9 indexMerger and V9 loader, the sink, and onHeapIncrementalIndex.

I modified the structure of the final smooshed file, but I didn't modify the v9 format itself. I then need to access metadata.drd and use the v9 loader to load the index from the smooshed file from position P1 to Pn. That's why I want MetaData and IndexLoader to be public, so that I can use them in my extension to load the custom segment from file. @gianm @leventov

@leventov
Member

@kaijianding I'm ok with just making (or keeping) classes and constructors public, but I'm not OK with adding properties and fields which are unused in the core codebase and protected methods to classes that have no subclasses in the core codebase. It increases complexity of the core for nothing. And it's not a step towards modular design, the ultimate goal of this PR as well.

@gianm
Contributor

gianm commented Jun 30, 2017

Are there a lot of properties and fields in the patch that don't need to be protected or even exist at all? (Sorry, I haven't had a chance to read it yet). I'm hoping we can find a balance between keeping the core clean, and making the extension possible.

@kaijianding
Contributor Author

Fine, I will revert some of the changes and reintroduce them when I provide an actual custom implementation @leventov

@stale

stale Bot commented Feb 28, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Feb 28, 2019
@stale

stale Bot commented Mar 7, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale Bot closed this Mar 7, 2019