
insert-segment tool #1861

Merged
fjy merged 1 commit into apache:master from guobingkun:insert_segment_tool
Dec 29, 2015


@guobingkun
Contributor

This tool loads segments into Druid by inserting each segment's payload into metadata storage. It can be used to migrate segments to a different deep storage, or to recover segments, as long as the segments are still stored in deep storage.

Usage example:
java -Ddruid.extensions.loadList=[\"mysql-metadata-storage\",\"druid-hdfs-storage\"] -cp $CLASSPATH io.druid.cli.Main tools insert-segment --workingDir hdfs://tmp/druid/localStorage/wikipedia/

Suppose the directory layout under wikipedia looks like this:

├── 2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z
│   └── 2015-10-21T22:07:57.074Z
│       └── 0
│           ├── descriptor.json
│           └── index.zip
├── 2013-09-01T00:00:00.000Z_2013-09-02T00:00:00.000Z
│   └── 2015-10-21T22:07:57.074Z
│       └── 0
│           ├── descriptor.json
│           └── index.zip
├── 2013-09-02T00:00:00.000Z_2013-09-03T00:00:00.000Z
│   └── 2015-10-21T22:07:57.074Z
│       └── 0
│           ├── descriptor.json
│           └── index.zip
└── 2013-09-03T00:00:00.000Z_2013-09-04T00:00:00.000Z
    └── 2015-10-21T22:07:57.074Z
        └── 0
            ├── descriptor.json
            └── index.zip

These 4 segments will then be found and inserted into metadata storage, and each segment's loadSpec will be updated with the location where it was found.

This PR depends on druid-io/druid-api#62
DataSegmentFinder is used so that this tool can work on different deep storages.
An HDFS version is implemented in this PR.
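
As a rough illustration of the discovery step described above, here is a minimal sketch that walks a local directory tree and pairs each descriptor.json with its sibling index.zip. It is not the DataSegmentFinder interface from druid-api or the HDFS implementation in this PR; the class name `SegmentDirScanSketch`, the `FoundSegment` type, and the use of `java.nio.file` on a local filesystem are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in for a segment finder.
class SegmentDirScanSketch {

  /** A discovered segment: its descriptor and the index.zip a loadSpec would point to. */
  static final class FoundSegment {
    final Path descriptor;
    final Path indexZip;

    FoundSegment(Path descriptor, Path indexZip) {
      this.descriptor = descriptor;
      this.indexZip = indexZip;
    }
  }

  /** Recursively finds every descriptor.json that has an index.zip next to it. */
  static List<FoundSegment> findSegments(Path workingDir) {
    try (Stream<Path> paths = Files.walk(workingDir)) {
      return paths
          .filter(p -> p.getFileName() != null
              && p.getFileName().toString().equals("descriptor.json"))
          .map(p -> new FoundSegment(p, p.resolveSibling("index.zip")))
          // Skip descriptors whose index.zip is missing.
          .filter(s -> Files.isRegularFile(s.indexZip))
          .sorted(Comparator.comparing((FoundSegment s) -> s.descriptor.toString()))
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path workingDir = Paths.get(args.length > 0 ? args[0] : ".");
    for (FoundSegment s : findSegments(workingDir)) {
      System.out.println(s.descriptor + " -> " + s.indexZip);
    }
  }
}
```

Run against the wikipedia directory shown above, a scan like this would be expected to report one entry per `0` partition directory; the real tool additionally rewrites the loadSpec and inserts the payload into metadata storage, which this sketch does not attempt.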

@fjy
Contributor

fjy commented Oct 26, 2015

Can we add some documentation in the Druid docs about using this tool?

@guobingkun
Contributor Author

@fjy yeah, I am working on it.

Contributor

do we need to do this if updateDescriptor is set to false?

Contributor

Never mind, it seems that is needed for the DB update later.

Contributor

Wouldn't indexZip.toString() include hdfs://host:port as well? I would just put the absolute path.

Contributor Author

I made some changes so that it only puts the relative path (without the storage scheme prefix) in the loadSpec.
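
To illustrate the idea of dropping the scheme and host:port from a location, here is a minimal sketch using `java.net.URI` from the standard library. The PR itself works with Hadoop path objects, so the class name `LoadSpecPathSketch`, the helper `stripSchemeAndAuthority`, and the example host `namenode:8020` are assumptions for illustration only.

```java
import java.net.URI;

// Hypothetical helper; not the code from this PR.
class LoadSpecPathSketch {

  /** Returns only the path component, dropping the scheme and authority (host:port). */
  static String stripSchemeAndAuthority(String location) {
    return URI.create(location).getPath();
  }

  public static void main(String[] args) {
    // Two clusters, same path: both yield the same loadSpec path.
    System.out.println(stripSchemeAndAuthority("hdfs://namenode:8020/druid/segments/index.zip"));
    System.out.println(stripSchemeAndAuthority("hdfs://other-host:9000/druid/segments/index.zip"));
  }
}
```

Because the stored path is independent of the cluster address, two HDFS clusters that keep segments under the same path would share the same loadSpec value.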

@drcrallen
Contributor

This tool does not interact with interval locking, right? That means you may end up with nasty race conditions if it is used on an active cluster.

Contributor

Why compare Strings rather than the DataSegment objects themselves, which would be more reliable?

Contributor Author

I was comparing DataSegment objects, but then found that the implementation of equals() in DataSegment only compares the identifier, so it returns true as long as two segments have the same identifier.
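
The pitfall described here can be shown with a small sketch. The `Segment` class below is a hypothetical stand-in, not Druid's DataSegment; its fields, the hand-rolled `serialize()` method, and the `descriptorNeedsUpdate` helper are assumptions chosen to show why an identifier-only equals() cannot detect a changed loadSpec while a comparison of serialized strings can.

```java
// Hypothetical stand-in for a segment; not Druid's DataSegment class.
class SegmentEqualitySketch {

  static final class Segment {
    final String identifier;
    final String loadSpecPath;

    Segment(String identifier, String loadSpecPath) {
      this.identifier = identifier;
      this.loadSpecPath = loadSpecPath;
    }

    // Equality is defined by the identifier alone, as described in the thread.
    @Override
    public boolean equals(Object o) {
      return o instanceof Segment && identifier.equals(((Segment) o).identifier);
    }

    @Override
    public int hashCode() {
      return identifier.hashCode();
    }

    // Full serialized form, including the loadSpec.
    String serialize() {
      return "{\"identifier\":\"" + identifier
          + "\",\"loadSpec\":{\"path\":\"" + loadSpecPath + "\"}}";
    }
  }

  /** True when the stored descriptor differs from what was found on disk. */
  static boolean descriptorNeedsUpdate(Segment stored, Segment found) {
    return !stored.serialize().equals(found.serialize());
  }

  public static void main(String[] args) {
    Segment stored = new Segment("wikipedia_2013-08-31", "/old/location/index.zip");
    Segment found = new Segment("wikipedia_2013-08-31", "/new/location/index.zip");
    System.out.println("equals(): " + stored.equals(found));
    System.out.println("needs update: " + descriptorNeedsUpdate(stored, found));
  }
}
```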

@himanshug
Contributor

@drcrallen I think this tool is intended for situations where someone wants to manually migrate data from one place to another, or to rebuild the segments table in the metadata store. In this use case, we would expect the user to have the Druid cluster in a safe mode (with no active tasks to interfere, or with the cluster brought down entirely).

Adding locking to this code would increase complexity and would also require an overlord to be running for the tool to work.

Contributor

The description should also say something about the possible update of descriptor.json on the filesystem.
Can we change the name to insert-segments-to-db?

Contributor Author

Done changing the name to insert-segments-to-db

@drcrallen
Contributor

@himanshug OK, we just need to make sure that is clarified in the docs.

Member

Can we rename mysql to metadata storage credentials?

Contributor Author

Done renaming.

@guobingkun
Contributor Author

@fjy @drcrallen @nishantmonu51 Added a doc and emphasized the correct use case in it.
I also made some changes so that it only puts the relative HDFS path in the loadSpec. This way there is no need to update the database if segments are migrated from one HDFS to another (assuming the relative path doesn't change).

Tested this tool with 5436 segments; it completed in 5 minutes.

Comment thread pom.xml Outdated
Contributor Author

Will update this version once the new druid-api is released.

@drcrallen drcrallen added this to the 0.9.0 milestone Dec 1, 2015
@guobingkun guobingkun closed this Dec 14, 2015
@guobingkun guobingkun reopened this Dec 14, 2015
@guobingkun guobingkun force-pushed the insert_segment_tool branch 2 times, most recently from 1421fc3 to 1621866 Compare December 15, 2015 21:50
@guobingkun guobingkun closed this Dec 16, 2015
@guobingkun guobingkun reopened this Dec 16, 2015
Contributor

This doc is not linked anywhere; how are users expected to find it? Can you add it to the "operations" section in the toc file?

Contributor Author

Added into "operations" section.

@himanshug
Contributor

👍 after #1861 (comment) is resolved.

Comment thread docs/content/toc.textile Outdated
Contributor

I don't like this TOC heading, or this getting its own section.

I'd prefer that we rework the libraries section for Druid and have that page link to the doc here.

Contributor

for now, can we add a link to this tool in the libraries section?

Contributor Author

Done adding a link to this tool in the libraries section.

5 participants