Use unique segment paths for Kafka indexing #5692
gianm merged 6 commits into apache:master from dclim:unique-segments
Conversation
What do you think about receiving a suffix as a parameter and making the caller responsible for passing a unique suffix (like the number of task attempts)? It would be safer than a short random UUID.
Hm, I can go either way on that. I actually think it would be less safe to have the suffix passed as a parameter, since a caller may not in general understand all the possible failure modes and may provide a non-random suffix that doesn't actually help the issue, and also may not know which characters are valid for the suffix on a given filesystem (though we could validate for this). The advantage would be that all segments generated by a particular task could be tagged with the same prefix for tracking, if that's something interesting to us. A UUID chopped to 5 characters still has over 1 million possibilities, so I'm not too worried about collisions.
uploadBlob() still has the replaceExisting parameter and this is always true. Is this intentional?
It was intentional, yes, but now I'm thinking it might be better to just remove it. Honestly, I don't even know how effective 'don't overwrite' semantics that depend on an 'if object exists' check are for each deep storage implementation, since they are generally eventually consistent (which is why I retained overwrite behavior for most of them). I'll remove.
Would you add a comment noting that useUniquePath is used in push() instead of here?
Also, please add a check that useUniquePath is always false.
Please add a check that useUniquePath is always false.
I opted to remove useUniquePath from here and left a comment that this is only used by Hadoop indexing and never uses unique paths.
gianm left a comment
In addition to the individual line comments, DataSegmentFinders should be updated too. Clearly, we cannot know which segment to insert if there are multiple options. I think in that case we should log a warning and take the newest one.
The "insert-segment-to-db.md" doc needs to make it very clear that it is not necessarily going to be a perfect import, and that it is a tool only to be used as a last resort.
Could you add some javadocs for kill and killQuietly? It's semi-obvious what the difference is, but IMO not obvious enough to avoid docs.
But do this for 100,000 segments and all of a sudden the odds are high that there will be a collision for some segment. The birthday problem strikes again!
Embrace the full UUID.
Well, for there to be a collision, it has to happen as part of one of the failure scenarios, meaning that old shards from jobs run a long time ago aren't going to be involved in any collisions, even if there are millions of them. The scenarios would be things like task replicas both pushing to S3 at the same time, or a Kafka task failing after pushing or publishing and the replacement task pushing the same segment IDs, and in those cases, you'd have to have configured something really wrong to be pushing 100k segments from an indexing task.
But if you think it's still safer, I'm okay with using the full UUID; just trying to do my part to keep people from hitting their PATH_MAX or whatever on their file system.
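The birthday-problem odds being debated here can be checked with a quick back-of-the-envelope calculation. This is an illustrative standalone program, not Druid code; it assumes a suffix of 5 hex characters drawn from a UUID, giving 16^5 (about 1.05 million) possible values:

```java
public class SuffixCollisionOdds {
    public static void main(String[] args) {
        // 5 hex characters from a UUID -> 16^5 possible suffixes.
        final double space = Math.pow(16, 5);
        // Birthday-problem approximation for the probability of at least one
        // collision among n independently chosen suffixes:
        //   P(collision) ≈ 1 - exp(-n(n-1) / (2 * space))
        for (int n : new int[] {100, 1_000, 100_000}) {
            double p = 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * space));
            System.out.printf("n=%,d -> P(collision) ≈ %.4f%n", n, p);
        }
    }
}
```

With only a hundred concurrent suffixes the odds are well under 1%, but at 100,000 a collision is essentially certain, which is the point gianm is making.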
I'm thinking of the situation where you have two replicas and they will always push to S3 around the same time (which seems likely). Even in a non-failure scenario there will still be two pushes going on at once. In that case it's likely that some pair of replicas will collide with each other.
I'm thinking of the situation where you have two replicas and they will always push to S3 around the same time (which seems likely).
Ah, I missed this case. If so, what do you think about receiving a suffix as a parameter and making the caller responsible for passing a unique suffix, as commented above?
I'd generally prefer random instead of something like an attempt id. It's because with attempt ids, you have to worry about not assigning two tasks the same attempt id by accident, and we don't have a concept like this right now. With big random numbers you know for sure that you are ok, without needing to think about it too hard.
Okay sounds good to me
It looks like this is OK because this method will only be called by Hadoop M/R indexing, which won't set useUniquePath. It would be good to note that in a comment, so people don't get confused looking at this.
I don't think HdfsDataSegmentPusher.getStorageDir() is just called for Hadoop M/R; it's only called from HdfsDataSegmentPusher.push() (which can be used by anything pushing to HDFS as deep storage). But HdfsDataSegmentPusher.push() will always set this to false, since any 'uniqueness' is applied not to the directory but to the filename along with the shard (IIRC this was because excessive directories in HDFS caused performance issues).
Ah, I meant it's only called directly by Hadoop M/R. Anyway this is great analysis - could you please include it as a comment?
What happens if a deep storage pusher impl doesn't respect useUniquePath? This will likely happen for extensions whose authors don't notice the signature change here. It's probably acceptable if what happens is you just get the old behavior.
Well, this is a little janky, but note that the signature didn't really change; the meaning of that last boolean parameter did. At first I wanted to force a signature change so that implementors would have to acknowledge the change, but as discussed, decided to make it 'backward-compatible' for the point release. So implementors who don't notice the change will get 'replaceExisting=true' behavior for Kafka indexing tasks and 'replaceExisting=false' for all other task types, which seemed reasonable to me (since replaceExisting=true was added primarily for the Kafka indexing task type as well).
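The "same signature, new meaning" compromise described here can be sketched as follows. The names and signatures below are hypothetical, for illustration only; they are not Druid's actual DataSegmentPusher interface:

```java
// Hypothetical sketch, not Druid's real interface.
interface SegmentPusher {
    // The boolean in this position used to mean replaceExisting; it now
    // means useUniquePath. Because the positional signature is unchanged,
    // extensions compiled against the old meaning still compile and link.
    String push(String baseDir, boolean useUniquePath);
}

class LegacyPusher implements SegmentPusher {
    // An extension author who never noticed the semantic change ignores
    // the flag and writes to the base path, i.e. old overwrite behavior.
    // Callers arrange for this to be tolerable: Kafka indexing tasks pass
    // true, all other task types pass false.
    @Override
    public String push(String baseDir, boolean useUniquePath) {
        return baseDir;
    }
}
```

The design trade-off is that a semantic change behind a stable signature degrades gracefully (old behavior) instead of breaking compilation, at the cost of being easy to overlook.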
@dclim not sure how someone would read this. Can you please provide some guidance on how to find those leaked segments and delete them manually?
@b-slim right now, it would involve some scripting, something like: list out the segment directory and find any partition directories that have more than one child directory, and then compare these directory names to the druid_segments table and delete any directory that isn't referenced in the loadSpec of an entry in that table. You'd probably also want to check timestamps and only process dangling segments from some time in the past so you don't inadvertently wipe segments that are pushed but not published.
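The scripting dclim describes can be sketched for a single partition directory. This is a hypothetical illustration (the directory layout and names are assumptions, not Druid's actual structure), and as noted above a real script should also check timestamps so it doesn't delete segments that are pushed but not yet published:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LeakedSegmentFinder {
    /**
     * Given one partition directory, return child directories that are not
     * referenced by any loadSpec in the druid_segments table (the caller is
     * expected to have extracted those referenced names from the metadata
     * store). Partitions with at most one child cannot contain a dangling
     * copy, so they are skipped.
     */
    public static List<Path> findLeaked(Path partitionDir, Set<String> referencedDirNames) throws IOException {
        List<Path> children;
        try (Stream<Path> s = Files.list(partitionDir)) {
            children = s.filter(Files::isDirectory).collect(Collectors.toList());
        }
        if (children.size() <= 1) {
            return new ArrayList<>();
        }
        return children.stream()
            .filter(c -> !referencedDirNames.contains(c.getFileName().toString()))
            .collect(Collectors.toList());
    }
}
```

Running this over every partition directory in the segment root, and intersecting with a cutoff on modification time, approximates the manual cleanup described in the thread.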
@dclim this seems pretty complicated to me; imagine a user who has just adopted Druid.
@b-slim @dclim how about making one change in this patch: kill segments in the "Our segments really do exist, awaiting handoff" path. Then, tasks will clean up after themselves, and the only way 'loose' segments would be lying around is if tasks die before they can clean up. And this can already lead to loose segments even today (if a segment is pushed but not published, it stays in deep storage forever). So it'd be just as good as what we have today.
@dclim and @gianm I am not asking for this as part of this patch, but is it possible to have a persisted pointer (maybe as part of the segment allocations table)? This pointer could be a path to a directory based on a transaction id, used to track committed/aborted handoffs and clean up any aborted segments afterward, even if the task dies before cleaning up after itself.
@gianm changes made, thanks for the review, guys. Good catch on killing segments at "Our segments really do exist, awaiting handoff"; I meant to do that but missed it and only caught the exception case. @b-slim that seems reasonable to me; I think it makes sense, though, to wait and see if anyone complains about this. I think the cases where it will leave garbage are actually quite rare, and not much more frequent than in the current implementation where tasks overwrite each other's segments.
gianm left a comment
@dclim The code looks good to me, but please update the insert-segment-to-db.md doc. It needs to make it clear that it has a risk of breaking the exactly-once guarantees, and the preferred method of migrating data is to migrate the metadata store dump as well.
@gianm added docs; also refactored common Finder code into a base class.
@dclim can you please add a short section about when the leak can occur and how to find leaked segments? Thanks
IMO, the implementations would be less convoluted if the helpers in this file were static utility methods rather than inherited superclass methods. I'm in the school of thought where inheritance is best used when nothing else really makes sense (neither composition nor utility methods).
In particular, I feel that the pattern in play here -- an abstract class implements a method from an interface and then creates a new method for its subclasses to implement -- is usually more difficult to understand than utility methods that can be called directly by the potential subclasses. It breaks the connection between the subclasses and the interface they implement, and requires the reader to understand the link established in the abstract class too.
I won't hold up the patch over this though.
@gianm since it is a Kafka indexing issue I think we can add it to the new Kafka ingestion page, maybe as a section.
I'm ok with a section in the docs, just wondering what you were thinking. If we add one I don't think it needs to be too alarmist, because with the cleanup-after-nonpublishing feature in this patch, there shouldn't be too many unused segment files lying around. Probably not more than already existed with Kafka indexing.
@gianm yeah, as I said it is to help people operating the cluster (and help me know where to dig without reading the code base), and it might be good to have it as an issue that someone can tackle if they want to.
@gianm I agree with the base class making things more convoluted and refactored it; please check the last commit again when you get a chance
@gianm @b-slim I took another look through the code, and with the change Gian suggested to remove segments if we discover someone already published them before us, I can't find any real scenario in which garbage would be left behind, other than bugs in Druid's code, such as these sanity checks tripping. In those cases, if an exception was thrown after some segments had already been pushed, those segments would be orphaned. But my point here is that this is not something related to the changes in this patch at all.

After looking at it more closely, I'm leaning towards the opinion that it would actually cause more harm than good to explicitly call it out in documentation, since it would scare people unnecessarily. Folks who view this as a major issue should already have been running periodic consistency checks comparing deep storage to the metadata storage. I think it might be interesting to write a tool, sort of like a 'druid-fsck', that can validate the consistency of deep storage and clean up orphans, but that would be for another time.
Wait, I won't merge this, since there's a checkstyle error showing up in Travis. @dclim could you please take a look?
Travis is looking good; merging.
thanks @gianm
@dclim I tagged this 0.12.1 since I brought up on the dev list the prospect of including this there, and nobody argued. Does that sound reasonable to you & are you able to do a backport?
@gianm yes, that sounds good, will backport
* support unique segment file paths
* forbiddenapis
* code review changes
* code review changes
* code review changes
* checkstyle fix
#5187 was intended to fix an issue with Kafka indexing service where a task would push segments to deep storage and then fail, and then the subsequent retry task would attempt to push its segment (which contains a different set of offsets from the first task) and publish the metadata, but the push would not happen because the segment data already existed in deep storage. In this scenario, the metadata (most importantly the offset cursors) would not match what's actually in the segment and exactly-once semantics would be violated.
This was fixed by supporting file overwriting in deep storage, but it turns out that there are other cases where overwriting the existing segment is undesired behavior. As an example (which may or may not be happening in practice), there could be a task which pushes its segments and publishes metadata but then fails before it can notify the supervisor that it is done. If the supervisor then creates a retry task to attempt again, the task will eventually fail when the transactional metadata commit fails, but by this point it will have already pushed its version of the segments into deep storage, overwriting the good set of segments there. This will lead to a similar situation as the previous case, where the same segment ID could have different data on historicals and in deep storage, and where Kafka messages can be duplicated or missed.
The cleanest way to handle all these cases is to have each Kafka task write its segments to a separate location in deep storage so that the metadata published definitely matches the segment set and there's no interaction between tasks. If a task pushes and then the metadata publish fails, it will attempt to clean up the orphaned segments, but it is expected that in some unhandled exception cases, the segments may remain in deep storage and may need to be cleaned up manually.
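The "separate location in deep storage per task" idea can be sketched as a small path helper. This is a hypothetical illustration of the approach, not Druid's actual path-building code, and the path components used here are assumptions:

```java
import java.util.UUID;

public class UniqueSegmentPaths {
    /**
     * Hypothetical sketch: when useUniquePath is set (as for Kafka indexing
     * tasks), append a random suffix so a retry task never overwrites the
     * segments already pushed by an earlier attempt for the same segment ID.
     * The loadSpec recorded in the metadata store points at whichever path
     * was actually published; unreferenced paths are the orphans that may
     * need manual cleanup.
     */
    public static String storageDir(String dataSource, String interval, String version,
                                    int partitionNum, boolean useUniquePath) {
        String base = String.join("/", dataSource, interval, version, String.valueOf(partitionNum));
        return useUniquePath ? base + "/" + UUID.randomUUID() : base;
    }
}
```

Two pushes for the same segment ID with useUniquePath set land in distinct directories, so a replica and a retry can both push safely; only the transactional metadata commit decides which copy is live.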