AppenderatorDriver metadata loses information about publishing segments before they are published #4743

@pjain1

Description

I see a problem with the way AppenderatorDriver currently manages metadata about segments that are being published. Say a task calls AppenderatorDriver to publish the segments for a sequence; the driver then removes that sequence's information from the activeSegments and publishPendingSegments maps. If the task is restarted after the in-memory data has been persisted and the metadata committed, but before any mergeAndPush or publish happens, and the restarted task again tells the driver to publish the same sequence, the driver has no way of knowing which segments to publish.

If you look at the code for the publish or publishAll method of AppenderatorDriver, the first thing it does is remove the sequence information from the activeSegments and publishPendingSegments maps. After that, push is called on the Appenderator with a wrapped committer that contains the driver metadata (with the sequence information already removed). Inside push, the in-memory data is persisted using the persist method, which also commits the metadata to disk. So if the task is restarted at this point, the restored metadata may be incomplete because this sequence's information will not be restored, and any further call to publish this sequence will not do anything.
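The ordering problem can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: LossyPublishSketch, committedMetadata, restore and the restartAfterPersist flag are hypothetical names made up for illustration, segments are plain strings, and only publishPendingSegments is snapshotted. It is not the actual AppenderatorDriver code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the driver; not the real AppenderatorDriver.
class LossyPublishSketch {
    // sequence name -> segment identifiers (plain strings here for brevity)
    final Map<String, List<String>> activeSegments = new HashMap<>();
    final Map<String, List<String>> publishPendingSegments = new HashMap<>();
    // Assumption: stands in for the driver metadata that persist commits to disk.
    Map<String, List<String>> committedMetadata = new HashMap<>();

    void add(String sequence, String segmentId) {
        activeSegments.computeIfAbsent(sequence, k -> new ArrayList<>()).add(segmentId);
        publishPendingSegments.computeIfAbsent(sequence, k -> new ArrayList<>()).add(segmentId);
    }

    // Mirrors the order described above: remove first, then persist and commit, then publish.
    List<String> publish(String sequence, boolean restartAfterPersist) {
        activeSegments.remove(sequence);
        List<String> toPublish = publishPendingSegments.remove(sequence);
        // The committed snapshot no longer mentions the sequence.
        committedMetadata = copyOf(publishPendingSegments);
        if (restartAfterPersist) {
            throw new IllegalStateException("simulated restart before mergeAndPush");
        }
        return toPublish == null ? Collections.<String>emptyList() : toPublish;
    }

    // Rebuilds a driver from the committed snapshot, as a restarted task would.
    static LossyPublishSketch restore(Map<String, List<String>> committed) {
        LossyPublishSketch driver = new LossyPublishSketch();
        driver.publishPendingSegments.putAll(copyOf(committed));
        driver.committedMetadata = copyOf(committed);
        return driver;
    }

    static Map<String, List<String>> copyOf(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new HashMap<>();
        source.forEach((k, v) -> copy.put(k, new ArrayList<>(v)));
        return copy;
    }
}
```

In this model, adding a segment to a sequence, calling publish with the simulated restart, and then restoring from committedMetadata gives a driver whose next publish of that sequence returns an empty list.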

One way to resolve this would be to maintain an additional in-memory structure that holds the sequences currently being published. On a call to publish, the sequence information is removed from the activeSegments map but not from publishPendingSegments, and the sequence is added to the new in-memory structure. On a successful publish, the sequence information is removed from both publishPendingSegments and the in-memory structure. Any call to publish should try to publish the sequence if it is not in the in-memory structure (it may not be in the activeSegments map if the task was restarted at the point described above).
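A rough sketch of this proposal, in the same simplified style, could look like the following. All names here (TrackedPublishSketch, publishingSequences, committedMetadata, restore, restartAfterPersist) are hypothetical, and the real change would have to fit the driver's actual committer and metadata classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed fix; not the real AppenderatorDriver.
class TrackedPublishSketch {
    final Map<String, List<String>> activeSegments = new HashMap<>();
    final Map<String, List<String>> publishPendingSegments = new HashMap<>();
    // The proposed extra structure: in memory only, never written to the metadata.
    final Set<String> publishingSequences = new HashSet<>();
    // Assumption: stands in for the driver metadata that persist commits to disk.
    Map<String, List<String>> committedMetadata = new HashMap<>();

    void add(String sequence, String segmentId) {
        activeSegments.computeIfAbsent(sequence, k -> new ArrayList<>()).add(segmentId);
        publishPendingSegments.computeIfAbsent(sequence, k -> new ArrayList<>()).add(segmentId);
    }

    List<String> publish(String sequence, boolean restartAfterPersist) {
        if (!publishingSequences.add(sequence)) {
            // A publish for this sequence is already in flight.
            return Collections.<String>emptyList();
        }
        activeSegments.remove(sequence);
        // Keep the entry in publishPendingSegments until the publish succeeds.
        List<String> toPublish = publishPendingSegments.get(sequence);
        committedMetadata = copyOf(publishPendingSegments);
        if (restartAfterPersist) {
            throw new IllegalStateException("simulated restart before mergeAndPush");
        }
        // Publish succeeded: forget the sequence in both structures.
        publishPendingSegments.remove(sequence);
        publishingSequences.remove(sequence);
        return toPublish == null ? Collections.<String>emptyList() : toPublish;
    }

    // A restarted task starts with an empty publishingSequences set.
    static TrackedPublishSketch restore(Map<String, List<String>> committed) {
        TrackedPublishSketch driver = new TrackedPublishSketch();
        driver.publishPendingSegments.putAll(copyOf(committed));
        driver.committedMetadata = copyOf(committed);
        return driver;
    }

    static Map<String, List<String>> copyOf(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new HashMap<>();
        source.forEach((k, v) -> copy.put(k, new ArrayList<>(v)));
        return copy;
    }
}
```

Because the committed snapshot still contains the sequence, a driver restored after the simulated restart can publish it again, and the in-memory set keeps a second concurrent call from publishing the same sequence twice.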
