Skip to content

Modify DataSegmentProvider to also return DataSegment#17021

Merged
adarshsanjeev merged 11 commits intoapache:masterfrom
adarshsanjeev:msq-datasegment-change
Sep 18, 2024
Merged

Modify DataSegmentProvider to also return DataSegment#17021
adarshsanjeev merged 11 commits intoapache:masterfrom
adarshsanjeev:msq-datasegment-change

Conversation

@adarshsanjeev
Copy link
Copy Markdown
Contributor

@adarshsanjeev adarshsanjeev commented Sep 9, 2024

Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Sep 9, 2024
* Contains the {@link DataSegment} and {@link Segment}. The datasegment could be null if the segment is a dummy, such
* as those created by {@link org.apache.druid.msq.input.inline.InlineInputSliceReader}.
*/
public class SegmentWithMetadata implements Closeable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findingrish Do you think there is an existing class which has these enteries ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a class which has these entries.

On the naming aspect, I find classes with similar naming pattern, for example, SegmentWithDescriptor encapsulates Segment and RichSegmentDescriptor.

Classes with similar naming pattern,
SegmentWithState
SegmentWithDescriptor
DataSegmentWithMetadata
DataSegmentWithLocation
DataSegmentsWithSchemas

A segment is represented using two classes, Segment which is the body and DataSegment which is the metadata, since this class combines the two, how about calling it CompleteSegment or FullSegment?

Also, I feel this POJO can be kept in the processing module for reuse.

Copy link
Copy Markdown
Member

@clintropolis clintropolis Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a lot of duplicate information between Segment and DataSegment, I think the only stuff DataSegment has that isn't available in Segment is the shardspec and compaction state, and also there is a Metadata class associated with Segment which makes this name a bit confusing.

Why not just extract what you need from DataSegment instead of including the whole thing? Granted, I don't have much context for this change, it just seems a strange combination, but is possible i'm missing something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most usecases would require theDataSegment object itself to be passed, for example, returning this from a frameProcessor. We could cut down to the few objects which are not duplicated, but we would likely need to recreate the DataSegment again somewhere else if we do that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis Does Segment provide LoadSpec, dimensions/metric names, binary version and size somewhere?

I guess we could convert the segment to a QueryableIndex to get the dimensions and metric names, is that what you had in mind?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Segment doesn't have LoadSpec or size or binary version; LoadSpec from DataSegment is how the Segment is made in the first place more or less, which I think is why this composite type seems a bit strange. DataSegment is the stuff to tell something to load the actual segment, so it seems kind of funny to drag it into processing. Maybe I don't understand the use case behind these changes enough since DataSegment isn't really used at all today for processing beyond loading the data, and it isn't really obvious what the need for DataSegment or its stuff like LoadSpec and binary version and stuff is to process data once you have the Segment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the class

@cryptoe
Copy link
Copy Markdown
Contributor

cryptoe commented Sep 10, 2024

Overall the approach seems good to me.

@adarshsanjeev adarshsanjeev merged commit 2f50138 into apache:master Sep 18, 2024
Copy link
Copy Markdown
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@kfaraz kfaraz added this to the 31.0.0 milestone Oct 2, 2024
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Oct 2, 2024

Added to Druid 31 as this is needed for a clean backport of #17152

kfaraz pushed a commit to kfaraz/druid that referenced this pull request Oct 2, 2024
Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.
kfaraz added a commit that referenced this pull request Oct 2, 2024
Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.

Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Segment Format and Ser/De

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants