Modify DataSegmentProvider to also return DataSegment #17021
adarshsanjeev merged 11 commits into apache:master
 * Contains the {@link DataSegment} and {@link Segment}. The datasegment could be null if the segment is a dummy, such
 * as those created by {@link org.apache.druid.msq.input.inline.InlineInputSliceReader}.
 */
public class SegmentWithMetadata implements Closeable
@findingrish Do you think there is an existing class that already has these entries?
There isn't a class which has these entries.
On naming, there are existing classes that follow a similar pattern. For example, SegmentWithDescriptor encapsulates Segment and RichSegmentDescriptor.
Classes with a similar naming pattern:
SegmentWithState
SegmentWithDescriptor
DataSegmentWithMetadata
DataSegmentWithLocation
DataSegmentsWithSchemas
A segment is represented by two classes: Segment, which is the body, and DataSegment, which is the metadata. Since this class combines the two, how about calling it CompleteSegment or FullSegment?
Also, I feel this POJO can be kept in the processing module for reuse.
There is a lot of duplicate information between Segment and DataSegment. I think the only things DataSegment has that aren't available in Segment are the shard spec and compaction state. There is also a Metadata class associated with Segment, which makes this name a bit confusing.
Why not just extract what you need from DataSegment instead of including the whole thing? Granted, I don't have much context for this change. It just seems a strange combination, but it's possible I'm missing something.
Most use cases would require the DataSegment object itself to be passed, for example, returning it from a frameProcessor. We could cut it down to the few fields that are not duplicated, but we would then likely need to recreate the DataSegment somewhere else.
@clintropolis Does Segment provide LoadSpec, dimensions/metric names, binary version and size somewhere?
I guess we could convert the segment to a QueryableIndex to get the dimensions and metric names, is that what you had in mind?
Segment doesn't have LoadSpec, size, or binary version. The LoadSpec from DataSegment is more or less how the Segment is made in the first place, which I think is why this composite type seems a bit strange. DataSegment is what tells something to load the actual segment, so it seems kind of funny to drag it into processing. Maybe I don't understand the use case behind these changes well enough: DataSegment isn't really used for processing today beyond loading the data, and it isn't obvious why DataSegment, or things like its LoadSpec and binary version, would be needed to process data once you have the Segment.
Renamed the class
Overall the approach seems good to me.
Added to Druid 31 as this is needed for a clean backport of #17152.
Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.
Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
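For readers without the diff at hand, the shape of the refactor described above can be sketched roughly as follows. This is not the code from this PR: `DataSegmentStub`, `SegmentStub`, `SegmentAndMetadata`, and `ProviderSketch` are hypothetical stand-ins for illustration only, and the real Druid classes carry far more state. The sketch only shows the idea of returning the already-fetched metadata together with the loaded segment, with the metadata allowed to be absent for dummy segments.

```java
import java.io.Closeable;
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-in for the segment metadata (DataSegment in Druid).
final class DataSegmentStub {
    final String id;
    final long sizeBytes;

    DataSegmentStub(String id, long sizeBytes) {
        this.id = id;
        this.sizeBytes = sizeBytes;
    }
}

// Hypothetical stand-in for the loaded segment body (Segment in Druid).
final class SegmentStub implements Closeable {
    private boolean closed = false;

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Hypothetical holder pairing the loaded segment with its metadata.
// The metadata may be null for dummy segments that have no backing DataSegment.
final class SegmentAndMetadata implements Closeable {
    private final DataSegmentStub dataSegment; // nullable by design
    private final SegmentStub segment;

    SegmentAndMetadata(DataSegmentStub dataSegment, SegmentStub segment) {
        this.dataSegment = dataSegment;
        this.segment = Objects.requireNonNull(segment, "segment");
    }

    Optional<DataSegmentStub> getDataSegment() {
        return Optional.ofNullable(dataSegment);
    }

    SegmentStub getSegment() {
        return segment;
    }

    @Override
    public void close() {
        // Closing the holder releases the underlying segment.
        segment.close();
    }
}

// Hypothetical provider: fetches metadata once and returns it with the segment,
// instead of discarding it after the load.
final class ProviderSketch {
    static int metadataFetches = 0;

    // Stands in for the metadata lookup; counts calls so reuse can be observed.
    static DataSegmentStub fetchMetadata(String segmentId) {
        metadataFetches++;
        return new DataSegmentStub(segmentId, 1024L);
    }

    static SegmentAndMetadata fetch(String segmentId) {
        DataSegmentStub metadata = fetchMetadata(segmentId);
        SegmentStub body = new SegmentStub(); // stands in for loading the segment data
        return new SegmentAndMetadata(metadata, body);
    }
}
```

With a holder like this, a caller that needs the metadata reads it from the returned object rather than issuing a second lookup, and callers that only need the segment body are unaffected.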
This PR has: