Refactor SegmentId to more clearly distinguish between table and non-table segments: table segments must have a non-empty dataSource, while non-table segments must have an empty dataSource. #17954
This PR proposes adding a new field, dataSourceType, to SegmentId. It is an enum with the values TABLE, LOOKUP, INLINE, EXTERNAL, and FRAME. Only a TABLE segment can have a non-empty dataSource field; all other types must have an empty dataSource. PolicyEnforcer would use this to enforce Policy for table segments.
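The invariant described above can be sketched as follows. This is a minimal illustration, not Druid's actual SegmentId: the class name SketchSegmentId, the enum name SketchDataSourceType, and the factory method of are hypothetical, and the interval and partition fields are left out for brevity.

```java
import java.util.Objects;

// Hypothetical enum mirroring the five values named in the PR description.
enum SketchDataSourceType { TABLE, LOOKUP, INLINE, EXTERNAL, FRAME }

// Hypothetical, simplified stand-in for SegmentId (no interval or partition number).
final class SketchSegmentId {
    private final SketchDataSourceType dataSourceType;
    private final String dataSource;
    private final String version;

    private SketchSegmentId(SketchDataSourceType type, String dataSource, String version) {
        this.dataSourceType = Objects.requireNonNull(type, "dataSourceType");
        this.dataSource = Objects.requireNonNull(dataSource, "dataSource");
        this.version = Objects.requireNonNull(version, "version");
    }

    // Enforces: TABLE => non-empty dataSource, any other type => empty dataSource.
    static SketchSegmentId of(SketchDataSourceType type, String dataSource, String version) {
        boolean isTable = type == SketchDataSourceType.TABLE;
        if (isTable == dataSource.isEmpty()) {
            throw new IllegalArgumentException(
                "dataSource must be non-empty for TABLE and empty otherwise, got type="
                    + type + ", dataSource='" + dataSource + "'");
        }
        return new SketchSegmentId(type, dataSource, version);
    }

    SketchDataSourceType getDataSourceType() { return dataSourceType; }
    String getDataSource() { return dataSource; }
    String getVersion() { return version; }
}
```

With this shape, a policy check only needs to compare getDataSourceType() against TABLE instead of inspecting the dataSource string.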
Besides that:
Non-table segments can use version to store some info. For example, for a lookup segment, the version is the lookup name.
Deprecated the dummy factory method in SegmentId and replaced it with simple and simpleTable.
Only table segments work with the serde of toString and iterateAllPossibleParsings.
dataSourceType is ignored in toString.
When parsing a SegmentId string, the dataSource field can't be empty. This was not enforced before this PR. For example, before this PR, _2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_ver could be parsed into a SegmentId with an empty dataSource; after this PR, it cannot.
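The parsing rule in the last item can be illustrated with a small sketch. It is deliberately much simpler than Druid's real parser: it assumes the dataSource contains no underscore (which is exactly the ambiguity iterateAllPossibleParsings exists to handle), ignores partition numbers, and uses java.time.Instant for the interval bounds. ParsedSegmentIdSketch and tryParse are hypothetical names.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical holder for the parsed pieces; not Druid's SegmentId.
final class ParsedSegmentIdSketch {
    final String dataSource;
    final Instant start;
    final Instant end;
    final String version;

    ParsedSegmentIdSketch(String dataSource, Instant start, Instant end, String version) {
        this.dataSource = dataSource;
        this.start = start;
        this.end = end;
        this.version = version;
    }

    // Simplified parse of "dataSource_start_end_version".
    // Assumption: dataSource has no '_' in it; the version may contain '_'.
    static Optional<ParsedSegmentIdSketch> tryParse(String id) {
        String[] parts = id.split("_", 4);
        if (parts.length != 4) {
            return Optional.empty();
        }
        // The rule described in the PR: an empty dataSource is rejected.
        if (parts[0].isEmpty()) {
            return Optional.empty();
        }
        try {
            Instant start = Instant.parse(parts[1]);
            Instant end = Instant.parse(parts[2]);
            return Optional.of(new ParsedSegmentIdSketch(parts[0], start, end, parts[3]));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Under these assumptions, the string from the example above is rejected because its first component is empty, while the same string prefixed with a dataSource name parses.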
cecemei changed the title from "non-table segment should have empty dataSource" to "Refactor SegmentId to more clearly distinguish between table and non-table segments: table segments must have a non-empty dataSource, while non-table segments must have an empty dataSource." on Apr 29, 2025
I am a little skeptical of adding a new field to segment id, since it is used pretty much in the entirety of Druid.
Adding a new field which is going to be the same in all of the existing segments in the system feels like it is going to add a lot of unnecessary overhead in memory footprint.
Since the datasource field will always be empty for the non-table datasource types, how about we use some reserved datasource names for those cases instead?
The SegmentId class may still expose a method getDatasourceType() which will just return the appropriate value based on the value of the datasource. SegmentId can also have static create or of methods that create SegmentIds for specific datasource types. But we should avoid addition of a new field in the object.
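A rough sketch of this alternative, where the type is derived from the dataSource value and no extra field is stored. The reserved names used here ("__lookup", "__inline", "__external", "__frame") are invented for illustration only; the comment does not specify any, and ReservedNameSegmentId and DerivedDataSourceType are hypothetical names.

```java
import java.util.Map;

// Hypothetical enum; the values follow the list in the PR description.
enum DerivedDataSourceType { TABLE, LOOKUP, INLINE, EXTERNAL, FRAME }

// Hypothetical, simplified id that stores only the dataSource string.
final class ReservedNameSegmentId {
    // Assumption: these reserved names are made up for this sketch.
    private static final Map<String, DerivedDataSourceType> RESERVED = Map.of(
        "__lookup", DerivedDataSourceType.LOOKUP,
        "__inline", DerivedDataSourceType.INLINE,
        "__external", DerivedDataSourceType.EXTERNAL,
        "__frame", DerivedDataSourceType.FRAME);

    private final String dataSource;

    private ReservedNameSegmentId(String dataSource) {
        this.dataSource = dataSource;
    }

    // Factory for table segments; reserved names are not allowed as table names.
    static ReservedNameSegmentId ofTable(String dataSource) {
        if (dataSource.isEmpty() || RESERVED.containsKey(dataSource)) {
            throw new IllegalArgumentException("invalid table dataSource: '" + dataSource + "'");
        }
        return new ReservedNameSegmentId(dataSource);
    }

    // Factory for lookup segments; uses the reserved name instead of a new field.
    static ReservedNameSegmentId ofLookup() {
        return new ReservedNameSegmentId("__lookup");
    }

    // Derived on demand, so the object carries no additional per-instance state.
    DerivedDataSourceType getDatasourceType() {
        return RESERVED.getOrDefault(dataSource, DerivedDataSourceType.TABLE);
    }

    String getDataSource() { return dataSource; }
}
```

The trade-off is that the reserved names become unavailable as table names, in exchange for keeping the per-object footprint unchanged.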
The main purpose is to be able to identify whether a segment is backed by a table or not. It seems better to just return null for getId(), since a non-table segment should not have a valid SegmentId. I opened a new PR to implement this: #17960. This PR is therefore deprecated.
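As a rough illustration of that idea (not the code in #17960), a caller could treat a null id as "not table-backed". The interface and class names below are hypothetical, and the id is modeled as a plain String to keep the sketch self-contained.

```java
// Hypothetical minimal segment abstraction; getId() returns null for non-table segments.
interface SketchSegment {
    String getId();
}

// A table-backed segment always carries an id.
final class TableBackedSegmentSketch implements SketchSegment {
    private final String id;

    TableBackedSegmentSketch(String id) {
        this.id = id;
    }

    @Override
    public String getId() {
        return id;
    }
}

// A lookup-backed segment has no valid id in this model.
final class LookupBackedSegmentSketch implements SketchSegment {
    @Override
    public String getId() {
        return null;
    }
}

final class SegmentChecks {
    private SegmentChecks() {}

    // The table/non-table distinction reduces to a null check on the id.
    static boolean isTableBacked(SketchSegment segment) {
        return segment.getId() != null;
    }
}
```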
This PR is a follow-up to #17774 (review).
Key changed/added classes in this PR
SegmentId
This PR has: