Skip to content

Refactor SegmentId to more clearly distinguish between table and non-table segments: table segments must have a non-empty dataSource, while non-table segments must have an empty dataSource.#17954

Closed
cecemei wants to merge 6 commits intoapache:masterfrom
cecemei:segment-id

Conversation

@cecemei
Copy link
Copy Markdown
Contributor

@cecemei cecemei commented Apr 29, 2025

This PR proposes to add a new field dataSourceType in SegmentId, which is an enum of TABLE, LOOKUP, INLINE, EXTERNAL, FRAME. Only TABLE segment can have non-empty dataSource field, others must have empty dataSource. This would be used in PolicyEnforcer to enforce Policy for table segment.

Besides that:

  • Non-table segments can use version to store some info. E.x. for lookup segment, the version is the lookup name.
  • Deprecated the usage of dummy factory method in SegmentId, and replace it with simple and simpleTable.
  • Only table segment works with the serde of toString and iterateAllPossibleParsings.
  • dataSourceType is ignored in toString.
  • When parsing a SegmentId string, the dataSource field can't be empty. This was not enforced before this PR. E.x. before this PR, _2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_ver can be parsed into a SegmentId with empty dataSource, after this PR, it cannot.

This PR is a follow-up to #17774 (review).

Key changed/added classes in this PR
  • SegmentId

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Apr 29, 2025
@cecemei cecemei marked this pull request as ready for review April 29, 2025 17:30
@cecemei cecemei changed the title non-table segment should have empty dataSource Refactor SegmentId to more clearly distinguish between table and non-table segments: table segments must have a non-empty dataSource, while non-table segments must have an empty dataSource. Apr 29, 2025
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Apr 30, 2025

I am a little skeptical of adding a new field to segment id, since it is used pretty much in the entirety of Druid.
Adding a new field which is going to be the same in all of the existing segments in the system feels like it is going to add a lot of unnecessary overhead in memory footprint.

Since the datasource field will always be empty for the non-table datasource types, how about we use some reserved datasource names for those cases instead?

The SegmentId class may still expose a method getDatasourceType() which will just return the appropriate value based on the value of the datasource. SegmentId can also have static create or of methods that create SegmentIds for specific datasource types. But we should avoid addition of a new field in the object.

cc: @gianm , @clintropolis

@cecemei
Copy link
Copy Markdown
Contributor Author

cecemei commented Apr 30, 2025

The main purpose is to be able to identify whether a segment is backed by a table or not, which seems like might be better if we just return null for getId(), non-table segment should not have a valid SegmentId. I opened a new PR to implement it: #17960. This PR is therefore deprecated.

I am a little skeptical of adding a new field to segment id, since it is used pretty much in the entirety of Druid. Adding a new field which is going to be the same in all of the existing segments in the system feels like it is going to add a lot of unnecessary overhead in memory footprint.

Since the datasource field will always be empty for the non-table datasource types, how about we use some reserved datasource names for those cases instead?

The SegmentId class may still expose a method getDatasourceType() which will just return the appropriate value based on the value of the datasource. SegmentId can also have static create or of methods that create SegmentIds for specific datasource types. But we should avoid addition of a new field in the object.

cc: @gianm , @clintropolis

@cecemei cecemei marked this pull request as draft April 30, 2025 22:55
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented May 2, 2025

Closing this PR in favor of #17960 .
@cecemei , please feel free to reopen if needed.

@kfaraz kfaraz closed this May 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Design Review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants