Skip to content

realtime cleanup and renaming#16652

Closed
clintropolis wants to merge 5 commits intoapache:masterfrom
clintropolis:names-are-important
Closed

realtime cleanup and renaming#16652
clintropolis wants to merge 5 commits intoapache:masterfrom
clintropolis:names-are-important

Conversation

@clintropolis
Copy link
Copy Markdown
Member

changes:

  • FireHydrant is now PartialSegment. This name much more clearly describes what this class does, and with all the other fireman terminology removed it didn't even fit a theme anymore.
  • Sink is now AppendableSegment. This name also much more clearly describes what this class does, and is composed of PartialSegments per the previous FireHydrant rename.
  • Additionally, SinkQuerySegmentWalker -> AppendableSegmentQuerySegmentWalker, and SinkQueryRunner -> AppendableSegmentQueryRunner
  • Remove Firehose, IngestSegmentFirehose was only used by Hadoop indexing DruidRecordReader, moved to internal class of DruidRecordReader as SegmentReader
  • Remove FirehoseFactory and remaining implementations, after Remove index_realtime and index_realtime_appenderator tasks #16602 they were no longer used
  • Moved things from org.apache.druid.segment.realtime.sink and org.apache.druid.segment.realtime.firehose up to org.apache.druid.segment.realtime.

Discussion: should we move metric ingest/sink/count to ingest/appendableSegment/count or something similar? That is kind of disruptive since there is like a continuity of measurement issue which is kind of sad. We could also always double emit it, but also sad in its own way.

Release note

todo

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

changes:
* `FireHydrant` is now `PartialSegment`. This name much more clearly describes what this class does, and with all the other fireman terminology removed it didn't even fit a theme anymore.
* `Sink` is now `AppendableSegment`. This name also much more clearly describes what this class does, and is composed of `PartialSegments` per the previous `FireHydrant` rename.
* Additionally, `SinkQuerySegmentWalker` -> `AppendableSegmentQuerySegmentWalker`, and `SinkQueryRunner` -> `AppendableSegmentQueryRunner`
* Remove `Firehose`, `IngestSegmentFirehose` was only used by Hadoop indexing `DruidRecordReader`, moved to internal class of `DruidRecordReader` as `SegmentReader`
* Remove `FirehoseFactory` and remaining implementations, after apache#16602 they were no longer used
* Moved things from `org.apache.druid.segment.realtime.sink` and `org.apache.druid.segment.realtime.firehose` up to `org.apache.druid.segment.realtime`.
expectedException.expectMessage("segment.close() is called somewhere outside PartialSegment.swapSegment()");
partialSegment.getPartialSegment().close();
Assert.assertEquals(0, partialSegment.getPartialSegment().getNumReferences());
ReferenceCountingSegment segment = partialSegment.getIncrementedSegment();

Check notice

Code scanning / CodeQL

Unread local variable

Variable 'ReferenceCountingSegment segment' is never read.
final int hydrantNumber = Integer.parseInt(hydrantDir.getName());
List<PartialSegment> partialSegments = new ArrayList<>();
for (File partialSegmentDir : appendableSegmentFiles) {
final int partialSegmentNumber = Integer.parseInt(partialSegmentDir.getName());

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException

Potential uncaught 'java.lang.NumberFormatException'.
final int hydrantNumber = Integer.parseInt(hydrantDir.getName());
List<PartialSegment> partialSegments = new ArrayList<>();
for (File partialSegmentDir : appendableSegmentFiles) {
final int partialSegmentNumber = Integer.parseInt(partialSegmentDir.getName());

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException

Potential uncaught 'java.lang.NumberFormatException'.
final int hydrantNumber = Integer.parseInt(hydrantDir.getName());
List<PartialSegment> partialSegments = new ArrayList<>();
for (File partialSegmentDir : appendableSegmentFiles) {
final int partialSegmentNumber = Integer.parseInt(partialSegmentDir.getName());

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException

Potential uncaught 'java.lang.NumberFormatException'.
}

private int calculateSinkMemoryInUsed(Sink sink)
private int calculateAppendableSegmentMemoryInUsed(AppendableSegment appendableSegment)

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'appendableSegment' is never used.
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jun 25, 2024

Discussion: should we move metric ingest/sink/count to ingest/appendableSegment/count or something similar? That is kind of disruptive since there is like a continuity of measurement issue which is kind of sad. We could also always double emit it, but also sad in its own way.

I think it is okay to deprecate the older metric but keep emitting it along with the new one for a couple of releases.
We are already emitting a good number of metrics, one more is not going to hurt.
But at least this way we would be able to eventually get rid of the old one.

Also, @clintropolis , what do you think about calling it RealtimeSegment instead of AppendableSegment?
"Realtime segment" is a term that Druid community is already familiar with.
"Appendable segment" seems slightly ambiguous as it could be misinterpreted by some to mean a segment to which more data can be added even after it has been committed.

@clintropolis
Copy link
Copy Markdown
Member Author

Also, @clintropolis , what do you think about calling it RealtimeSegment instead of AppendableSegment?
"Realtime segment" is a term that Druid community is already familiar with.
"Appendable segment" seems slightly ambiguous as it could be misinterpreted by some to mean a segment to which more data can be added even after it has been committed.

My reasoning for AppendableSegment was because they are built with an Appenderator. Also "realtime" isn't really appropriate because they are also created during batch processing. I would be ok changing the metric name to something else, but I think the current name is more appropriate in code at least. I am open to other suggestions though, I considered 'mutable' segment but that is really only part of the AppendableSegment which is composed of one mutable PartialSegment and several other immutable intermediary persisted PartialSegment (using the new naming).

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jun 26, 2024

Also "realtime" isn't really appropriate because they are also created during batch processing.

Yeah, I wondered if this was the case, didn't see that BatchAppenderator itself is using sinks.
If I am not mistaken, a sink should always corresponding to a pending segment i.e. an entry in the pendingSegments metadata table. So maybe PendingSegment could be a better name?

@clintropolis
Copy link
Copy Markdown
Member Author

clintropolis commented Jun 26, 2024

If I am not mistaken, a sink should always corresponding to a pending segment i.e. an entry in the pendingSegments metadata table. So maybe PendingSegment could be a better name?

PendingSegment doesn't seem quite right to me because it seems like an odd name for the role that this thing plays in code. I think my problem here is the word 'pending' which seems like something you wouldn't want to be using in stuff like query processing because it sounds like something that isn't 'official' yet - which makes sense for how it is used for task locks, but less so for the internal code stuff imo.

What don't you like about AppendableSegment as an internal code construct name? Should we rename Appenderator too? I don't think it would be useful for the docs to talk about them as this and agree it might be confusing, but I don't think the docs widely called them Sink either since this is just internal naming stuff.

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jun 26, 2024

I think my problem here is the word 'pending' which seems like something you wouldn't want to be using in stuff like query processing because it sounds like something that isn't 'official' yet

Agreed. I have myself been on the fence about the 'pending' part too as it is certainly confusing. But then, I am not sure what a better alternative would have been either (maybe open, appending or appendable itself 😛 ).

In that regard, I do like AppendableSegment as it does clarify that this is a segment to which we can currently append more rows. I just wanted to be sure before we introduced a new(-ish) term in the code.

You seem to have given it enough thought and I do agree with the arguments against the other options, so let's go ahead with AppendableSegment!

I don't think we need to rename Appenderator as it works fine for its role.

@clintropolis
Copy link
Copy Markdown
Member Author

closing in favor of #16758

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants