KAFKA-14307; Controller time-based snapshots (#12761)
Conversation
```java
void handleSnapshotFinished(Optional<Exception> exception) {
    if (exception.isPresent()) {
        log.error("Error while generating snapshot {}", generator.lastContainedLogOffset(), exception.get());
```
If there is a snapshot failure, will we wait for the next interval before retrying? I wonder if we need a metric or something to get marked when this happens?
> If there is a snapshot failure, will we wait for the next interval before retrying?

Yeah. The snapshot counters `committedBytesSinceLastSnapshot` and `oldestCommittedLogOnlyAppendTimestamp` are updated assuming that snapshot generation cannot fail.

We could add another reason ("snapshot failure") that triggers a snapshot immediately, irrespective of the counters. I am concerned that this event/task may starve other controller events. Maybe it is better to just rely on timed snapshots and NoOpRecord to trigger another snapshot in the case of failures.
> If there is a snapshot failure, will we wait for the next interval before retrying?
Let me check the KIPs but I don't think we define this metric. I can add it and update one of the KIPs if we agree to it here.
How about this for the controller:
`kafka.controller:type=KafkaController,name=MetadataSnapshotGenerationErrors`
Incremented anytime the controller fails to generate a snapshot. Reset to zero anytime the controller restarts or a snapshot is successfully generated.
And this for the brokers:
`kafka.server:type=broker-metadata-metrics,name=snapshot-generation-errors`
Incremented anytime the broker fails to generate a snapshot. Reset to zero anytime the broker restarts or a snapshot is successfully generated.
@hachikuji do you mind if I implement this in a future PR? I created this issue: https://issues.apache.org/jira/browse/KAFKA-14403
Doing it separately sounds fine.
> `kafka.controller:type=KafkaController,name=MetadataSnapshotGenerationErrors`
I am concerned that with a counter that resets at restart, we might have an ever so small likelihood of never incrementing the metric even with the log moving and no snapshots generated e.g. if we keep crashing while generating a snapshot. Ditto for the broker.
What do you think about having a `kafka.controller:type=KafkaController,name=LastSnapshotTime` metric instead?
This metric would just emit the timestamp (Unix epoch) of when the last snapshot was successfully written.
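A metric along these lines could be backed by a single atomic timestamp that is only updated on success. The sketch below is illustrative only: the class and method names are hypothetical and this is not Kafka's actual metrics API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a "last successful snapshot time" gauge.
public class SnapshotTimeGauge {
    // -1 means "no snapshot generated since this process started".
    private final AtomicLong lastSnapshotTimeMs = new AtomicLong(-1L);

    // Called by the snapshot generator only after a snapshot is durably written.
    public void recordSnapshotSuccess(long nowMs) {
        lastSnapshotTimeMs.set(nowMs);
    }

    // Value emitted by the metric. A monitoring system can alert when
    // (now - value) exceeds the configured snapshot interval, which also
    // catches the crash-loop case a restart-reset counter would miss.
    public long value() {
        return lastSnapshotTimeMs.get();
    }
}
```

Because the gauge is a timestamp rather than a counter, a process that keeps crashing mid-snapshot still shows a stale value, addressing the concern above.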
Okay. Let's move this discussion to the DISCUSSION thread in Kafka dev. I'll send a message next week.
```java
if (offset >= raftClient.latestSnapshotId().map(OffsetAndEpoch::offset).orElse(0L)) {
    oldestCommittedLogOnlyAppendTimestamp = Math.min(
```
Why "LogOnlyAppendTimestamp" instead of "LogAppendTimestamp"?
It is the oldest timestamp in the log that is not included in a snapshot. I was thinking that `oldestCommittedLogAppendTimestamp` could mislead the reader if they don't read the description of that variable:
```java
/**
 * Timestamp for the oldest record that was committed but not included in a snapshot.
 */
private long oldestCommittedLogOnlyAppendTimestamp = Long.MAX_VALUE;
```
nit: Maybe overkill, but how about something even more verbose like `oldestNonSnapshottedLogAppendTimestamp`?
The current name still made me go read the description.
I was thinking of a similar name when I was implementing this feature but I felt it was too verbose. I don't have a strong opinion here so I can change it to whatever the reader prefers.
I do think `oldestNonSnapshottedLogAppendTimestamp` is a little clearer. Perhaps the fact that it is an append timestamp is already clear from context and we could use `oldestNonSnapshottedTimestamp`? I also don't feel strongly though.
I picked `oldestNonSnapshottedTimestamp`. Updated the PR.
niket-goel left a comment:
Thanks for the changes @jsancio ! Left a few minor comments.
```java
// Delete every in-memory snapshot up to the committed offset. They are not needed since this
// snapshot generation finished.
snapshotRegistry.deleteSnapshotsUpTo(lastCommittedOffset);
```
Question: Is this OK to do even if the snapshot never succeeded? I think the answer would be yes, but just confirming.
Yes. The controller needs to keep in-memory snapshots older than the last committed offset because a snapshot generator may be iterating over those timelined values. Once we know that there are no pending snapshots (either because generation succeeded or because it failed), the controller can delete in-memory snapshots up to the committed offset.
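The retention rule described here can be sketched with a sorted map keyed by the last offset each in-memory snapshot contains. This is a minimal illustration under assumed names; it is not the actual `SnapshotRegistry` API.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of in-memory snapshot retention.
public class InMemorySnapshotRegistry {
    // In-memory snapshots keyed by the offset they cover. Older entries must
    // survive while an on-disk snapshot generator is iterating over them.
    private final NavigableMap<Long, Object> snapshots = new TreeMap<>();

    public void addSnapshot(long offset, Object state) {
        snapshots.put(offset, state);
    }

    // Called only once on-disk snapshot generation finishes, whether it
    // succeeded or failed: the generator is no longer reading the timelined
    // values, so everything strictly below the committed offset can go.
    public void deleteSnapshotsUpTo(long lastCommittedOffset) {
        snapshots.headMap(lastCommittedOffset, false).clear();
    }

    public int size() {
        return snapshots.size();
    }
}
```

The key point is that the prune happens on completion (success or failure), never while a generator might still hold a reference to an older snapshot.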
```java
committedBytesSinceLastSnapshot = 0;
if (!snapshotReasons.isEmpty()) {
    if (!isActiveController()) {
        // The active controller creates in-memory snapshot every time an uncommitted
```
Nit: it would be a little easier to read if the comment was rephrased to state what is happening here:

```java
// The standby controllers do not create in-memory snapshots every time a
// batch gets appended, so we create it now.
```
I reordered the sentences and improved the wording. I kept both sentences because I think they are important for anyone interested in understanding when the controller generates and deletes in-memory snapshots.
```scala
val MetadataSnapshotMaxIntervalMsDoc = "This is the maximum number of milliseconds to wait to generate a snapshot " +
  "if there are committed records in the log that are not included in the latest snapshot. A value of zero disables " +
  s"time based snapshot generation. The default value is ${Defaults.MetadataSnapshotMaxIntervalMs}. To generate " +
  s"snapshots based on the number of metadata bytes, see the <code>$MetadataSnapshotMaxNewRecordBytesProp</code> " +
```
Instead of just referring to the other configuration, I was thinking we could mention that snapshots will be taken when either the interval is reached or the max bytes limit is reached.
Done. Added a sentence to both descriptions explaining this.
```java
        cancelNextGenerateSnapshot();
    }
} else {
    /* Skip snapshot generation if there is a snapshot in progress.
```
nit: I think this would be a little clearer if the if check is inverted:

```java
if (snapshotGeneratorManager.snapshotInProgress()) {
    /* Skip snapshot generation if there is a snapshot in progress.
    ...
} else {
```
```java
// The snapshot counters for size-based and time-based snapshots could have changed to cause a new
// snapshot to get generated.
maybeGenerateSnapshot();
```
In case there was a failure, does it make sense to back off before retrying?
The quorum controller resets the size-based (`committedBytesSinceLastSnapshot`) and time-based (`oldestNonSnapshottedTimestamp`) variables when it starts a snapshot. If the snapshot fails, these variables were still reset when snapshot generation started.

This acts as a throttle: snapshot generation will be triggered at most as often as described by `metadata.log.max.record.bytes.between.snapshots` and `metadata.log.max.snapshot.interval.ms`, even if the snapshot happens to fail.
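The reset-at-start behavior can be sketched with a size-based counter; this is an illustration of the throttling idea under assumed names, not the actual quorum controller fields.

```java
// Hypothetical sketch: resetting the counter when a snapshot *starts* means a
// failed snapshot cannot trigger an immediate retry; the counter must cross
// the threshold again first.
public class SnapshotThrottle {
    private final long maxBytesBetweenSnapshots;  // e.g. metadata.log.max.record.bytes.between.snapshots
    private long committedBytesSinceLastSnapshot = 0;

    public SnapshotThrottle(long maxBytesBetweenSnapshots) {
        this.maxBytesBetweenSnapshots = maxBytesBetweenSnapshots;
    }

    // Called as committed batches are replayed.
    public void onCommit(long batchSizeInBytes) {
        committedBytesSinceLastSnapshot += batchSizeInBytes;
    }

    // Returns true if a snapshot should start now. The counter is reset here,
    // at the start, regardless of whether the snapshot later succeeds or fails.
    public boolean maybeStartSnapshot() {
        if (committedBytesSinceLastSnapshot >= maxBytesBetweenSnapshots) {
            committedBytesSinceLastSnapshot = 0;  // reset at start, not at success
            return true;
        }
        return false;
    }
}
```

A failed snapshot therefore waits for either more committed bytes or the time-based trigger, exactly the throttle described above.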
Implement time-based snapshots for the controller. The general strategy for this feature is that the controller will use the record batch's append time to determine if a snapshot should be generated. If the oldest record that has been committed but is not included in the latest snapshot is older than `metadata.log.max.snapshot.interval.ms`, the controller will trigger a snapshot immediately. This is useful in case the controller was offline for more than `metadata.log.max.snapshot.interval.ms` milliseconds. If the oldest record that has been committed but is not included in the latest snapshot is NOT older than `metadata.log.max.snapshot.interval.ms`, the controller will schedule a `maybeGenerateSnapshot` deferred task.

It is possible that when the controller wants to generate a new snapshot, either because of time or number of bytes, the controller is currently generating a snapshot. In this case the `SnapshotGeneratorManager` was changed so that it checks and potentially triggers another snapshot when the currently in-progress snapshot finishes.

To better support this feature the following additional changes were made:
1. The configuration `metadata.log.max.snapshot.interval.ms` was added to `KafkaConfig` with a default value of one hour.
2. `RaftClient` was extended to return the latest snapshot id. This snapshot id is used to determine if a given record is included in a snapshot.
3. Improved the `SnapshotReason` type to support the inclusion of values in the message.

Reviewers: Jason Gustafson <jason@confluent.io>, Niket Goel <niket-goel@users.noreply.github.com>
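The time-based decision in the commit message can be sketched as follows. The class and method names are hypothetical and the scheduling is simplified; the real code lives in the quorum controller's event loop.

```java
// Hypothetical sketch of the time-based snapshot decision.
public class TimeBasedSnapshotPolicy {
    private final long maxSnapshotIntervalMs;  // metadata.log.max.snapshot.interval.ms
    private long oldestNonSnapshottedTimestamp = Long.MAX_VALUE;

    public TimeBasedSnapshotPolicy(long maxSnapshotIntervalMs) {
        this.maxSnapshotIntervalMs = maxSnapshotIntervalMs;
    }

    // Track the append time of the oldest committed record not yet in a snapshot.
    public void onCommittedBatch(long appendTimestampMs) {
        oldestNonSnapshottedTimestamp =
            Math.min(oldestNonSnapshottedTimestamp, appendTimestampMs);
    }

    // Returns 0 if a snapshot should be generated immediately (the oldest
    // non-snapshotted record is already older than the interval, e.g. the
    // controller was offline for a long time); otherwise the delay in ms for
    // a deferred maybeGenerateSnapshot task. Long.MAX_VALUE means "nothing
    // to schedule" (feature disabled or no pending records).
    public long delayUntilSnapshotMs(long nowMs) {
        if (maxSnapshotIntervalMs == 0 || oldestNonSnapshottedTimestamp == Long.MAX_VALUE) {
            return Long.MAX_VALUE;
        }
        return Math.max(0L, oldestNonSnapshottedTimestamp + maxSnapshotIntervalMs - nowMs);
    }
}
```

Resetting `oldestNonSnapshottedTimestamp` back to `Long.MAX_VALUE` when a snapshot starts (not shown) is what makes the policy fire at most once per interval.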