Implement IngesterAffinity broadcast#6152
Implement IngesterAffinity broadcast#6152nadav-govari merged 8 commits intonadav/feature-node-based-routingfrom
Conversation
| pub const INGESTER_PRIMARY_SHARDS_PREFIX: &str = "ingester.primary_shards:"; | ||
|
|
||
| /// Key used in chitchat to broadcast the ingester affinity score and open shard counts. | ||
| pub const INGESTER_AFFINITY_PREFIX: &str = "ingester.affinity"; |
There was a problem hiding this comment.
We already use the word affinity for searchers split affinity. I think we can find another ok name for this metric that we don't use already.
There was a problem hiding this comment.
Yep, how's ingester capacity? As in, literally the capacity of the ingester to ingest new requests.
There was a problem hiding this comment.
Renamed the task to BroadcastIngesterCapacity and all references from affinity to capacity.
|
|
||
| pub type OpenShardCounts = Vec<(IndexUid, SourceId, usize)>; | ||
|
|
||
| const WAL_CAPACITY_LOOKBACK_WINDOW_LEN: usize = 6; |
There was a problem hiding this comment.
This could use a comment. I assume you had a duration in mind for that window and then divided by BROADCAST_INTERVAL_PERIOD to get to 6. What's that window duration?
There was a problem hiding this comment.
Adding. It was meant to be 30 seconds.
|
|
||
| struct WalCapacityTimeSeries { | ||
| wal_capacity: ByteSize, | ||
| readings: VecDeque<ByteSize>, |
There was a problem hiding this comment.
There's a better implementation of a timeserie based on a rotating time window in broadcast. This is a common pattern. So, move the og implementation in common. Abstractify enough so it can be used for both uses cases, import and use it here.
There was a problem hiding this comment.
LocalShardUpdate and BroadcastIngesterCapacity now both use this new RingBuffer, which is in quickwit-common.
| } | ||
|
|
||
| impl WalCapacityTimeSeries { | ||
| fn new(wal_capacity: ByteSize) -> Self { |
There was a problem hiding this comment.
mem or disk? the name should say it.
There was a problem hiding this comment.
Disk, modified.
| return None; | ||
| } | ||
| let oldest = if self.readings.len() > WAL_CAPACITY_LOOKBACK_WINDOW_LEN { | ||
| self.readings.pop_back().unwrap() |
There was a problem hiding this comment.
Use expect and state the invariant/conditation that allow you to call expect safely:
.expect("window should not be empty")
.expect("window should have more than 1 measurements")
There was a problem hiding this comment.
Noted, though this isn't relevant any longer with the RingBuffer changes.
| .weak_state | ||
| .upgrade() | ||
| .context("ingester state has been dropped")?; | ||
|
|
There was a problem hiding this comment.
Just lock the whole thing fully and make the code more readable.
| Ok(snapshot) => snapshot, | ||
| Err(error) => { | ||
| error!("failed to snapshot ingester state: {error}"); | ||
| return; |
There was a problem hiding this comment.
The WAL can take multiple BROADCAST_INTERVAL_PERIOD intervals to load. The task should not not stop when we're loading the WAL, only if the state is dropped.
There was a problem hiding this comment.
Updated to the following cases:
- State dropped: error, stop task
- Ingester not initialized: no-op
- Ingester ready: happy path
| let value = serde_json::to_string(affinity) | ||
| .expect("`IngesterAffinity` should be JSON serializable"); | ||
| self.cluster | ||
| .set_self_key_value(INGESTER_AFFINITY_PREFIX, value) |
There was a problem hiding this comment.
You can't broadcast that over a single key because the open shard counts can be very long.
-> one key per index/source
There was a problem hiding this comment.
(The value length is an issue because chitchat uses UDP and every update must fit in a single datagram (MTU))
There was a problem hiding this comment.
Made it similar to LocalShardsUpdate, one key per index/source.
| pub fn get_open_shard_counts(&self) -> Vec<(IndexUid, SourceId, usize)> { | ||
| self.shards | ||
| .values() | ||
| .filter(|shard| shard.is_open()) |
There was a problem hiding this comment.
| .filter(|shard| shard.is_open()) | |
| .filter(|shard| shard.is_advertisable && !shard.is_replica() && shard.is_open()) |
| self.buffer.last().copied() | ||
| } | ||
|
|
||
| pub fn oldest(&self) -> Option<T> { |
There was a problem hiding this comment.
| pub fn oldest(&self) -> Option<T> { | |
| pub fn front(&self) -> Option<T> { |
| } | ||
|
|
||
| impl<T: Copy + Default, const N: usize> RingBuffer<T, N> { | ||
| pub fn push(&mut self, value: T) { |
There was a problem hiding this comment.
| pub fn push(&mut self, value: T) { | |
| pub fn push_back(&mut self, value: T) { |
Let's just copy (half of) the VecDeque API.
| /// Elements are stored in a flat array of size `N` and rotated on each push. | ||
| /// The newest element is always at position `N - 1` (the last slot), and the | ||
| /// oldest is at position `N - len`. | ||
| pub struct RingBuffer<T: Copy + Default, const N: usize> { |
| readings: RingBuffer<ByteSize, WAL_CAPACITY_READINGS_LEN>, | ||
| } | ||
|
|
||
| impl WalDiskCapacityTimeSeries { |
There was a problem hiding this comment.
I thought we discussed using memory?
There was a problem hiding this comment.
Yeah, now that I realize they're capped in the chart, I think they're functionally the same, but memory feels like a cleaner number to read. So I switched it to memory.
| /// Elements are stored in a flat array of size `N` and rotated on each push. | ||
| /// The newest element is always at position `N - 1` (the last slot), and the | ||
| /// oldest is at position `N - len`. | ||
| pub struct RingBuffer<T: Copy + Default, const N: usize> { |
There was a problem hiding this comment.
Claude can easily make push O(1), right?
There was a problem hiding this comment.
Yes it can :)
| pub const INGESTER_PRIMARY_SHARDS_PREFIX: &str = "ingester.primary_shards:"; | ||
|
|
||
| /// Prefix used in chitchat to broadcast per-source ingester capacity scores and open shard counts. | ||
| pub const INGESTER_CAPACITY_PREFIX: &str = "ingester.capacity:"; |
There was a problem hiding this comment.
| pub const INGESTER_CAPACITY_PREFIX: &str = "ingester.capacity:"; | |
| pub const INGESTER_CAPACITY_SCORE_PREFIX: &str = "ingester.capacity_score:"; |
| @@ -0,0 +1,457 @@ | |||
| // Copyright 2021-Present Datadog, Inc. | |||
There was a problem hiding this comment.
Let's use capacity_score everywhere.
| /// Takes a snapshot of the primary shards hosted by the ingester at regular intervals and | ||
| /// broadcasts it to other nodes via Chitchat. | ||
| pub(super) struct BroadcastLocalShardsTask { | ||
| pub struct BroadcastLocalShardsTask { |
There was a problem hiding this comment.
| pub struct BroadcastLocalShardsTask { | |
| pub(crate) struct BroadcastLocalShardsTask { |
76cfc84
into
nadav/feature-node-based-routing
* Implement IngesterCapacityScore broadcast (#6152) * Implement node based routing table (#6159) * Use new node based routing table for routing decisions (#6163) * Piggyback routing update on persist response (#6173) * Remove unused shard_ids in persist protos (#6169) * Add availability zone awareness to node based routing (#6189) * Remove old routing table; Take both disk and memory WAL readings (#6193) * Add az-aware ingest attempts metric (#6194)
Background
Main idea: https://docs.google.com/document/d/1XUpBdMFnuX8d23erK-XwQkomRgbeRTJ0TJtve7RGW3k/edit?tab=t.0.
All work on this feature will be merged PR by PR into the base branch nadav/feature-node-based-routing, which will then eventually be merged into main once it's fully ready.
PR Description
Creates a new broadcast to prepare for node based routing. The idea is described more in depth in
The primary thinking here is:
Ingesters will move away from keeping shard level data, and instead keep this node level data for routing requests. Routing tables will move to be node based and use the data from these broadcasts to update their routing tables.