
Implement IngesterAffinity broadcast#6152

Merged
nadav-govari merged 8 commits into nadav/feature-node-based-routing from nadav/node-affinity-broadcast on Feb 17, 2026

Conversation

nadav-govari (Collaborator) commented Feb 11, 2026

Background

Main idea: https://docs.google.com/document/d/1XUpBdMFnuX8d23erK-XwQkomRgbeRTJ0TJtve7RGW3k/edit?tab=t.0.

All work on this feature will be merged PR by PR into the base branch nadav/feature-node-based-routing, which will then eventually be merged into main once it's fully ready.

PR Description

Creates a new broadcast to prepare for node-based routing. The idea is described in more depth in the design document linked above.

The primary thinking here is:

  • Ingester affinity score for receiving new requests. This will be used in a weighted power-of-two-choices comparison against other nodes. The node with the higher affinity score wins and receives the request for persistence.
  • The number of open shards for the individual index can act as a tiebreaker.
    • This isn't perfect, but we can iterate on it.

Ingesters will move away from keeping shard-level data and will instead keep this node-level data for routing requests. Routing tables will become node-based and will be updated from the data in these broadcasts.
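The weighted power-of-two-choices comparison described above could be sketched roughly as follows. The `NodeCapacity` type, field names, and the exact tiebreak rule are illustrative assumptions, not the actual implementation:

```rust
// Hypothetical sketch of weighted power-of-two-choices routing: sample
// two candidate nodes, pick the one with the higher capacity score, and
// break ties with the lower open shard count for the target index.

#[derive(Clone, Copy)]
struct NodeCapacity {
    capacity_score: f64,
    open_shards_for_index: usize,
}

/// Returns the index (0 or 1) of the winning candidate.
fn pick_node(a: NodeCapacity, b: NodeCapacity) -> usize {
    if a.capacity_score != b.capacity_score {
        // Higher capacity score wins outright.
        if a.capacity_score > b.capacity_score { 0 } else { 1 }
    } else if a.open_shards_for_index <= b.open_shards_for_index {
        // Tiebreaker: fewer open shards for this index wins.
        0
    } else {
        1
    }
}
```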

@nadav-govari nadav-govari changed the base branch from main to nadav/feature-node-based-routing February 11, 2026 21:12
pub const INGESTER_PRIMARY_SHARDS_PREFIX: &str = "ingester.primary_shards:";

/// Key used in chitchat to broadcast the ingester affinity score and open shard counts.
pub const INGESTER_AFFINITY_PREFIX: &str = "ingester.affinity";
Member:
We already use the word affinity for searcher split affinity. I think we can find another acceptable name for this metric that we don't use already.

Collaborator Author:
Yep, how's ingester capacity? As in, literally the capacity of the ingester to ingest new requests.

Collaborator Author:
Renamed the task to BroadcastIngesterCapacity and all references from affinity to capacity.


pub type OpenShardCounts = Vec<(IndexUid, SourceId, usize)>;

const WAL_CAPACITY_LOOKBACK_WINDOW_LEN: usize = 6;
Member:
This could use a comment. I assume you had a duration in mind for that window and then divided by BROADCAST_INTERVAL_PERIOD to get to 6. What's that window duration?

Collaborator Author:
Adding. It was meant to be 30 seconds.
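The derivation (30-second window divided by the broadcast interval) could be documented directly in the constant. The interval value below is an assumption for illustration; the real BROADCAST_INTERVAL_PERIOD lives elsewhere in the crate:

```rust
use std::time::Duration;

// Assumed value for illustration; the real constant is defined
// elsewhere in the broadcast module.
const BROADCAST_INTERVAL_PERIOD: Duration = Duration::from_secs(5);

/// Duration of the lookback window for WAL capacity readings.
const WAL_CAPACITY_LOOKBACK_WINDOW: Duration = Duration::from_secs(30);

/// Number of readings kept: one per broadcast interval over the
/// lookback window, i.e. 30s / 5s = 6.
const WAL_CAPACITY_LOOKBACK_WINDOW_LEN: usize =
    (WAL_CAPACITY_LOOKBACK_WINDOW.as_secs() / BROADCAST_INTERVAL_PERIOD.as_secs()) as usize;
```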


struct WalCapacityTimeSeries {
wal_capacity: ByteSize,
readings: VecDeque<ByteSize>,
Member:
There's a better implementation of a time series based on a rotating time window in broadcast. This is a common pattern, so move the original implementation into common, abstract it enough that it can serve both use cases, then import and use it here.

Collaborator Author:
LocalShardUpdate and BroadcastIngesterCapacity now both use this new RingBuffer, which is in quickwit-common.

}

impl WalCapacityTimeSeries {
fn new(wal_capacity: ByteSize) -> Self {
Member:
mem or disk? the name should say it.

Collaborator Author:
Disk, modified.

return None;
}
let oldest = if self.readings.len() > WAL_CAPACITY_LOOKBACK_WINDOW_LEN {
self.readings.pop_back().unwrap()
Member:
Use expect and state the invariant/condition that allows you to call expect safely:
.expect("window should not be empty")
.expect("window should have more than 1 measurements")

Collaborator Author:
Noted, though this isn't relevant any longer with the RingBuffer changes.

.weak_state
.upgrade()
.context("ingester state has been dropped")?;

Member:
Just lock the whole thing fully and make the code more readable.

Collaborator Author:
Done.

Ok(snapshot) => snapshot,
Err(error) => {
error!("failed to snapshot ingester state: {error}");
return;
Member:
The WAL can take multiple BROADCAST_INTERVAL_PERIOD intervals to load. The task should not stop while we're loading the WAL, only if the state is dropped.

Collaborator Author:
Updated to the following cases:

  1. State dropped: error, stop task
  2. Ingester not initialized: no-op
  3. Ingester ready: happy path
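The three cases could look roughly like this. The `IngesterStatus` variants and outcome type here are assumptions for illustration; the real state and snapshot types differ:

```rust
// Illustrative sketch of the three-case handling in the broadcast task.

enum IngesterStatus {
    Initializing, // e.g. the WAL is still loading
    Ready,
}

#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    StopTask,
    Skip,
    Broadcast,
}

/// `None` models the weak state reference failing to upgrade.
fn handle_snapshot(state: Option<IngesterStatus>) -> SnapshotOutcome {
    match state {
        // 1. State dropped: log an error and stop the task.
        None => SnapshotOutcome::StopTask,
        // 2. Ingester not initialized yet: no-op, try again next interval.
        Some(IngesterStatus::Initializing) => SnapshotOutcome::Skip,
        // 3. Ingester ready: happy path, snapshot and broadcast.
        Some(IngesterStatus::Ready) => SnapshotOutcome::Broadcast,
    }
}
```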

let value = serde_json::to_string(affinity)
.expect("`IngesterAffinity` should be JSON serializable");
self.cluster
.set_self_key_value(INGESTER_AFFINITY_PREFIX, value)
Member:
You can't broadcast that over a single key because the open shard counts can be very long.
-> one key per index/source

Member:
(The value length is an issue because chitchat uses UDP and every update must fit in a single datagram (MTU))

Collaborator Author:
Made it similar to LocalShardsUpdate, one key per index/source.
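Since every chitchat update must fit in a single UDP datagram, one key per index/source keeps each value small. A hypothetical sketch of such a key scheme; the prefix and separator are illustrative, not the actual format:

```rust
// Hypothetical per-source key scheme: one chitchat key per
// (index, source) pair so each broadcast value stays well under the
// datagram size limit (MTU).
fn capacity_key(index_uid: &str, source_id: &str) -> String {
    format!("ingester.capacity_score:{index_uid}:{source_id}")
}
```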

pub fn get_open_shard_counts(&self) -> Vec<(IndexUid, SourceId, usize)> {
self.shards
.values()
.filter(|shard| shard.is_open())
Member:
Suggested change
.filter(|shard| shard.is_open())
.filter(|shard| shard.is_advertisable && !shard.is_replica() && shard.is_open())

Collaborator Author:
Took it.

self.buffer.last().copied()
}

pub fn oldest(&self) -> Option<T> {
Member:
Suggested change
pub fn oldest(&self) -> Option<T> {
pub fn front(&self) -> Option<T> {

Collaborator Author:
Done

}

impl<T: Copy + Default, const N: usize> RingBuffer<T, N> {
pub fn push(&mut self, value: T) {
Member:
Suggested change
pub fn push(&mut self, value: T) {
pub fn push_back(&mut self, value: T) {

Let's just copy (half of) the VecDeque API.

Collaborator Author:
Done

/// Elements are stored in a flat array of size `N` and rotated on each push.
/// The newest element is always at position `N - 1` (the last slot), and the
/// oldest is at position `N - len`.
pub struct RingBuffer<T: Copy + Default, const N: usize> {
Member:
Noice

readings: RingBuffer<ByteSize, WAL_CAPACITY_READINGS_LEN>,
}

impl WalDiskCapacityTimeSeries {
Member:
I thought we discussed using memory?

Collaborator Author:
Yeah, now that I realize they're capped in the chart, I think they're functionally the same, but memory feels like a cleaner number to read. So I switched it to memory.

/// Elements are stored in a flat array of size `N` and rotated on each push.
/// The newest element is always at position `N - 1` (the last slot), and the
/// oldest is at position `N - len`.
pub struct RingBuffer<T: Copy + Default, const N: usize> {
Member:
Claude can easily make push O(1), right?

Collaborator Author:
Yes it can :)
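A minimal O(1)-push variant tracks a head index instead of rotating the backing array on every push. This is a sketch mirroring half of the VecDeque API as discussed above; the actual quickwit-common implementation may differ:

```rust
// Fixed-capacity ring buffer with O(1) push_back. Instead of shifting
// all elements so the newest sits at N - 1, we keep the index of the
// oldest element and write new values at (head + len) % N.
pub struct RingBuffer<T: Copy + Default, const N: usize> {
    buffer: [T; N],
    head: usize, // index of the oldest element
    len: usize,
}

impl<T: Copy + Default, const N: usize> RingBuffer<T, N> {
    pub fn new() -> Self {
        Self { buffer: [T::default(); N], head: 0, len: 0 }
    }

    /// Appends a value, overwriting the oldest element when full. O(1).
    pub fn push_back(&mut self, value: T) {
        let tail = (self.head + self.len) % N;
        self.buffer[tail] = value;
        if self.len < N {
            self.len += 1;
        } else {
            // Buffer full: the slot we just overwrote was the old head.
            self.head = (self.head + 1) % N;
        }
    }

    /// Oldest element, mirroring `VecDeque::front`.
    pub fn front(&self) -> Option<T> {
        (self.len > 0).then(|| self.buffer[self.head])
    }

    /// Newest element, mirroring `VecDeque::back`.
    pub fn back(&self) -> Option<T> {
        (self.len > 0).then(|| self.buffer[(self.head + self.len - 1) % N])
    }
}
```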

pub const INGESTER_PRIMARY_SHARDS_PREFIX: &str = "ingester.primary_shards:";

/// Prefix used in chitchat to broadcast per-source ingester capacity scores and open shard counts.
pub const INGESTER_CAPACITY_PREFIX: &str = "ingester.capacity:";
Member:
Suggested change
pub const INGESTER_CAPACITY_PREFIX: &str = "ingester.capacity:";
pub const INGESTER_CAPACITY_SCORE_PREFIX: &str = "ingester.capacity_score:";

Collaborator Author:
Done

@@ -0,0 +1,457 @@
// Copyright 2021-Present Datadog, Inc.
Member:
Let's use capacity_score everywhere.

Collaborator Author:
Done

/// Takes a snapshot of the primary shards hosted by the ingester at regular intervals and
/// broadcasts it to other nodes via Chitchat.
pub(super) struct BroadcastLocalShardsTask {
pub struct BroadcastLocalShardsTask {
Member:
Suggested change
pub struct BroadcastLocalShardsTask {
pub(crate) struct BroadcastLocalShardsTask {

@nadav-govari nadav-govari merged commit 76cfc84 into nadav/feature-node-based-routing Feb 17, 2026
10 of 14 checks passed
@nadav-govari nadav-govari deleted the nadav/node-affinity-broadcast branch February 17, 2026 16:45
nadav-govari added a commit that referenced this pull request Mar 16, 2026
* Implement IngesterCapacityScore broadcast (#6152)

* Implement node based routing table (#6159)

* Use new node based routing table for routing decisions (#6163)

* Piggyback routing update on persist response (#6173)

* Remove unused shard_ids in persist protos (#6169)

* Add availability zone awareness to node based routing (#6189)

* Remove old routing table; Take both disk and memory WAL readings (#6193)

* Add az-aware ingest attempts metric (#6194)