This repository was archived by the owner on Jan 20, 2026. It is now read-only.
Is your feature request related to a problem? Please describe.
Partitioning a DS leads to more write-scaling; however, the write-scaling is bottlenecked by the amount of replication we do for a DS. With CNR, the best scaling factor we can get for a partitionable data-structure is min(cores per replica, partitions). Hence, if we vary the amount of replication dynamically (#52), we also need to be able to adjust the partitioning.
To clarify the formula: 4 sockets, 48 cores per socket:
with a replication factor of 4, it doesn't make sense to have more than 48 partitions
with a replication factor of 1, it makes sense to have 192 partitions
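The bound above can be written down directly; `write_scaling` is a hypothetical helper for illustration, not part of the library:

```rust
/// Best-case write-scaling factor for a partitionable CNR data structure:
/// bounded both by how many cores share one replica and by the number of
/// partitions (hypothetical helper, not a library API).
fn write_scaling(total_cores: usize, replicas: usize, partitions: usize) -> usize {
    let cores_per_replica = total_cores / replicas;
    cores_per_replica.min(partitions)
}
```

With the 4-socket, 48-cores-per-socket example: `write_scaling(192, 4, 512)` is 48, while `write_scaling(192, 1, 192)` is 192.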
Describe the solution you'd like
Dynamic Partitions
Some thoughts on dynamically varying the partitions in CNR data-structures
Background: A usize is nice because
We have a neutral element 0 (identity)
We can merge multiple usize together with +
Addition is associative, which (together with the identity) makes it a monoid, I guess
It’s also commutative (not sure yet if this matters)
We can split it with subtraction (or division) operations
Not sure which is better; division would allow us to go from 1 -> n partitions in one step
So let’s say we have Partitionable(usize) we can…
Make a new partition with Partitionable(0)
Merge two partitions (e.g., drop one) with Partitionable(a) + Partitionable(b) = Partitionable(a+b)
Split a partition into two, e.g., by subtraction: Partitionable(max(x - part, 0)) and Partitionable(part)
And we still have some meaningful things we can define on it:
Increment (on one partition)
Sum (as a scan over all partitions)
[TODO: Need to figure out exactly what properties we need from Monoid, Ring etc.]
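The bullet points above can be sketched as a small Rust type; all names here are illustrative, not from the library:

```rust
/// Sketch of a partitionable counter (assumed names, not library types).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partitionable(usize);

impl Partitionable {
    /// Neutral element: a fresh, empty partition.
    fn new() -> Self {
        Partitionable(0)
    }

    /// Merge two partitions (e.g., when dropping one): + is the monoid op.
    fn merge(self, other: Self) -> Self {
        Partitionable(self.0 + other.0)
    }

    /// Split off `part` via subtraction; saturating at zero keeps the
    /// remainder well-defined, matching Partitionable(max(x - part, 0)).
    fn split(self, part: usize) -> (Self, Self) {
        let taken = part.min(self.0);
        (Partitionable(self.0 - taken), Partitionable(taken))
    }

    /// Increment, applied to a single partition.
    fn increment(&mut self) {
        self.0 += 1;
    }
}

/// Global sum, computed as a scan over all partitions.
fn sum(parts: &[Partitionable]) -> usize {
    parts.iter().map(|p| p.0).sum()
}
```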
Background: OS page-tables
For a page-table in nrkernel, we partitioned statically; every PML4Entry (first level of the radix tree) was one partition.
For dynamic partitions, maybe the data-structure for every partition is a slice of [&PML4Entry(idx)] with indices going from 0..512
The regular NR case is just [PML4Entry; 512]
The original CNR case is 512 partitions of [PML4Entry; 1]
The new case is something in between these two (including these two cases)
We have a neutral element: []
We can make a new Partition by making a slice with no entries [] (neutral element)
The sum of all the entries in all slices will always be <= 512; we could potentially make more partitions by using lower levels of the tree, but there is always (even if it’s just theoretical) a limit
Not sure if no entries is better here as it’s technically the neutral element of the slice
We can add things together by merging the slices [a] + [b] = [a, b], [] + [a] = [a]
Unclear if we should require that the addition [a, b] + [c] only works if a < b < c and there is no “gap” (other PML4 entries between a and b or between b and c)?
This would probably simplify the hash-function which now needs to change dynamically (more on that later) but I don’t think it’s strictly necessary?
We also have associativity (not sure yet if this is a strict requirement for us)
Slice addition isn’t commutative (not sure if it would be nice to have this property)
We can split things up by “sub-slicing” [a,b] -> [a], [b] or [a] -> [a], []
We can define some operations on this
Map (return an error if we would need to map somewhere we don’t have a partition for in the slice – the hasher needs to ensure it routes to the correct partition, more on that below)
Unmap (same as above)
etc.
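A minimal sketch of this slice-of-entries partition, using plain indices in place of &PML4Entry references so it stays self-contained (all names are assumptions, not the nrkernel types):

```rust
/// Sketch of a dynamically partitioned page-table root. Each partition owns
/// a sorted set of PML4 slots (indices in 0..512).
#[derive(Debug, Clone, PartialEq)]
struct PtPartition(Vec<usize>);

impl PtPartition {
    /// Neutral element: a partition owning no entries ([]).
    fn empty() -> Self {
        PtPartition(Vec::new())
    }

    /// Merge: [a] + [b] = [a, b]; re-sorting keeps routing simple even if
    /// we don't require the gap-free a < b < c property.
    fn merge(mut self, mut other: Self) -> Self {
        self.0.append(&mut other.0);
        self.0.sort_unstable();
        self
    }

    /// Split by “sub-slicing”: [a, b] -> ([a], [b]); splitting at 0 gives
    /// ([], [a, b]), i.e., one side is the neutral element.
    fn split(mut self, at: usize) -> (Self, Self) {
        let right = self.0.split_off(at.min(self.0.len()));
        (self, PtPartition(right))
    }

    /// Map: error if the target slot isn't owned by this partition (the
    /// hasher must route such ops elsewhere).
    fn map(&self, pml4_idx: usize) -> Result<(), ()> {
        if self.0.binary_search(&pml4_idx).is_ok() {
            Ok(())
        } else {
            Err(())
        }
    }
}
```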
Some thoughts on implementing this
We need to be able to route the operations to the right partitions. Right now we have a static hash(op, &mut Vec<logIds>) function that the user provides for a data-structure.
In order for dynamic partitions to work, the hash function needs to be dynamically generated, which means the user should provide a higher-order function, e.g., something like this:
E.g., generate_hashfn(description of all partitions) -> fn hash(op, &mut Vec<LogIds>) { … } (TBD: not clear what "description of all partitions" looks like in code)
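A rough sketch of such a higher-order function, assuming the "description of all partitions" is simply a table mapping each PML4 index to its owning log (all names here are illustrative, not the library's API):

```rust
type LogId = usize;

/// Stand-in for a real operation; only carries the slot it targets.
struct Op {
    pml4_idx: usize,
}

/// Given the current partition layout, generate a hash function that
/// routes an op to the log(s) responsible for it. Re-running this after
/// adding/merging partitions yields the updated routing.
fn generate_hashfn(
    partition_of_idx: Vec<LogId>, // partition_of_idx[pml4_idx] = owning log
) -> impl Fn(&Op, &mut Vec<LogId>) {
    move |op: &Op, logs: &mut Vec<LogId>| {
        logs.clear();
        logs.push(partition_of_idx[op.pml4_idx]);
    }
}
```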
Adding a new partition:
We need to allocate a log for it
We need to register with the log on all replicas
We need to use the generate_hashfn to route operations to the new partition
If the partition is made from the neutral element (e.g., 0 in the usize case) and isn’t made by merging some other partitions, then it might be that we’re done here
Merging two partitions:
Let’s assume we only ever merge two partitions into one for simplicity
we need a quiescent state for (at least) one of the partitions
Make sure no new ops are incoming (e.g., take combiner lock of both replicas?)
Then drain the log; apply all outstanding ops on all replicas
What about scans? They might complicate things: it’s not clear we can just “drain” a log by applying all outstanding operations. A scan depends on having elements in multiple logs and one root log; a scan whose “root” is in another log depends on that entry still being there once the scan executes, but now we’re just about to throw this log away
Maybe we can keep the log around for a while until all scans have completed?
Not sure this is a good idea either; seems like a bad idea to have two logs for a single partition…
Now we call a merge function that takes the partition and merges it with the one that we haven’t drained
Now the log of the other partition can be dropped
Finally, we need to use generate_hashfn to route ops to the new merged partition
We could use a scan operation to merge two partitions (a scan op already gets a consistent view of n partitions)
One limitation in the current code: we don’t support scans over fewer than the maximum number of logs in an instance (a scan always affects all logs)
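The quiesce / drain / merge / drop-log sequence above, on a toy counter partition (ignoring scans, combiner locks, and replicas; all types are stand-ins for the real per-partition state and logs):

```rust
/// Stand-in for a per-partition log: pending increments not yet applied.
struct Log {
    outstanding: Vec<u64>,
}

/// Stand-in for a partition: its state plus its log.
struct Partition {
    state: u64,
    log: Log,
}

/// Drain the log: apply all outstanding ops before the log goes away.
fn drain(p: &mut Partition) {
    for op in p.log.outstanding.drain(..) {
        p.state += op;
    }
}

/// Merge `b` into `a`; `b` and its log are dropped at the end.
fn merge_partitions(a: &mut Partition, mut b: Partition) {
    // 1. Quiesce: no new ops are routed to `b` (assumed already done by
    //    swapping in a regenerated hash function).
    // 2. Drain b's log so no outstanding ops are lost.
    drain(&mut b);
    // 3. Merge the drained state into the surviving partition.
    a.state += b.state;
    // 4. `b` (and its log) goes out of scope and is dropped here.
}
```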
Splitting a partition into two
TBD
Interaction of replicas and partitions
This brings additional complexity due to NUMA affinity.
E.g., imagine having the page-table with 192 partitions (on the 4-socket machine) and one replica
This has maximum write scalability
Eventually we want to have read scaling and add 4 replicas
Now when we create a copy we need to be smart about cloning: e.g., track the NUMA affinity of every partition and sort so that each replica keeps the partitions that are already NUMA-local and copies the other partitions that aren’t
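The NUMA-aware cloning idea could look roughly like this: for each replica (one per node), split the partitions into those that are already node-local (no copy needed) and those that must be cloned over. The affinity representation and function name are assumptions:

```rust
/// For each replica (one per NUMA node 0..nodes), return the partitions
/// that are already node-local vs. those that must be cloned across nodes.
fn clone_plan(
    partition_nodes: &[usize], // partition_nodes[p] = NUMA node of partition p
    nodes: usize,
) -> Vec<(Vec<usize>, Vec<usize>)> {
    (0..nodes)
        .map(|node| {
            // Local partitions need no copy; remote ones get cloned.
            let (local, remote): (Vec<usize>, Vec<usize>) =
                (0..partition_nodes.len()).partition(|&p| partition_nodes[p] == node);
            (local, remote)
        })
        .collect()
}
```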
Policies for adding/removing replicas and partitions
We need an interface so the client using the library can implement the policy when to change partitions (and replicas fwiw)
Example policies:
Page-table
read:write ratio (PMUs give approximate reads, mmap calls give writes)
Which cores the process runs on is also known
Let the application decide
FS
Look at read:write ratio
Here it makes less sense to have the application decide
Scheduler
Maybe change partitions based on HW ACPI core-hotplug events?
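One way to expose such policies is a trait the client implements; the trait, its inputs, and the example heuristic below are assumptions for illustration, not an existing API:

```rust
/// Client-provided policy deciding when/how much to re-partition
/// (a replica-count hook would look analogous).
trait ScalingPolicy {
    /// Desired number of partitions given observed read/write counts
    /// and the cores available.
    fn target_partitions(&self, reads: u64, writes: u64, cores: usize) -> usize;
}

/// Example policy: write-heavy workloads get one partition per core for
/// maximum write scaling; read-heavy ones get fewer partitions.
struct RatioPolicy;

impl ScalingPolicy for RatioPolicy {
    fn target_partitions(&self, reads: u64, writes: u64, cores: usize) -> usize {
        if writes >= reads {
            cores
        } else {
            (cores / 4).max(1)
        }
    }
}
```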
API pseudo-code: adding and merging partitions
TBD
Describe alternatives you've considered
Don't ever change partitions; set them to the number of cores in the system. What's the overhead? What are the downsides? Deadlocks(?); we might not know the number of cores in advance (hotplugging).