Improve simplebridge data plane performance #56
sebymiano merged 2 commits into polycube-network:master
Conversation
Thanks @sebymiano for this interesting PR. It is interesting to notice that in the previous version we were using the [...].

I still have a doubt about possible race conditions: in the previous approach there were no race conditions because the [...].

Besides that, I think a 64-bit counter is not needed; a 32-bit counter will last for ~100 years, and the whole value structure could fit in the same cache line.

Another question: should we move some of this logic to polycubed, with a single thread that updates a map shared by all cubes? I don't want to end up with a solution that has N different threads only for updating a counter.
```c
} __attribute__((packed));

BPF_TABLE("hash", __be64, struct fwd_entry, fwdtable, 1024);
BPF_TABLE_SHARED("percpu_array", int, uint64_t, timestamp, 1);
```
Does this table need to be SHARED?
Could it be just a standard array?
We are not updating it from the datapath, so there should be no advantage in using a percpu map.
What if a control-plane thread updates the timestamp every second?
During the update the lock will be per-CPU, not global.
But it is also true that the control-plane thread updates the values for all CPUs at the same time, locking them all.
It could also be a matter of cache efficiency if we use a plain array instead of a percpu array.
I think it would be interesting to test whether there is a performance difference between the percpu and non-percpu versions.
In the particular case of arrays there are no locks on the update.
AFAIK percpu maps matter when the datapath updates elements, for example a counter: each CPU has its own copy and can update it without atomic or synchronized operations. When the map is only read by the datapath there is no difference.
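For reference, the two alternatives under discussion would differ only in the map type string of the declaration (a sketch in the same BCC syntax as the snippet above; the plain-array variant is the reviewer's suggestion, not code from the PR):

```c
/* percpu version (as in the PR): */
BPF_TABLE_SHARED("percpu_array", int, uint64_t, timestamp, 1);

/* plain-array alternative suggested in the review; fine when the
 * datapath only reads and a control-plane thread is the sole writer: */
BPF_TABLE_SHARED("array", int, uint64_t, timestamp, 1);
```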
mbertrone
left a comment
Thanks Sebastiano for this PR!
It is interesting to notice the cost of a hash table update, especially on multicore architectures, where most of the time is spent in a stale state waiting for the lock on the hashtable bucket to be released.
I'll try to address some comments from Mauricio.
- eBPF has a 64-bit architecture, so `u64` updates are atomic.
- Agreed on having a `pcn` function to get such a timestamp.
Force-pushed from ee3be91 to aa76132
Force-pushed from 71ddab3 to d951865
This commit removes the bpf_ktime() helper from the Simplebridge, which was used to remove old entries from the filtering database. This helper introduces a performance degradation that can be avoided by using a custom thread that updates a percpu map with the timestamp, in the same way as is done in the Iptables service.

Signed-off-by: Sebastiano Miano <mianosebastiano@gmail.com>
This commit removes the bpf_map_update from the simplebridge standard path. In the learning phase, the filtering database is first checked with the src MAC address of the received packet; if the entry is not present, learning is performed, otherwise we just need to update the timestamp (or the src port). In the latter case, the value of the entry is updated directly through the pointer returned by the lookup, instead of calling the update helper. In fact, after some tests, I discovered that this function introduces a huge performance degradation, while updating the value directly does not impact the overall performance.

With this commit, the multicore throughput (during forwarding) of the bridge increased from 3.5 Mpps (with 64-byte packets) to ~11.5 Mpps.

Signed-off-by: Sebastiano Miano <mianosebastiano@gmail.com>
Force-pushed from d951865 to 0531903
@mbertrone @sebymiano Do we plan to hold this PR until the above are addressed?

I think we are ready to merge it, @mauriciovasquezbernal what do you think?
mauriciovasquezbernal
left a comment
LGTM.
We can merge it now and open an issue for implementing the time_get_sec() support directly in polycube in the future.
This PR introduces a huge performance increase in the Simplebridge service by removing some bottlenecks from the data path, such as the call to `bpf_ktime_get_ns` and the `bpf_map_update` used to update the timestamp. In fact, after some tests, I discovered that this function introduces a huge performance degradation (more evident in the multi-core tests due to the lock required by the update), while updating the value directly after the lookup does not impact the overall performance.

With this commit, the multicore throughput (during forwarding) of the bridge increased from 3.5 Mpps (with 64-byte packets) to ~11.5 Mpps (with XDP and the `redirect_map` changes proposed in PR #52).