
Improve simplebridge data plane performance #56

Merged
sebymiano merged 2 commits into polycube-network:master from sebymiano:pr/fix_simplebridge_performance
Feb 7, 2019

Conversation

@sebymiano
Collaborator

This PR introduces a large performance improvement in the Simplebridge service by removing some bottlenecks from the data path, namely the call to bpf_ktime_get_ns and the bpf_map_update helper used to refresh the timestamp.

In fact, after some tests, I discovered that the bpf_map_update call introduces a significant performance degradation (more evident in multi-core tests, due to the lock required by the update), while updating the value directly after the lookup does not affect the overall performance.
With this commit, the multi-core forwarding throughput of the bridge increased from 3.5 Mpps (with 64-byte packets) to ~11.5 Mpps (with XDP and the redirect_map changes proposed in PR #52).
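
For reference, a minimal sketch of the new learning path in the BCC-style eBPF C already used by the service; the fwd_entry field names and the helper name below are illustrative, not copied verbatim from the code:

struct fwd_entry {
  u32 port;
  u64 timestamp;
} __attribute__((packed));

BPF_TABLE("hash", __be64, struct fwd_entry, fwdtable, 1024);

static __always_inline void learn_src_mac(__be64 src_mac, u32 in_port, u64 now) {
  struct fwd_entry *entry = fwdtable.lookup(&src_mac);
  if (entry) {
    /* Known MAC: refresh port and timestamp in place through the pointer
     * returned by the lookup, avoiding the locked bpf_map_update path. */
    entry->port = in_port;
    entry->timestamp = now;
  } else {
    /* Unknown MAC: only here do we pay for a real map update. */
    struct fwd_entry new_entry = {.port = in_port, .timestamp = now};
    fwdtable.update(&src_mac, &new_entry);
  }
}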

@mauriciovasquezbernal
Contributor

mauriciovasquezbernal commented Feb 5, 2019

Thanks @sebymiano for this interesting PR.

It is interesting to notice that in the previous version we were using the bpf_map_update helper in all cases, even when an entry was already present for that MAC address. Hash maps are designed to be very fast on reads, since no locks are taken in that case, but writes do require locks; this is one of the reasons why performance was not scaling on multi-core systems.

I still have a doubt about possible race conditions. In the previous approach there were none, because bpf_map_update allocates a new node, copies the data into it, and then uses RCU to make it visible to the readers; in other words, concurrent readers never see a partially updated version of the map entry.
In this version the writer and the readers operate on the same element (the pointer returned by bpf_map_lookup). Is it possible that a reader sees a partially updated version of the timestamp? In other words, are 64-bit stores atomic in eBPF?

Besides that, I think a 64-bit counter is not needed: a 32-bit counter of seconds would last for ~100 years, and the whole value structure could then fit in a single cache line. Another question: should we move some of this logic to polycubed, i.e. have a single thread that updates a map shared by all cubes? I don't want to end up with N different threads whose only job is updating a counter.
Maybe we can provide a wrapper for this in polycube, something like pcn_get_timestamp().
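
As an illustration of that idea, a hypothetical pcn_get_timestamp() could reduce to a plain map read on the datapath side, with a single polycubed thread refreshing the entry periodically; the map and helper below are assumptions, not an existing polycube API:

BPF_TABLE("array", int, uint64_t, pcn_timestamp, 1);

static __always_inline uint64_t pcn_get_timestamp() {
  int zero = 0;
  /* The control plane refreshes this slot (e.g. once per second), so the
   * datapath never calls bpf_ktime_get_ns() nor takes a write lock. */
  uint64_t *ts = pcn_timestamp.lookup(&zero);
  return ts ? *ts : 0;
}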

} __attribute__((packed));

BPF_TABLE("hash", __be64, struct fwd_entry, fwdtable, 1024);
BPF_TABLE_SHARED("percpu_array", int, uint64_t, timestamp, 1);
Contributor

Does this table need to be SHARED?

Contributor

Could it be just a standard array?

We are not updating it from the datapath, so there should be no advantage in using a percpu map.

Contributor

What if a control plane thread updates the timestamp every second? During the update the lock would be per-CPU, not global.

On the other hand, it is also true that the control plane thread updates the values of all CPUs at the same time, locking them all. It could also be a matter of cache efficiency if we use a plain array instead of a percpu array.

I think it would be interesting to test whether there is a performance difference between the percpu and non-percpu versions.
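
Something like the following two declarations could be compared, reusing the BCC macros already shown in the diff (the map names are illustrative):

/* Per-CPU variant: every CPU holds its own copy; a control-plane update
 * writes one slot per CPU. */
BPF_TABLE_SHARED("percpu_array", int, uint64_t, timestamp_percpu, 1);

/* Plain-array variant: a single shared slot; since the datapath only
 * reads it, there is no write contention to avoid in the first place. */
BPF_TABLE("array", int, uint64_t, timestamp_plain, 1);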

Contributor

In the particular case of arrays there are no locks on the update.

AFAIK percpu maps matter when the datapath updates elements, for example a counter: each CPU has its own copy and can update it without atomic or synchronized operations. When the map is only read by the datapath there is no difference.

@mbertrone mbertrone left a comment
Contributor

Thanks Sebastiano for this PR!

It is interesting to notice the cost of a hash table update, especially on multi-core architectures, where most of the time is spent stalled waiting for the lock on the hash table bucket to be released.

I'll try to address some comments from Mauricio.

  • eBPF has a 64-bit architecture, so u64 updates are atomic.
  • Agree on having a pcn function to get such a timestamp.


@sebymiano sebymiano force-pushed the pr/fix_simplebridge_performance branch 2 times, most recently from ee3be91 to aa76132 on February 5, 2019 10:44
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the pr/fix_simplebridge_performance branch 4 times, most recently from 71ddab3 to d951865 on February 6, 2019 17:09
This commit removes the bpf_ktime() helper from the Simplebridge data
path, which was used to age out old entries from the filtering database.
This helper introduces a performance degradation that can be avoided by
using a custom thread that updates a percpu map with the current
timestamp, in the same way it is done in the Iptables service.

Signed-off-by: Sebastiano Miano <mianosebastiano@gmail.com>
This commit removes the bpf_map_update from the simplebridge standard
path.
In the learning phase, the filtering database is first checked with the
src MAC address of the received packet; if the entry is not present, the
learning is performed, otherwise we just need to update the timestamp (or
the src port). In the latter case, the value of the entry is updated
directly through the pointer returned by the lookup, instead of calling
the update helper.
After some tests, I discovered that the update helper introduces a
significant performance degradation, while updating the value directly
does not affect the overall performance.
With this commit, the multi-core forwarding throughput of the bridge
increased from 3.5 Mpps (with 64-byte packets) to ~11.5 Mpps.

Signed-off-by: Sebastiano Miano <mianosebastiano@gmail.com>
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the pr/fix_simplebridge_performance branch from d951865 to 0531903 on February 6, 2019 18:14
@acloudiator
Contributor

I'll try to address some comments from Mauricio.

  • eBPF has a 64-bit architecture, so u64 updates are atomic.
  • Agree on having a pcn function to get such a timestamp.

@mbertrone @sebymiano Do we plan to hold this PR until the above are addressed?

@sebymiano
Collaborator Author

I think we are ready to merge it, @mauriciovasquezbernal what do you think?

@mauriciovasquezbernal mauriciovasquezbernal left a comment
Contributor

LGTM.

We can merge it now and open an issue for implementing the time_get_sec() support directly in polycube in the future.
