optimize unordered_map in profiler #166
|
Looks good, thanks! |
pranavsharma
left a comment
Why even bother calling the EndTime... method if the profiler is disabled? This way you can tackle the problem at the source.
|
It's checked inside EndTime: if the profiler is disabled, it returns immediately. We could move the check outside EndTime, but that looks less nice. |
|
As it is, EndTime still constructs a throwaway string object (node_name + "_kernel_time") that is wasteful. We also hit this in a scenario in WinML where these calls were costing us. It would be nice to not have any heap allocations due to EndTime. |
|
@tracysh exactly. std::string has a small-string optimization, so if the string is short, likely no allocation happens, although constructing the std::string is still a waste. At least I didn't see allocations from std::string in my CPU profiles. std::unordered_map is an allocation beast. Actually, I'm not even sure why EventRecord (EndTime calls it internally) accepts an unordered_map; it could likely take a std::initializer_list instead. |
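To illustrate the std::initializer_list idea, here is a minimal sketch. The names RecordEventMap, RecordEventList and EventArg are hypothetical stand-ins, not the real EventRecord signature, and the sketch assumes the arguments are string literals so they can be passed as `const char*` pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical argument type: a pair of pointers to string literals.
using EventArg = std::pair<const char*, const char*>;

// Map-based signature (hypothetical): the caller has to build a node-based
// container, which typically heap-allocates the bucket array and one node
// per entry before the callee even runs.
std::size_t RecordEventMap(
    const std::unordered_map<std::string, std::string>& args) {
  return args.size();
}

// initializer_list-based signature (hypothetical): the braced list is backed
// by a temporary array of pairs, so passing literals needs no heap allocation.
std::size_t RecordEventList(std::initializer_list<EventArg> args) {
  for (const EventArg& arg : args) {
    assert(arg.first != nullptr && arg.second != nullptr);
  }
  return args.size();
}

// Both call sites look the same; only the parameter type differs.
std::size_t RecordTwoArgsWithMap() {
  return RecordEventMap({{"op_name", "Conv"}, {"provider", "CPU"}});
}

std::size_t RecordTwoArgsWithList() {
  return RecordEventList({{"op_name", "Conv"}, {"provider", "CPU"}});
}
```

If the callee needs to keep the arguments after the call returns, it would still have to copy them somewhere, so this only helps when the disabled path can return before doing that.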
|
I just stepped through my winml binary and I see these string calls hit the heap. I'm all for changing the map code too as it reduces the size of the executor function, but at some point, I also want to see these strings not constructed. |
|
@tracysh I'm actually hunting down all the dynamic allocations. operator new takes ~9% in my profile and end-2-end latency is very bad. This is just one of many. There are more severe allocations in other places: make_unique, TensorShape itself being a vector, etc. I'm trying to see what I can do about these. Will loop you in. |
As you can see, it's too late to check inside EndTime. The API might require some rework for sure. I think for now we should avoid invoking it in the first place if the profiler is disabled. |
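A minimal sketch of why the check inside EndTime comes too late: C++ evaluates all function arguments before entering the callee, so the string and the map already exist when the early return runs. The Profiler class below is a simplified, hypothetical stand-in, and BuildArgs plus the g_builds counter exist only to make the argument construction observable.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, simplified profiler used only to illustrate evaluation order.
class Profiler {
 public:
  explicit Profiler(bool enabled) : enabled_(enabled) {}
  bool IsEnabled() const { return enabled_; }

  // The enabled check lives inside the callee.
  void EndTime(const std::string& event_name,
               const std::unordered_map<std::string, std::string>& args) {
    if (!enabled_) return;  // Too late: both arguments were already built.
    assert(!event_name.empty());
    (void)args;
    ++recorded_;
  }

  int recorded() const { return recorded_; }

 private:
  bool enabled_;
  int recorded_ = 0;
};

// Counts how many times the expensive argument map was constructed.
int g_builds = 0;

std::unordered_map<std::string, std::string> BuildArgs(const std::string& op) {
  ++g_builds;
  return {{"op_name", op}};
}

// Check only inside the callee: the arguments are always constructed.
void RunUnguarded(Profiler& profiler, const std::string& node_name) {
  profiler.EndTime(node_name + "_kernel_time", BuildArgs("Conv"));
}

// Check at the call site: nothing is constructed when profiling is off.
void RunGuarded(Profiler& profiler, const std::string& node_name) {
  if (profiler.IsEnabled()) {
    profiler.EndTime(node_name + "_kernel_time", BuildArgs("Conv"));
  }
}
```

With a disabled profiler, RunUnguarded still increments g_builds while RunGuarded leaves it untouched, which is the difference being argued for here.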
|
@pranavsharma Actually, I thought about reworking the API. There are two options: 1) Expose IsEnabled() and check whether the profiler is enabled before calling EndTime(). This is less nice, as you have to do it before every profiler API call. 2) Create a macro that wraps IsEnabled() and the profiler API call. I'm not a big fan of abusing macros. Which option do you prefer, or do you have a better idea? |
The macro option is fine. You can do it in a separate PR since this one is already merged. |
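One possible shape for the macro option, as a rough sketch rather than the follow-up PR's actual code. The macro name PROFILER_END_TIME, the simplified Profiler class, and the MakeArgs helper are all hypothetical. Because the macro pastes the arguments inside the `if`, they are evaluated only when IsEnabled() returns true.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, simplified profiler; only the two members the macro needs.
class Profiler {
 public:
  explicit Profiler(bool enabled) : enabled_(enabled) {}
  bool IsEnabled() const { return enabled_; }

  void EndTimeAndRecordEvent(
      const std::string& event_name,
      const std::unordered_map<std::string, std::string>& args) {
    assert(!event_name.empty());
    (void)args;
    ++recorded_;
  }

  int recorded() const { return recorded_; }

 private:
  bool enabled_;
  int recorded_ = 0;
};

// Hypothetical macro. It is variadic so that arguments containing commas
// (for example braced initializer lists) are forwarded intact. The
// do/while(0) wrapper makes it behave like a single statement.
#define PROFILER_END_TIME(profiler, ...)              \
  do {                                                \
    if ((profiler).IsEnabled()) {                     \
      (profiler).EndTimeAndRecordEvent(__VA_ARGS__);  \
    }                                                 \
  } while (0)

// Counts argument constructions so the lazy evaluation is observable.
int g_arg_builds = 0;

std::unordered_map<std::string, std::string> MakeArgs() {
  ++g_arg_builds;
  return {{"op_name", "Conv"}};
}

void RunNode(Profiler& profiler, const std::string& node_name) {
  // Neither the concatenated string nor the map is built when disabled.
  PROFILER_END_TIME(profiler, node_name + "_kernel_time", MakeArgs());
}
```

The trade-off is the usual one with macros: the call site stays a single line, but the first argument must be an expression that is safe to evaluate for the IsEnabled() check.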
Arguments of Profiler::EndTimeAndRecordEvent are constructed even when the profiler is disabled. The unordered_map construction is very expensive in terms of both CPU and latency (dynamic allocation, then memory contention). It costs 6.1% CPU in the test. After this optimization, end-2-end average latency dropped from 686us to 608us (11.4%), and 1-minute CPU samples dropped from 19K to 13K (31.6%). operator new inc% dropped from 8.9% to 4.6%. (Absolute values vary from test to test.)
The test was performed at 1000 QPS with 10 threads on a production V15 machine.