-
Notifications
You must be signed in to change notification settings - Fork 3.8k
[PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA #7983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
leandron
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good improvement so that we can get better data for performance analysis.
As this is a new dependency, I looked into their website, and it seems they license the project on BSD terms, https://bitbucket.org/icl/papi/src/master/LICENSE.txt.
Based on that, I think this PR should also add a copy of this license under licenses/ (https://github.com/apache/tvm/tree/main/licenses)? cc @tqchen @tmoreau89
| set(ENV{PKG_CONFIG_PATH} "${USE_PAPI}:$ENV{PKG_CONFIG_PATH}") | ||
| pkg_check_modules(PAPI REQUIRED IMPORTED_TARGET papi>=6.0) | ||
| list(APPEND TVM_RUNTIME_LINKER_LIBS PkgConfig::PAPI) | ||
| list(APPEND RUNTIME_SRCS src/runtime/contrib/papi/papi.cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know whether this works as expected for cross-compilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this works with cross compiling. PAPI does support cross compilation though. I think you'd have to set PKG_CONFIG_PATH or USE_PAPI to point to the cross compiled library.
Can you also document the use of |
|
licenses will be needed if we bundle PAI's source code in our source distribution (in the case when it is included as a submodule). If it is a system level dep, we do not need to add licenses to the licenses folder |
|
@leandron @tkonolige I would suggest two pieces of documentation. The first would be a description of all of the variables available to configure Papi integration. The second would be a "how-to" guide that would walk the user through installing with Papi and running a basic instrumentation. The combination of these two should help users who are interested be successful with Papi. |
include/tvm/runtime/profiling.h
Outdated
| Timer timer; | ||
| /*! Extra performance metrics */ | ||
| std::unordered_map<std::string, ObjectRef> extra_metrics; | ||
| /*! User defined metric collectors */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason this isn't a map? could you explain in a comment? what's the second item in the pair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never do a lookup by key; I just iterate over all the pairs. I could switch it to a map if you like.
src/runtime/profiling.cc
Outdated
|
|
||
| return profiling::Report(rows, device_metrics); | ||
| // the last couple of call frames are the overall times | ||
| double overall_time = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest unit suffix: overall_time_us
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still suggest this
| metrics["Argument Shapes"] = profiling::ShapeString(shapes); | ||
|
|
||
| prof_.StartCall(packed_index_map_[packed_index], dev, metrics); | ||
| prof_.operator*().StartCall(packed_index_map_[packed_index], dev, metrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does -> not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope :( -> is const only I think.
| int event_set = it->second; | ||
| std::vector<long_long> values(papi_metrics[dev].size()); | ||
| PAPI_CALL(PAPI_read(event_set, values.data())); | ||
| return ObjectRef(make_object<PAPIEventSetNode>(values, dev)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you give an example of nesting? I just see
StartCall()
// run kernel
StopCall()
my concern is that putting a bunch of malloc calls in between time points may inject a bunch of noise in the measurements
|
@areusch and I chatted a little offline about how to handle mallocs in profiling code. Right now there are a couple of places that do allocation and they would be pretty difficult to remove. Also, as long as there are not any nested |
8a6cba8 to
df95b27
Compare
| return ret.strip(",").split(",") if ret else [] | ||
|
|
||
| def profile(self, **input_dict): | ||
| def profile(self, collectors=[], **input_dict): # pylint: disable=dangerous-default-value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you have a # pylint: disable=dangerous-default-value, after setting the collectors=[].
Just wanted to double check whether this is intended, as collectors=[] behaves like a global, and there will be only one list that will be shared across all the calls to profile() in which collectors is not provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah don't use a list as a kwarg default. Instead, use None, and say collectors = collectors if collectors is not None else [] in function body
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list is never modified, so it should be fine.
python/tvm/runtime/profiler_vm.py
Outdated
| return "" | ||
|
|
||
| def profile(self, *args, func_name="main", **kwargs): | ||
| def profile(self, *args, func_name="main", collectors=[], **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting pylint didn't complain about this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fix this one too?
| primitive_names[packed_index] = it.first; | ||
| } | ||
| strm->Write(primitive_names); | ||
| // TODO(tkonolige): cannot serialize ObjectRefs with dmlc's serializer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is not needed, maybe can be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is needed, saving and loading an executable will not preserve the extra op attributes.
…and CUDA This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) in order to collect hardware performance counters on CPU and CUDA. These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.
Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
| return ret.strip(",").split(",") if ret else [] | ||
|
|
||
| def profile(self, **input_dict): | ||
| def profile(self, collectors=[], **input_dict): # pylint: disable=dangerous-default-value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah don't use a list as a kwarg default. Instead, use None, and say collectors = collectors if collectors is not None else [] in function body
| // Stop all global timers. We wait to synchronize until we are making the report. | ||
| for (auto p : global_timers_) { | ||
| p.second->Stop(); | ||
| is_running_ = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any thoughts on how errors should get handled here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what errors would occur here?
src/runtime/profiling.cc
Outdated
|
|
||
| return profiling::Report(rows, device_metrics); | ||
| // the last couple of call frames are the overall times | ||
| double overall_time = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still suggest this
|
@areusch ping |
areusch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple more questions, i think this looks fine otherwise
python/tvm/runtime/profiler_vm.py
Outdated
| return "" | ||
|
|
||
| def profile(self, *args, func_name="main", **kwargs): | ||
| def profile(self, *args, func_name="main", collectors=[], **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fix this one too?
|
@leandron Does this PR look good to you or would you like some more changes? |
src/runtime/thread_pool.cc
Outdated
| queues_.clear(); | ||
| threads_.reset(); | ||
|
|
||
| void Init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait but this should be private, no?
include/tvm/runtime/contrib/papi.h
Outdated
| * | ||
| * PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/. | ||
| */ | ||
| struct PAPIMetricCollectorNode final : public MetricCollectorNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is a implementation of an existing interface, we usually prefer hide as much as possible. As a result, let us to start as private header(move the def to cc file or the header file to src)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users need to construct this to pass it to the profiler, so it needs to be in a public header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the profiler itself is a public interface, what we could do instead is to only expose a factory function that returns the instance, but not the actual data structure itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved all definitions except a factory function into the .cc file.
| Timer DefaultTimer(Device dev); | ||
|
|
||
| namespace profiling { | ||
| /*! \brief Wrapper for `Device` because `Device` is not passable across the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Device cannot be put as part of object container. How about DeviceObject?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeviceObjectNode sounds a little weird. I think wrapper clearly explains what this is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not strongly attached to the choice, given it is in the profiling namespace, i will let you decide then
| std::unordered_map<std::string, ObjectRef> extra_metrics; | ||
| /*! User defined metric collectors. Each pair is the MetricCollector and its | ||
| * associated data (returned from MetricCollector.Start). | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR but just in general about hiding. In this case, we could use https://en.cppreference.com/w/cpp/language/pimpl to hide the CallFrame def, recommend to come with a followup refactor
…and CUDA (apache#7983) * [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) in order to collect hardware performance counters on CPU and CUDA. These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics. * Update CMakeLists.txt Co-authored-by: Leandro Nunes <leandro.nunes@arm.com> * move thread pool reset out of crt * add docs * comments * formatting * forgot one doc * kDLGPU -> kDLCUDA * Refactor API to more closely match pass instrument's. * forgot files * formatting * more lint * fix docs * optional loading of papi metric collector in python * more formatting * fix check * update docs and default value * formatting * addressing andrews comments * fix docs * address comments * move shared initialization code into private function * move most definitions from papi header to implementation file Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
…and CUDA (apache#7983) * [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) in order to collect hardware performance counters on CPU and CUDA. These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics. * Update CMakeLists.txt Co-authored-by: Leandro Nunes <leandro.nunes@arm.com> * move thread pool reset out of crt * add docs * comments * formatting * forgot one doc * kDLGPU -> kDLCUDA * Refactor API to more closely match pass instrument's. * forgot files * formatting * more lint * fix docs * optional loading of papi metric collector in python * more formatting * fix check * update docs and default value * formatting * addressing andrews comments * fix docs * address comments * move shared initialization code into private function * move most definitions from papi header to implementation file Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) in order to collect hardware performance counters on CPU and CUDA. These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.
@leandron @areusch @junrushao1994 @icemelon9