Skip to content

Conversation

@tkonolige
Copy link
Contributor

This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) in order to collect hardware performance counters on CPU and CUDA. These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.

@leandron @areusch @junrushao1994 @icemelon9

Copy link
Contributor

@leandron leandron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good improvement so that we can get better data for performance analysis.

As this is a new dependency, I looked into their website, and it seems they license the project on BSD terms, https://bitbucket.org/icl/papi/src/master/LICENSE.txt.

Based on that, I think this PR should also add a copy of this license under licenses/ (https://github.com/apache/tvm/tree/main/licenses)? cc @tqchen @tmoreau89

set(ENV{PKG_CONFIG_PATH} "${USE_PAPI}:$ENV{PKG_CONFIG_PATH}")
pkg_check_modules(PAPI REQUIRED IMPORTED_TARGET papi>=6.0)
list(APPEND TVM_RUNTIME_LINKER_LIBS PkgConfig::PAPI)
list(APPEND RUNTIME_SRCS src/runtime/contrib/papi/papi.cc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know whether this works as expected for cross-compilation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this works with cross compiling. PAPI does support cross compilation though. I think you'd have to set PKG_CONFIG_PATH or USE_PAPI to point to the cross compiled library.

@leandron
Copy link
Contributor

leandron commented May 6, 2021

These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.

Can you also document the use of $TVM_PAPI_${DEVICE}_METRICS somewhere? @hogepodge can probably advise here.

@tqchen
Copy link
Member

tqchen commented May 6, 2021

licenses will be needed if we bundle PAI's source code in our source distribution (in the case when it is included as a submodule). If it is a system level dep, we do not need to add licenses to the licenses folder

@hogepodge
Copy link
Contributor

@leandron @tkonolige I would suggest two pieces of documentation. The first would be a description of all of the variables available to configure Papi integration. The second would be a "how-to" guide that would walk the user through installing with Papi and running a basic instrumentation. The combination of these two should help users who are interested be successful with Papi.

Timer timer;
/*! Extra performance metrics */
std::unordered_map<std::string, ObjectRef> extra_metrics;
/*! User defined metric collectors */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason this isn't a map? could you explain in a comment? what's the second item in the pair?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never do a lookup by key; I just iterate over all the pairs. I could switch it to a map if you like.


return profiling::Report(rows, device_metrics);
// the last couple of call frames are the overall times
double overall_time = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest unit suffix: overall_time_us

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still suggest this

metrics["Argument Shapes"] = profiling::ShapeString(shapes);

prof_.StartCall(packed_index_map_[packed_index], dev, metrics);
prof_.operator*().StartCall(packed_index_map_[packed_index], dev, metrics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does -> not work?

Copy link
Contributor Author

@tkonolige tkonolige May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope :( -> is const only I think.

int event_set = it->second;
std::vector<long_long> values(papi_metrics[dev].size());
PAPI_CALL(PAPI_read(event_set, values.data()));
return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you give an example of nesting? I just see

StartCall()
// run kernel
StopCall()

my concern is that putting a bunch of malloc calls in between time points may inject a bunch of noise in the measurements

@tkonolige
Copy link
Contributor Author

@areusch and I chatted a little offline about how to handle mallocs in profiling code. Right now there are a couple of places that do allocation and they would be pretty difficult to remove. Also, as long as there are not any nested Start calls (besides the top level nesting), the overhead of malloc is counted overhead section of the full model execution and has no effect on the performance of each op invocation. We should move forward with this PR as it stands and I will think about ways of reducing the amount of allocation that we do.

@areusch areusch added the status: need update need update based on feedbacks label May 12, 2021
@tkonolige tkonolige force-pushed the profiler_papi branch 3 times, most recently from 8a6cba8 to df95b27 Compare June 10, 2021 21:16
@tkonolige
Copy link
Contributor Author

@leandron @tqchen @areusch Could you all re-review? I've pushed a change to the api to make it more closely match PassIntrument.

@tkonolige
Copy link
Contributor Author

@leandron @tqchen @areusch Can you review?

return ret.strip(",").split(",") if ret else []

def profile(self, **input_dict):
def profile(self, collectors=[], **input_dict): # pylint: disable=dangerous-default-value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you have a # pylint: disable=dangerous-default-value, after setting the collectors=[].

Just wanted to double check whether this is intended, as collectors=[] behaves like a global, and there will be only one list that will be shared across all the calls to profile() in which collectors is not provided?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah don't use a list as a kwarg default. Instead, use None, and say collectors = collectors if collectors is not None else [] in function body

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list is never modified, so it should be fine.

return ""

def profile(self, *args, func_name="main", **kwargs):
def profile(self, *args, func_name="main", collectors=[], **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting pylint didn't complain about this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix this one too?

primitive_names[packed_index] = it.first;
}
strm->Write(primitive_names);
// TODO(tkonolige): cannot serialize ObjectRefs with dmlc's serializer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not needed, maybe can be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed, saving and loading an executable will not preserve the extra op attributes.

tkonolige and others added 14 commits June 29, 2021 09:56
…and CUDA

This PR adds an optional dependency on PAPI
(https://bitbucket.org/icl/papi/) in order to collect hardware
performance counters on CPU and CUDA. These performance counters include
data like total cycles, instructions executed, and cache misses. Users
can control which performance counters are collected by setting the
TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated
list of metrics.
Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
return ret.strip(",").split(",") if ret else []

def profile(self, **input_dict):
def profile(self, collectors=[], **input_dict): # pylint: disable=dangerous-default-value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah don't use a list as a kwarg default. Instead, use None, and say collectors = collectors if collectors is not None else [] in function body

// Stop all global timers. We wait to synchronize until we are making the report.
for (auto p : global_timers_) {
p.second->Stop();
is_running_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thoughts on how errors should get handled here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what errors would occur here?


return profiling::Report(rows, device_metrics);
// the last couple of call frames are the overall times
double overall_time = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still suggest this

@tkonolige
Copy link
Contributor Author

@areusch ping

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple more questions, i think this looks fine otherwise

return ""

def profile(self, *args, func_name="main", **kwargs):
def profile(self, *args, func_name="main", collectors=[], **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix this one too?

@tkonolige
Copy link
Contributor Author

@leandron Does this PR look good to you or would you like some more changes?

queues_.clear();
threads_.reset();

void Init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait but this should be private, no?

@areusch
Copy link
Contributor

areusch commented Jul 7, 2021

@leandron @tqchen please take a look and explicitly approve

*
* PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/.
*/
struct PAPIMetricCollectorNode final : public MetricCollectorNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a implementation of an existing interface, we usually prefer hide as much as possible. As a result, let us to start as private header(move the def to cc file or the header file to src)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users need to construct this to pass it to the profiler, so it needs to be in a public header.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the profiler itself is a public interface, what we could do instead is to only expose a factory function that returns the instance, but not the actual data structure itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved all definitions except a factory function into the .cc file.

Timer DefaultTimer(Device dev);

namespace profiling {
/*! \brief Wrapper for `Device` because `Device` is not passable across the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Device cannot be put as part of object container. How about DeviceObject?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeviceObjectNode sounds a little weird. I think wrapper clearly explains what this is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not strongly attached to the choice, given it is in the profiling namespace, i will let you decide then

std::unordered_map<std::string, ObjectRef> extra_metrics;
/*! User defined metric collectors. Each pair is the MetricCollector and its
* associated data (returned from MetricCollector.Start).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR but just in general about hiding. In this case, we could use https://en.cppreference.com/w/cpp/language/pimpl to hide the CallFrame def, recommend to come with a followup refactor

@areusch areusch merged commit d67514b into apache:main Jul 13, 2021
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
…and CUDA (apache#7983)

* [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

This PR adds an optional dependency on PAPI
(https://bitbucket.org/icl/papi/) in order to collect hardware
performance counters on CPU and CUDA. These performance counters include
data like total cycles, instructions executed, and cache misses. Users
can control which performance counters are collected by setting the
TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated
list of metrics.

* Update CMakeLists.txt

Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>

* move thread pool reset out of crt

* add docs

* comments

* formatting

* forgot one doc

* kDLGPU -> kDLCUDA

* Refactor API to more closely match pass instrument's.

* forgot files

* formatting

* more lint

* fix docs

* optional loading of papi metric collector in python

* more formatting

* fix check

* update docs and default value

* formatting

* addressing andrews comments

* fix docs

* address comments

* move shared initialization code into private function

* move most definitions from papi header to implementation file

Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
…and CUDA (apache#7983)

* [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

This PR adds an optional dependency on PAPI
(https://bitbucket.org/icl/papi/) in order to collect hardware
performance counters on CPU and CUDA. These performance counters include
data like total cycles, instructions executed, and cache misses. Users
can control which performance counters are collected by setting the
TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated
list of metrics.

* Update CMakeLists.txt

Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>

* move thread pool reset out of crt

* add docs

* comments

* formatting

* forgot one doc

* kDLGPU -> kDLCUDA

* Refactor API to more closely match pass instrument's.

* forgot files

* formatting

* more lint

* fix docs

* optional loading of papi metric collector in python

* more formatting

* fix check

* update docs and default value

* formatting

* addressing andrews comments

* fix docs

* address comments

* move shared initialization code into private function

* move most definitions from papi header to implementation file

Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: need update need update based on feedbacks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants