Add performance reporting feature. #75
Merged
This PR introduces a detailed performance reporting feature to cuDecomp, enabled with a new environment variable, CUDECOMP_ENABLE_PERFORMANCE_REPORT. The purpose of this feature is to let users extract fine-grained performance information from the transpose and halo operations invoked in their programs.

When this feature is enabled, cuDecomp captures timing information inside all calls to cudecompTranspose* and cudecompUpdateHalo*. The timing information captured is the time spent in communication (alltoall, sendrecv) and the time spent in local operations (including overheads). The number of samples stored is configurable via CUDECOMP_PERFORMANCE_REPORT_NSAMPLES and defaults to 20 samples per transpose/halo configuration. A performance report is printed to the terminal upon grid descriptor destruction. See the updated documentation for more configuration variables.

By default, only an aggregated performance report is printed. Here is an example of what that report looks like:
The timings and bandwidths reported in this table are averaged across all samples and all ranks.
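As a rough illustration of what "averaged across all samples and all ranks" can mean, the sketch below pools every stored sample from every rank and takes a single mean. This is an assumption made for illustration only: the function name `aggregate_time_ms` and the input layout are hypothetical, and cuDecomp's actual reduction may be implemented differently.

```python
from statistics import fmean


def aggregate_time_ms(samples_by_rank):
    """Pooled mean over all samples and ranks (illustrative only).

    samples_by_rank: one list per rank, each holding that rank's
    per-sample timings in milliseconds. This layout is hypothetical
    and is not cuDecomp's internal representation.
    """
    pooled = [t for rank_samples in samples_by_rank for t in rank_samples]
    if not pooled:
        raise ValueError("no samples to aggregate")
    return fmean(pooled)


# Two ranks with two samples each.
mean_ms = aggregate_time_ms([[1.0, 3.0], [2.0, 4.0]])
```

When every rank stores the same number of samples, this pooled mean equals the mean of the per-rank means.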
To retrieve more verbose per-sample output, the environment variable CUDECOMP_PERFORMANCE_REPORT_DETAIL can be used. Setting this variable to 1 will print per-sample output for rank 0 only, while 2 will print per-sample output for all ranks. For example, running with CUDECOMP_PERFORMANCE_REPORT_DETAIL=2 will generate output like:

Finally, users can set CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR to write this data to CSV files. Please refer to the documentation for details on this setting. As an example, a CSV file corresponding to the aggregated transpose performance data looks like:

Reasonable effort has been made to ensure these timings are accurate, but users are still encouraged to use profiling tools like Nsight Systems if a more detailed understanding of performance data is required.
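The environment variables described in this PR can be collected in one place before launching an application. The following is a minimal sketch, not part of cuDecomp: the helper `perf_report_env` is hypothetical, and it assumes that setting CUDECOMP_ENABLE_PERFORMANCE_REPORT to 1 enables the feature and that a detail level of 0 corresponds to the default aggregated-only report. Check the cuDecomp documentation for the accepted values.

```python
import os


def perf_report_env(nsamples=20, detail=0, write_dir=None, base_env=None):
    """Return a copy of the environment with the reporting variables set.

    Hypothetical helper for illustration; only the variable names come
    from the PR description.
    """
    if detail not in (0, 1, 2):
        raise ValueError("detail must be 0, 1, or 2")
    if nsamples < 1:
        raise ValueError("nsamples must be positive")

    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the value "1" turns the feature on.
    env["CUDECOMP_ENABLE_PERFORMANCE_REPORT"] = "1"
    # Samples stored per transpose/halo configuration (20 by default).
    env["CUDECOMP_PERFORMANCE_REPORT_NSAMPLES"] = str(nsamples)
    # 1: per-sample output for rank 0 only; 2: per-sample output for all ranks.
    # Assumption: 0 means the aggregated report only.
    env["CUDECOMP_PERFORMANCE_REPORT_DETAIL"] = str(detail)
    if write_dir is not None:
        # Directory in which the CSV files are written.
        env["CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR"] = str(write_dir)
    return env


env = perf_report_env(detail=2, write_dir="perf_reports", base_env={})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the program, for example through an MPI launcher. The launcher command and application name are left to the user.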