Add performance reporting feature. #75

romerojosh · 2025-07-09T22:30:06Z

This PR introduces a detailed performance reporting feature to cuDecomp, enabled with a new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT. The purpose of this feature is to enable users to extract fine-grained performance information from transpose and halo operations invoked in their programs.

When this feature is enabled, cuDecomp will capture timing information inside all calls to cudecompTranspose* and cudecompUpdateHalo*. The timing information captured is the time spent in communication (alltoall, sendrecv) and time spent in local operations (including overheads). The number of samples stored is configurable via CUDECOMP_PERFORMANCE_REPORT_NSAMPLES and defaults to 20 samples per transpose/halo configuration. A performance report is printed to the terminal upon grid descriptor destruction. See the updated documentation for more configuration variables.
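As a rough illustration of the configuration described above, the sketch below builds an environment for launching a cuDecomp program with reporting turned on. The helper name `perf_report_env` is hypothetical, and the value `"1"` for CUDECOMP_ENABLE_PERFORMANCE_REPORT is an assumption, since this description names the variable but not the value it expects.

```python
import os


def perf_report_env(nsamples=20, detail=None, write_dir=None, base=None):
    """Return a copy of the environment with cuDecomp reporting variables set.

    Hypothetical helper for illustration; only the variable names come from
    this PR description.
    """
    env = dict(os.environ if base is None else base)
    # Assumption: "1" enables the feature (the PR text does not state the value).
    env["CUDECOMP_ENABLE_PERFORMANCE_REPORT"] = "1"
    # Number of samples kept per transpose/halo configuration (default 20).
    env["CUDECOMP_PERFORMANCE_REPORT_NSAMPLES"] = str(nsamples)
    # Only set the optional variables when the caller asks for them.
    if detail is not None:
        env["CUDECOMP_PERFORMANCE_REPORT_DETAIL"] = str(detail)
    if write_dir is not None:
        env["CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR"] = str(write_dir)
    return env
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when starting the application under an MPI launcher; the launcher command and application name depend on the local setup.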

By default, only an aggregated performance report is printed. Here is an example of what that report looks like:

CUDECOMP:
CUDECOMP: ===== Performance Summary =====
CUDECOMP: Grid Configuration:
CUDECOMP:       Transpose backend: MPI_P2P
CUDECOMP:       Halo backend: MPI
CUDECOMP:       Process grid: [2, 2]
CUDECOMP:       Global dimensions: [256, 256, 256]
CUDECOMP:       Memory order: [0,1,2]; [1,2,0]; [2,0,1]
CUDECOMP:
CUDECOMP: Transpose Performance Data:
CUDECOMP:
CUDECOMP: operation    dtype  halo extents    padding         inplace  managed  samples  total     A2A       local     A2A BW   
CUDECOMP:                                                                                [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: ------------------------------------------------------------------------------------------------------------------------
CUDECOMP: TransposeXY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.866     1.504     0.361     44.607   
CUDECOMP: TransposeYZ  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.867     1.505     0.361     44.580   
CUDECOMP: TransposeZY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.876     1.503     0.373     44.646   
CUDECOMP: TransposeYX  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.879     1.507     0.373     44.543   
CUDECOMP: ================================
CUDECOMP:

The timings and bandwidths reported in this table are averaged across samples and all ranks.

To retrieve more verbose per-sample output, the environment variable CUDECOMP_PERFORMANCE_REPORT_DETAIL can be used. Setting this variable to 1 will print per-sample output for rank 0 only, while setting it to 2 will print per-sample output for all ranks. For example, running with CUDECOMP_PERFORMANCE_REPORT_DETAIL=2 will generate output like:

CUDECOMP:
CUDECOMP: ===== Performance Summary =====
CUDECOMP: Grid Configuration:
CUDECOMP:       Transpose backend: MPI_P2P
CUDECOMP:       Halo backend: MPI
CUDECOMP:       Process grid: [2, 2]
CUDECOMP:       Global dimensions: [256, 256, 256]
CUDECOMP:       Memory order: [0,1,2]; [1,2,0]; [2,0,1]
CUDECOMP:
CUDECOMP: Transpose Performance Data:
CUDECOMP:
CUDECOMP: operation    dtype  halo extents    padding         inplace  managed  samples  total     A2A       local     A2A BW   
CUDECOMP:                                                                                [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: ------------------------------------------------------------------------------------------------------------------------
CUDECOMP: TransposeXY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.867     1.505     0.362     44.594   
CUDECOMP: TransposeYZ  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.874     1.513     0.361     44.359   
CUDECOMP: TransposeZY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.876     1.504     0.372     44.624   
CUDECOMP: TransposeYX  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.884     1.512     0.372     44.393   
CUDECOMP:
CUDECOMP: Per-Sample Details:
CUDECOMP:
CUDECOMP: TransposeXY (dtype=Z, halo extents=[0,0,0]/[0,0,0], padding=[0,0,0]/[0,0,0], inplace=T, managed=F) samples:
CUDECOMP: rank   sample       total     A2A       local     A2A BW   
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: 0      0            1.867     1.504     0.362     44.613   
CUDECOMP: 0      1            1.865     1.504     0.360     44.613   
CUDECOMP: 0      2            1.865     1.503     0.361     44.643   
CUDECOMP: 0      3            1.872     1.506     0.366     44.552   
CUDECOMP: 0      4            1.870     1.504     0.366     44.613   
CUDECOMP: 1      0            1.869     1.507     0.361     44.522   
CUDECOMP: 1      1            1.866     1.504     0.361     44.613   
CUDECOMP: 1      2            1.865     1.504     0.360     44.613   
CUDECOMP: 1      3            1.865     1.503     0.361     44.643   
CUDECOMP: 1      4            1.865     1.502     0.362     44.673   
CUDECOMP: 2      0            1.868     1.506     0.361     44.552   
CUDECOMP: 2      1            1.865     1.505     0.359     44.582   
CUDECOMP: 2      2            1.866     1.504     0.361     44.613   
CUDECOMP: 2      3            1.869     1.508     0.360     44.492   
CUDECOMP: 2      4            1.867     1.504     0.362     44.613   
CUDECOMP: 3      0            1.870     1.507     0.362     44.522   
CUDECOMP: 3      1            1.867     1.506     0.360     44.552   
CUDECOMP: 3      2            1.865     1.503     0.361     44.643   
CUDECOMP: 3      3            1.865     1.503     0.361     44.643   
CUDECOMP: 3      4            1.866     1.505     0.360     44.582   
CUDECOMP:
CUDECOMP: TransposeYZ (dtype=Z, halo extents=[0,0,0]/[0,0,0], padding=[0,0,0]/[0,0,0], inplace=T, managed=F) samples:
CUDECOMP: rank   sample       total     A2A       local     A2A BW   
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: 0      0            1.869     1.504     0.365     44.613   
CUDECOMP: 0      1            1.864     1.503     0.360     44.643   
CUDECOMP: 0      2            1.864     1.504     0.359     44.613
... (output continues) ...
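If the per-sample output is captured from the terminal, it can be post-processed with a few lines of script. The sketch below is one possible approach, not part of cuDecomp: it assumes the row layout shown above (rank, sample, total, A2A, local, A2A BW) and computes the mean total time per rank. The names `ROW` and `mean_total_by_rank` are hypothetical, and the function expects the rows of a single operation block, since rows from different operations would otherwise be averaged together.

```python
import re
from collections import defaultdict

# Matches per-sample data rows: rank, sample, total, A2A, local, A2A BW.
# Header, unit, and aggregated-table lines do not match and are skipped.
ROW = re.compile(
    r"^CUDECOMP:\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def mean_total_by_rank(lines):
    """Average the 'total [ms]' column per rank over all matching rows."""
    totals = defaultdict(list)
    for line in lines:
        match = ROW.match(line)
        if match:
            totals[int(match.group(1))].append(float(match.group(3)))
    return {rank: sum(vals) / len(vals) for rank, vals in totals.items()}


# Rank 0 rows of the TransposeXY block, copied from the output above.
sample_block = """\
CUDECOMP: rank   sample       total     A2A       local     A2A BW
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]
CUDECOMP: 0      0            1.867     1.504     0.362     44.613
CUDECOMP: 0      1            1.865     1.504     0.360     44.613
CUDECOMP: 0      2            1.865     1.503     0.361     44.643
CUDECOMP: 0      3            1.872     1.506     0.366     44.552
CUDECOMP: 0      4            1.870     1.504     0.366     44.613
"""

means = mean_total_by_rank(sample_block.splitlines())
```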

Finally, users can set CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR to write this data to CSV files. Please refer to the documentation for details on this setting. As an example, a CSV file corresponding to the aggregated transpose performance data looks like:

$ cat cudecomp-perf-report-transpose-aggregated-tcomm_1-hcomm_1-pdims_2x2-gdims_256x256x256-memorder_012120201.csv
# Transpose backend: MPI_P2P
# Halo backend: MPI
# Process grid: [2, 2]
# Global dimensions: [256, 256, 256]
# Memory order: [0,1,2]; [1,2,0]; [2,0,1]
#
operation,dtype,input_halo_extents,output_halo_extents,input_padding,output_padding,inplace,managed,samples,total_ms,A2A_ms,local_ms,A2A_BW_GBps
TransposeXY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.866,1.505,0.361,44.599
TransposeYZ,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.868,1.507,0.361,44.539
TransposeZY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.877,1.504,0.372,44.607
TransposeYX,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.878,1.507,0.371,44.530
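A file in this layout can be read with a standard CSV parser once the leading `#` comment lines are separated out. The following is a minimal sketch assuming the layout shown above; `read_perf_csv` is a hypothetical helper, not something cuDecomp provides.

```python
import csv
import io


def read_perf_csv(text):
    """Split '#' comment lines (grid metadata) from the CSV table and parse both."""
    meta, body = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            entry = line.lstrip("# ").rstrip()
            if entry:  # skip the bare "#" separator line
                meta.append(entry)
        elif line.strip():
            body.append(line)
    # The first non-comment line is the column header; quoted fields such as
    # "[0,0,0]" are handled by the csv module.
    rows = list(csv.DictReader(io.StringIO("\n".join(body))))
    return meta, rows


# Excerpt of the example file shown above.
sample_csv = """\
# Transpose backend: MPI_P2P
# Halo backend: MPI
# Process grid: [2, 2]
# Global dimensions: [256, 256, 256]
# Memory order: [0,1,2]; [1,2,0]; [2,0,1]
#
operation,dtype,input_halo_extents,output_halo_extents,input_padding,output_padding,inplace,managed,samples,total_ms,A2A_ms,local_ms,A2A_BW_GBps
TransposeXY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.866,1.505,0.361,44.599
TransposeYZ,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.868,1.507,0.361,44.539
"""

meta, rows = read_perf_csv(sample_csv)
```

Each row is a dictionary keyed by column name, so a value such as `rows[0]["A2A_ms"]` can be converted with `float()` for further analysis.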

Reasonable effort has been made to ensure these timings are accurate, but users are still encouraged to use profiling tools like Nsight Systems if more detailed understanding of performance data is required.

romerojosh force-pushed the performance_reporting branch 2 times, most recently from 681dc37 to 4ee01d7 Compare July 18, 2025 17:08

romerojosh mentioned this pull request Jul 18, 2025

Pin clang format version for CI. #80

Merged

romerojosh added 13 commits July 18, 2025 11:34

Add option to print per-operation performance metrics during code runtime.

9fa1ee1

More work on performance reporting. Adding more comprehensive final report mode that is less verbose. Adding sample count and warmup knobs.

9e51178

Replace existing per-op performance reporting with per-sample performance reporting. Enable with CUDECOMP_PERFORMANCE_REPORT_DETAIL.

4d0b855

Add performance reporting for halo operations. Update env var names and handling.

f20b03c

Some formatting changes.

aab4b53

Apply fixed sorting to performance report entries.

4c340d0

Refactoring and cleanup.

43cefd5

Remove erroneous quick return.

13dc563

wip

02b4087

wip

1ee01c2

Adding CSV file writing option. More cleanup.

6e8307f

Update transpose table halo/padding columns.

6fe39b4

Run clang-format.

c98eecb

romerojosh force-pushed the performance_reporting branch from f0edb4e to c98eecb Compare July 18, 2025 18:34

romerojosh merged commit 9b8e01b into main Jul 22, 2025
5 checks passed

romerojosh mentioned this pull request Jul 22, 2025

Fix C++ std::filesystem linking for older GCC toolchains. #81

Merged

romerojosh deleted the performance_reporting branch July 23, 2025 16:36

p-costa mentioned this pull request Aug 5, 2025

Avoid unecessary copies in the Poisson solver. CaNS-World/CaNS#174

Merged
