@romerojosh
Collaborator

This PR introduces a detailed performance reporting feature to cuDecomp, enabled with a new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT. The purpose of this feature is to enable users to extract fine-grained performance information from transpose and halo operations invoked in their programs.

When this feature is enabled, cuDecomp will capture timing information inside all calls to cudecompTranspose* and cudecompUpdateHalo*. The timing information captured is the time spent in communication (alltoall, sendrecv) and time spent in local operations (including overheads). The number of samples stored is configurable via CUDECOMP_PERFORMANCE_REPORT_NSAMPLES and defaults to 20 samples per transpose/halo configuration. A performance report is printed to the terminal upon grid descriptor destruction. See the updated documentation for more configuration variables.
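As a minimal sketch of how a run might be configured, the snippet below builds a process environment containing the two variables named above. The value `"1"` for CUDECOMP_ENABLE_PERFORMANCE_REPORT and the helper name `perf_report_env` are assumptions made for illustration and are not taken from this PR; the documentation is the reference for accepted values.

```python
import os


def perf_report_env(nsamples=20, base_env=None):
    """Return a copy of an environment with performance reporting switched on.

    Assumption: a value of "1" enables the report. The PR text names the
    variable but does not state which values it accepts.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDECOMP_ENABLE_PERFORMANCE_REPORT"] = "1"
    # Number of samples stored per transpose/halo configuration (default 20).
    env["CUDECOMP_PERFORMANCE_REPORT_NSAMPLES"] = str(nsamples)
    return env


env = perf_report_env(nsamples=5)
print({k: v for k, v in env.items() if k.startswith("CUDECOMP_")})
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever launcher command your system uses.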

By default, only an aggregated performance report is printed. Here is an example of what that report looks like:

CUDECOMP:
CUDECOMP: ===== Performance Summary =====
CUDECOMP: Grid Configuration:
CUDECOMP:       Transpose backend: MPI_P2P
CUDECOMP:       Halo backend: MPI
CUDECOMP:       Process grid: [2, 2]
CUDECOMP:       Global dimensions: [256, 256, 256]
CUDECOMP:       Memory order: [0,1,2]; [1,2,0]; [2,0,1]
CUDECOMP:
CUDECOMP: Transpose Performance Data:
CUDECOMP:
CUDECOMP: operation    dtype  halo extents    padding         inplace  managed  samples  total     A2A       local     A2A BW   
CUDECOMP:                                                                                [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: ------------------------------------------------------------------------------------------------------------------------
CUDECOMP: TransposeXY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.866     1.504     0.361     44.607   
CUDECOMP: TransposeYZ  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.867     1.505     0.361     44.580   
CUDECOMP: TransposeZY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.876     1.503     0.373     44.646   
CUDECOMP: TransposeYX  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.879     1.507     0.373     44.543   
CUDECOMP: ================================
CUDECOMP:

The timings and bandwidths reported in this table are averaged across all samples and all ranks.

To retrieve more verbose per-sample output, the environment variable CUDECOMP_PERFORMANCE_REPORT_DETAIL can be used. Setting this variable to 1 will print per-sample output for rank 0 only, while setting it to 2 will print per-sample output for all ranks. For example, running with CUDECOMP_PERFORMANCE_REPORT_DETAIL=2 will generate output like:

CUDECOMP:
CUDECOMP: ===== Performance Summary =====
CUDECOMP: Grid Configuration:
CUDECOMP:       Transpose backend: MPI_P2P
CUDECOMP:       Halo backend: MPI
CUDECOMP:       Process grid: [2, 2]
CUDECOMP:       Global dimensions: [256, 256, 256]
CUDECOMP:       Memory order: [0,1,2]; [1,2,0]; [2,0,1]
CUDECOMP:
CUDECOMP: Transpose Performance Data:
CUDECOMP:
CUDECOMP: operation    dtype  halo extents    padding         inplace  managed  samples  total     A2A       local     A2A BW   
CUDECOMP:                                                                                [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: ------------------------------------------------------------------------------------------------------------------------
CUDECOMP: TransposeXY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.867     1.505     0.362     44.594   
CUDECOMP: TransposeYZ  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.874     1.513     0.361     44.359   
CUDECOMP: TransposeZY  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.876     1.504     0.372     44.624   
CUDECOMP: TransposeYX  Z      [0,0,0]/[0,0,0] [0,0,0]/[0,0,0] T        F        5        1.884     1.512     0.372     44.393   
CUDECOMP:
CUDECOMP: Per-Sample Details:
CUDECOMP:
CUDECOMP: TransposeXY (dtype=Z, halo extents=[0,0,0]/[0,0,0], padding=[0,0,0]/[0,0,0], inplace=T, managed=F) samples:
CUDECOMP: rank   sample       total     A2A       local     A2A BW   
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: 0      0            1.867     1.504     0.362     44.613   
CUDECOMP: 0      1            1.865     1.504     0.360     44.613   
CUDECOMP: 0      2            1.865     1.503     0.361     44.643   
CUDECOMP: 0      3            1.872     1.506     0.366     44.552   
CUDECOMP: 0      4            1.870     1.504     0.366     44.613   
CUDECOMP: 1      0            1.869     1.507     0.361     44.522   
CUDECOMP: 1      1            1.866     1.504     0.361     44.613   
CUDECOMP: 1      2            1.865     1.504     0.360     44.613   
CUDECOMP: 1      3            1.865     1.503     0.361     44.643   
CUDECOMP: 1      4            1.865     1.502     0.362     44.673   
CUDECOMP: 2      0            1.868     1.506     0.361     44.552   
CUDECOMP: 2      1            1.865     1.505     0.359     44.582   
CUDECOMP: 2      2            1.866     1.504     0.361     44.613   
CUDECOMP: 2      3            1.869     1.508     0.360     44.492   
CUDECOMP: 2      4            1.867     1.504     0.362     44.613   
CUDECOMP: 3      0            1.870     1.507     0.362     44.522   
CUDECOMP: 3      1            1.867     1.506     0.360     44.552   
CUDECOMP: 3      2            1.865     1.503     0.361     44.643   
CUDECOMP: 3      3            1.865     1.503     0.361     44.643   
CUDECOMP: 3      4            1.866     1.505     0.360     44.582   
CUDECOMP:
CUDECOMP: TransposeYZ (dtype=Z, halo extents=[0,0,0]/[0,0,0], padding=[0,0,0]/[0,0,0], inplace=T, managed=F) samples:
CUDECOMP: rank   sample       total     A2A       local     A2A BW   
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]   
CUDECOMP: 0      0            1.869     1.504     0.365     44.613   
CUDECOMP: 0      1            1.864     1.503     0.360     44.643   
CUDECOMP: 0      2            1.864     1.504     0.359     44.613
... (output continues) ...
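
If the per-sample output needs post-processing, something like the following sketch could pull the rows out of captured terminal text and average the total time per rank. It assumes the layout shown above (a `CUDECOMP:` prefix, a header line ending in `samples:`, then rows of six numeric fields); the function names are hypothetical and not part of cuDecomp.

```python
from collections import defaultdict

PREFIX = "CUDECOMP:"


def parse_per_sample(report_text):
    """Group per-sample rows by the configuration header that precedes them.

    Returns {header: [(rank, sample, total_ms, a2a_ms, local_ms, bw_gbps), ...]}.
    """
    groups = defaultdict(list)
    current = None
    for line in report_text.splitlines():
        if not line.startswith(PREFIX):
            continue
        body = line[len(PREFIX):].strip()
        if body.endswith("samples:"):
            # e.g. "TransposeXY (dtype=Z, ...) samples:"
            current = body[: -len("samples:")].strip()
            continue
        fields = body.split()
        if current is None or len(fields) != 6:
            continue
        try:
            rank, sample = int(fields[0]), int(fields[1])
            timings = tuple(float(f) for f in fields[2:])
        except ValueError:
            continue  # six fields, but not a numeric data row
        groups[current].append((rank, sample) + timings)
    return dict(groups)


def mean_total_ms_by_rank(rows):
    """Average the total time [ms] over samples, separately for each rank."""
    per_rank = defaultdict(list)
    for rank, _sample, total_ms, *_rest in rows:
        per_rank[rank].append(total_ms)
    return {rank: sum(v) / len(v) for rank, v in per_rank.items()}


# Short excerpt copied from the output above.
EXCERPT = """\
CUDECOMP: TransposeXY (dtype=Z, halo extents=[0,0,0]/[0,0,0], padding=[0,0,0]/[0,0,0], inplace=T, managed=F) samples:
CUDECOMP: rank   sample       total     A2A       local     A2A BW
CUDECOMP:                     [ms]      [ms]      [ms]      [GB/s]
CUDECOMP: 0      0            1.867     1.504     0.362     44.613
CUDECOMP: 0      1            1.865     1.504     0.360     44.613
CUDECOMP: 1      0            1.869     1.507     0.361     44.522
"""

for header, rows in parse_per_sample(EXCERPT).items():
    print(header, mean_total_ms_by_rank(rows))
```

For anything beyond quick inspection, the CSV output described next is likely a more robust input than scraped terminal text.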

Finally, users can set CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR to enable writing this data to CSV files. Please refer to the documentation for details on this setting. As an example, a CSV file corresponding to the aggregated transpose performance data looks like:

$ cat cudecomp-perf-report-transpose-aggregated-tcomm_1-hcomm_1-pdims_2x2-gdims_256x256x256-memorder_012120201.csv
# Transpose backend: MPI_P2P
# Halo backend: MPI
# Process grid: [2, 2]
# Global dimensions: [256, 256, 256]
# Memory order: [0,1,2]; [1,2,0]; [2,0,1]
#
operation,dtype,input_halo_extents,output_halo_extents,input_padding,output_padding,inplace,managed,samples,total_ms,A2A_ms,local_ms,A2A_BW_GBps
TransposeXY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.866,1.505,0.361,44.599
TransposeYZ,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.868,1.507,0.361,44.539
TransposeZY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.877,1.504,0.372,44.607
TransposeYX,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.878,1.507,0.371,44.530
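
A file in this layout can be loaded with standard tooling. The sketch below assumes only what the example shows: `#`-prefixed metadata lines followed by a header row and quoted list fields. The helper names `read_perf_csv` and `comm_fraction` are hypothetical. It also computes the share of total time spent in communication as `A2A_ms / total_ms`.

```python
import csv
import io


def read_perf_csv(text):
    """Parse report CSV text, returning (metadata_lines, rows).

    Lines starting with '#' are treated as metadata comments; the first
    remaining non-empty line is taken as the column header.
    """
    meta, data = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            note = line.lstrip("#").strip()
            if note:
                meta.append(note)
        elif line.strip():
            data.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(data))))
    return meta, rows


def comm_fraction(row):
    """Share of the total time spent in communication (A2A_ms / total_ms)."""
    return float(row["A2A_ms"]) / float(row["total_ms"])


# Excerpt copied from the example file above.
SAMPLE_CSV = """\
# Transpose backend: MPI_P2P
# Halo backend: MPI
# Process grid: [2, 2]
# Global dimensions: [256, 256, 256]
# Memory order: [0,1,2]; [1,2,0]; [2,0,1]
#
operation,dtype,input_halo_extents,output_halo_extents,input_padding,output_padding,inplace,managed,samples,total_ms,A2A_ms,local_ms,A2A_BW_GBps
TransposeXY,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.866,1.505,0.361,44.599
TransposeYZ,Z,"[0,0,0]","[0,0,0]","[0,0,0]","[0,0,0]",T,F,5,1.868,1.507,0.361,44.539
"""

meta, rows = read_perf_csv(SAMPLE_CSV)
for row in rows:
    print(row["operation"], f"{comm_fraction(row):.1%} of total time in A2A")
```

To read a file from disk, pass its contents (for example `open(path).read()`) to `read_perf_csv`.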

Reasonable effort has been made to ensure these timings are accurate, but users are still encouraged to use profiling tools like Nsight Systems if a more detailed understanding of performance is required.

@romerojosh romerojosh force-pushed the performance_reporting branch 2 times, most recently from 681dc37 to 4ee01d7 Compare July 18, 2025 17:08
@romerojosh romerojosh force-pushed the performance_reporting branch from f0edb4e to c98eecb Compare July 18, 2025 18:34
@romerojosh romerojosh merged commit 9b8e01b into main Jul 22, 2025
5 checks passed
@romerojosh romerojosh deleted the performance_reporting branch July 23, 2025 16:36