@zhiyongww

Hi NCCL Team,

I'd like to share an enhancement I've made to nccl-tests, a --json option that prints the benchmark results as structured JSON, and would appreciate your review.

Thank you!

Sample output:

[root@rtptest1621.ftw6 ~]# NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN ./nccl_alltoall_perf -b 32M -e 128M -f 2 -n 1000 -g 8 --json
[rtptest1621][[31817,1],0][btl_openib_component.c:1704:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   rtptest1621
  Local device: mlx5_0
--------------------------------------------------------------------------
NCCL version 2.26.6+cudaCUDA_MAJOR.CUDA_MINOR
{"identifier": "nccl-tests structured output", "version": "2.26.6", "benchmark": [{"33554432": {"count": 1048576, "type": "float", "redop": "none", "root": "-1", "oop_time": 1494.278386, "oop_algbw": 22.455275, "oop_busbw": 19.648366, "oop_error": 0.000000, "ip_time": 1459.561667, "ip_algbw": 22.989390, "ip_busbw": 20.115716, "ip_error": "N/A"}, "67108864": {"count": 2097152, "type": "float", "redop": "none", "root": "-1", "oop_time": 2846.555628, "oop_algbw": 23.575462, "oop_busbw": 20.628529, "oop_error": 0.000000, "ip_time": 2849.237796, "ip_algbw": 23.553269, "ip_busbw": 20.609110, "ip_error": "N/A"}, "134217728": {"count": 4194304, "type": "float", "redop": "none", "root": "-1", "oop_time": 5613.751383, "oop_algbw": 23.908741, "oop_busbw": 20.920148, "oop_error": 0.000000, "ip_time": 5616.690443, "ip_algbw": 23.896230, "ip_busbw": 20.909201, "ip_error": "N/A"}}], "out_of_bounds_value": 0, "out_of_bounds_value_status": "OK", "avg_bus_bw": 20.471845, "avg_bus_bw_status": "OK"}
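
In the run above, the JSON report shares stdout with the Open MPI warning banner and the NCCL version line. A minimal Python sketch for picking the report out of such mixed output follows. It assumes the report is printed on a single line that starts with `{`, as in this sample. The helper name `extract_json_report` and the shortened `sample` text are illustrative and not part of nccl-tests.

```python
import json


def extract_json_report(stdout_text):
    """Return the first line of stdout_text that parses as a JSON object."""
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # warning banners, version lines, blank lines
        try:
            report = json.loads(line)
        except json.JSONDecodeError:
            continue  # starts with "{" but is not valid JSON
        if isinstance(report, dict):
            return report
    raise ValueError("no JSON object found in output")


# Shortened stand-in for the captured output (benchmark entries omitted).
sample = """WARNING: There was an error initializing an OpenFabrics device.
NCCL version 2.26.6+cudaCUDA_MAJOR.CUDA_MINOR
{"identifier": "nccl-tests structured output", "version": "2.26.6", "benchmark": [], "avg_bus_bw": 20.471845, "avg_bus_bw_status": "OK"}
"""

report = extract_json_report(sample)
print(report["version"], report["avg_bus_bw"])
```

If the tool's stdout is captured with `subprocess.run(..., capture_output=True, text=True)`, the resulting `stdout` string can be passed to the helper unchanged.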

Formatted JSON:

{
  "identifier": "nccl-tests structured output",
  "version": "2.26.6",
  "benchmark": [
    {
      "33554432": {
        "count": 1048576,
        "type": "float",
        "redop": "none",
        "root": "-1",
        "oop_time": 1494.278386,
        "oop_algbw": 22.455275,
        "oop_busbw": 19.648366,
        "oop_error": 0,
        "ip_time": 1459.561667,
        "ip_algbw": 22.98939,
        "ip_busbw": 20.115716,
        "ip_error": "N/A"
      },
      "67108864": {
        "count": 2097152,
        "type": "float",
        "redop": "none",
        "root": "-1",
        "oop_time": 2846.555628,
        "oop_algbw": 23.575462,
        "oop_busbw": 20.628529,
        "oop_error": 0,
        "ip_time": 2849.237796,
        "ip_algbw": 23.553269,
        "ip_busbw": 20.60911,
        "ip_error": "N/A"
      },
      "134217728": {
        "count": 4194304,
        "type": "float",
        "redop": "none",
        "root": "-1",
        "oop_time": 5613.751383,
        "oop_algbw": 23.908741,
        "oop_busbw": 20.920148,
        "oop_error": 0,
        "ip_time": 5616.690443,
        "ip_algbw": 23.89623,
        "ip_busbw": 20.909201,
        "ip_error": "N/A"
      }
    }
  ],
  "out_of_bounds_value": 0,
  "out_of_bounds_value_status": "OK",
  "avg_bus_bw": 20.471845,
  "avg_bus_bw_status": "OK"
}
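
As a sanity check on the structure, the sketch below walks the `benchmark` array and averages every `oop_busbw` and `ip_busbw` value. For this sample the result agrees with the reported `avg_bus_bw`; whether the tool defines the field exactly this way in every configuration is an assumption. The report literal is trimmed to the fields the check needs, and the function name `mean_bus_bandwidth` is illustrative.

```python
import json

# The sample report above, trimmed to the bus-bandwidth fields.
report = json.loads("""
{
  "benchmark": [
    {
      "33554432":  {"oop_busbw": 19.648366, "ip_busbw": 20.115716},
      "67108864":  {"oop_busbw": 20.628529, "ip_busbw": 20.609110},
      "134217728": {"oop_busbw": 20.920148, "ip_busbw": 20.909201}
    }
  ],
  "avg_bus_bw": 20.471845
}
""")


def mean_bus_bandwidth(report):
    """Average out-of-place and in-place bus bandwidth over all sizes and runs."""
    values = []
    for run in report["benchmark"]:      # one dict per run
        for entry in run.values():       # one entry per message size
            values.append(entry["oop_busbw"])
            values.append(entry["ip_busbw"])
    return sum(values) / len(values)


recomputed = mean_bus_bandwidth(report)
print(f"recomputed={recomputed:.6f} reported={report['avg_bus_bw']:.6f}")
```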

With -N 2, after formatting the JSON output:

NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN ./nccl_gather_perf -b 32M -e 128M -f 2 -n 1000 -g 8 -d f8e4m3 -N 2 --json

{
  "identifier": "nccl-tests structured output",
  "version": "2.26.6",
  "benchmark": [
    {
      "33554432": {
        "count": 4194304,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 1172.362777,
        "oop_algbw": 28.621202,
        "oop_busbw": 25.043552,
        "oop_error": 0,
        "ip_time": 1165.302078,
        "ip_algbw": 28.794621,
        "ip_busbw": 25.195294,
        "ip_error": 0
      },
      "67108864": {
        "count": 8388608,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 2248.297517,
        "oop_algbw": 29.848747,
        "oop_busbw": 26.117654,
        "oop_error": 0,
        "ip_time": 2252.372999,
        "ip_algbw": 29.794738,
        "ip_busbw": 26.070396,
        "ip_error": 0
      },
      "134217728": {
        "count": 16777216,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 4423.854938,
        "oop_algbw": 30.339541,
        "oop_busbw": 26.547098,
        "oop_error": 0,
        "ip_time": 4435.595384,
        "ip_algbw": 30.259236,
        "ip_busbw": 26.476832,
        "ip_error": 0
      }
    },
    {
      "33554432": {
        "count": 4194304,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 1154.07066,
        "oop_algbw": 29.074851,
        "oop_busbw": 25.440494,
        "oop_error": 0,
        "ip_time": 1166.898772,
        "ip_algbw": 28.755221,
        "ip_busbw": 25.160818,
        "ip_error": 0
      },
      "67108864": {
        "count": 8388608,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 2248.849333,
        "oop_algbw": 29.841423,
        "oop_busbw": 26.111245,
        "oop_error": 0,
        "ip_time": 2250.891268,
        "ip_algbw": 29.814352,
        "ip_busbw": 26.087558,
        "ip_error": 0
      },
      "134217728": {
        "count": 16777216,
        "type": "f8e4m3",
        "redop": "none",
        "root": "0",
        "oop_time": 4422.531622,
        "oop_algbw": 30.348619,
        "oop_busbw": 26.555042,
        "oop_error": 0,
        "ip_time": 4435.042864,
        "ip_algbw": 30.263006,
        "ip_busbw": 26.48013,
        "ip_error": 0
      }
    }
  ],
  "out_of_bounds_value": 0,
  "out_of_bounds_value_status": "OK",
  "avg_bus_bw": 25.940509,
  "avg_bus_bw_status": "OK"
}
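
With -N 2 the `benchmark` array in this sample holds two objects, each keyed by message size in bytes. A hedged sketch of how a consumer might average one field per size across those objects is shown below. The data is trimmed to `oop_busbw`, and the helper name `per_size_mean` is illustrative rather than anything provided by nccl-tests.

```python
from collections import defaultdict
from statistics import mean

# The two benchmark objects from the -N 2 sample, trimmed to one field.
benchmark = [
    {
        "33554432": {"oop_busbw": 25.043552},
        "67108864": {"oop_busbw": 26.117654},
        "134217728": {"oop_busbw": 26.547098},
    },
    {
        "33554432": {"oop_busbw": 25.440494},
        "67108864": {"oop_busbw": 26.111245},
        "134217728": {"oop_busbw": 26.555042},
    },
]


def per_size_mean(benchmark, field):
    """Map each message size (int, bytes) to the mean of `field` across runs."""
    samples = defaultdict(list)
    for run in benchmark:
        for size, entry in run.items():
            samples[int(size)].append(entry[field])  # keys are numeric strings
    return {size: mean(vals) for size, vals in sorted(samples.items())}


result = per_size_mean(benchmark, "oop_busbw")
for size, value in result.items():
    print(f"{size:>12d}  {value:.6f}")
```

Converting the keys to integers before sorting matters here: sorted as strings, "134217728" would come before "33554432".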

@AddyLaddy
Collaborator

Thanks for this suggestion. We had JSON support in some of our internal code, so I ported that to keep the repos consistent.
I hope this feature meets your requirements.
