[bug] CustomAllReduceComm swapInternalBuffer is not safe (modifying const pointer). #671

@rkindi

Branch/Tag/Commit

main

Docker Image Version

N/A

GPU name

A100

CUDA Driver

N/A

Reproduced Steps

This line is not safe because it writes to the data field of a Tensor struct, which is a const void* (modifying a constant value is undefined behavior). When I run a standalone script to test the custom all reduce, I print the tensor's data attribute before and after the call to swapInternalBuffer and see that it does not change. My script is attached below:

repro_issue671_fastertransformer.zip

Instructions:

- python3 make_npy_tensors.py
- Run main.cu
- python3 validate_npy_tensors.py

Output from my machine for main.cu:

ar_out_buffer.data (before): 0x7f65d5002400.
ar_out_buffer.data (after): 0x7f65d5002400.
DONE

The output shows that the Tensor's data pointer is unchanged. This prevents us from using custom all reduce, because there is no way to write to the all reduce input buffers.
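
For readers without the attachment, the before/after check can be sketched in plain C++ roughly as follows. This is only an illustration under stated assumptions: MiniTensor, swap_noop, swap_repoint, and report_swap are hypothetical names, not FasterTransformer APIs, and swap_noop merely mimics the observed symptom rather than reproducing the real swapInternalBuffer.

```cpp
#include <functional>
#include <iostream>

// Hypothetical stand-in for the library's Tensor; only the field under test is modeled.
struct MiniTensor {
    const void* data;
};

// Hypothetical swap that leaves the tensor untouched, mimicking the reported symptom.
inline void swap_noop(MiniTensor& /*tensor*/, const void* /*internal_buffer*/) {}

// Hypothetical swap that actually repoints the tensor at the internal buffer.
inline void swap_repoint(MiniTensor& tensor, const void* internal_buffer) {
    tensor.data = internal_buffer;
}

// Prints the data pointer before and after the swap; returns true if it changed.
inline bool report_swap(MiniTensor& tensor,
                        const void* internal_buffer,
                        const std::function<void(MiniTensor&, const void*)>& swap) {
    const void* before = tensor.data;
    std::cout << "data (before): " << before << "\n";
    swap(tensor, internal_buffer);
    std::cout << "data (after): " << tensor.data << "\n";
    return tensor.data != before;
}
```

Calling report_swap with swap_noop returns false and prints the same address twice, which is the pattern in the output above. With swap_repoint it returns true, which is what a caller of swapInternalBuffer would expect to see.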
