
ARK v0.1.0 Known Bugs & Issues #35

@chhwang

Description

    • Executor::tensor_memcpy_host_to_device() fails with an unknown error if the tensor on the host is not stored sequentially (contiguously) in memory. We need more checks on the host tensor, or maybe a Python wrapper for this (Improve Python interfaces #48)
    • Sometimes, if the tensor is padded, the allgather operation might overwrite the recv tensor, and the allreduce tensor will then also be incorrect. (@chhwang: now send/recv checks contiguity)
    • The current layernorm and softmax operations are scheduled in a rather hacky way and might need further updates in the future. (Minor updates #59)
    • Layernorm needs a recv dependency at its output (@chhwang: it already has one)
    • [ ] Support both source and destination offsets in NetIbQp::stage_send() (moved to the next version)
    • When using python -m unittest discover -s . -p "test_*.py" to run all unit tests, the sendrecv test fails, but when we run the tests separately, there is no problem. It seems that in some cases the previous runtime context is not destroyed when one unit test finishes and the next one starts. This problem also exists in the current main branch. (@chhwang: this is the test code's issue, won't fix for now)
    • [ ] Offsets of importing/exporting tensors are not properly handled (moved to the next version)
    • The float matmul error rate seems too high, but it's unclear whether this is ARK's issue or the test code's issue (@chhwang: this is not an issue)
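
For the non-sequential host tensor and padded tensor items above, a caller-side check could look roughly like the sketch below. It assumes the host data is a NumPy array. `to_contiguous_host` is a hypothetical helper, not part of ARK, and the actual copy call is left out because its Python signature is not shown in this issue.

```python
import numpy as np


def to_contiguous_host(array: np.ndarray) -> np.ndarray:
    """Return a C-contiguous array with the same contents, copying only if needed.

    Hypothetical helper for illustration; not an ARK API.
    """
    if array.flags["C_CONTIGUOUS"]:
        return array
    return np.ascontiguousarray(array)


# A padded buffer: 4 rows of 8 floats, of which only the first 6 columns hold data.
padded = np.zeros((4, 8), dtype=np.float32)
view = padded[:, :6]  # strided view: consecutive rows are not adjacent in memory

# Make the data sequential before handing it to any host-to-device copy.
safe = to_contiguous_host(view)
print(view.flags["C_CONTIGUOUS"], safe.flags["C_CONTIGUOUS"])
```

A wrapper could instead raise an error on non-contiguous input rather than copying silently; which behavior is preferable is a design choice this sketch does not settle.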
