enable tp for benchmark #43750
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@remi-or, please help review, thanks very much.
Hi @remi-or, any thoughts on enabling TP and CP in the benchmark tool?
remi-or left a comment:
Some nits, but otherwise lgtm. Have you tried out the benchmarking with TP? And if so, how were the results? I am curious what applications you are targeting this for 🙂 !
Actually, I would like to leverage this benchmark tool on XPU for a broader set of models, so I need to run bigger models (like the MoE series). One card cannot run such models due to memory limitations, so I need to run with EP and TP across multiple cards. TP and EP support in the kernels path is also on our radar.
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Ok, this looks good! I would like to test on my end before merging, will do so soon. My question was: did you manage to run the benchmarker in a distributed setting on your end? Or is this a small change needed for that but not enough to enable the feature? Thanks
Yes, I tested on my side using `torchrun --nproc-per-node 2 run_benchmarks.py --enable-tp`; this change is enough to enable the feature.
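For readers unfamiliar with the launch flow above: `torchrun` spawns one process per device and sets environment variables (`WORLD_SIZE`, `RANK`) that the script can read to decide its tensor-parallel degree. The sketch below is a minimal, hypothetical illustration of how an `--enable-tp` flag might be wired to those variables; it is not the PR's actual implementation, and the function and dict keys are illustrative names only.

```python
# Hypothetical sketch: wiring an --enable-tp flag to torchrun's env vars.
# NOT the PR's actual code; parse_tp_config and its keys are made up.
import argparse
import os


def parse_tp_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enable-tp",
        action="store_true",
        help="shard the model across all ranks launched by torchrun",
    )
    args = parser.parse_args(argv)
    # torchrun exports WORLD_SIZE and RANK into each spawned process;
    # default to a single-process run when they are absent.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    rank = int(os.environ.get("RANK", "0"))
    tp_size = world_size if args.enable_tp else 1
    return {"tp_size": tp_size, "rank": rank}


if __name__ == "__main__":
    # Without torchrun, WORLD_SIZE defaults to 1, so tp_size is 1
    # whether or not the flag is passed.
    print(parse_tp_config([]))
    print(parse_tp_config(["--enable-tp"]))
```

Under `torchrun --nproc-per-node 2`, each of the two processes would see `WORLD_SIZE=2`, so `tp_size` becomes 2 when `--enable-tp` is passed.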
Enable TP in benchmark_v2 to ensure large models can run.