[Draft] Tensor Parallel support to llama.cpp #9648
ClarkChin08 wants to merge 2 commits into ggml-org:master
Conversation
Signed-off-by: Chen Xi <xi2chen@intel.com>
Signed-off-by: Chen Xi <xi2chen@intel.com>
Refer to issue #9086 for the detailed design.

@ClarkChin08 Is it possible to update the guide/docs to explain how to use this feature?
Thank you!
    list(APPEND GGML_EXTRA_LIBS_PRIVATE DNNL::dnnl)
endif()

set(oneCCL_DIR "/opt/intel/oneapi/ccl/latest/lib/cmake/oneCCL")
The real oneAPI path is not always /opt/intel/oneapi/.
Please use $ENV{ONEAPI_ROOT}, which is a mandatory environment variable, in the CMake file.
The same applies to the following script.
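A minimal sketch of the suggested change, assuming the standard oneAPI directory layout under ONEAPI_ROOT (the exact subpath is illustrative):

```cmake
# Hypothetical sketch: derive the oneCCL CMake config path from the
# ONEAPI_ROOT environment variable instead of hard-coding /opt/intel/oneapi/.
if(NOT DEFINED ENV{ONEAPI_ROOT})
    message(FATAL_ERROR "ONEAPI_ROOT is not set; please source the oneAPI environment (setvars.sh) first")
endif()
set(oneCCL_DIR "$ENV{ONEAPI_ROOT}/ccl/latest/lib/cmake/oneCCL")
```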
find_library(MPI_LIBRARY mpi HINTS ${MPI_LIBRARY_PATH})
find_library(ONECCL_LIBRARY ccl HINTS ${ONECCL_LIBRARY_PATH})
# find_package(oneCCL REQUIRED)
message("-- oneCCL found")
Add handling for the case where oneCCL is not found.
oneCCL is not included in the oneAPI Base Toolkit, so please print a message that guides the user on how to install it.
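A minimal sketch of such a not-found check, assuming the find_library call shown above (the installation hint wording is illustrative):

```cmake
# Hypothetical sketch: stop configuration with a helpful hint when oneCCL is missing.
find_library(ONECCL_LIBRARY ccl HINTS ${ONECCL_LIBRARY_PATH})
if(NOT ONECCL_LIBRARY)
    message(FATAL_ERROR
        "oneCCL not found. oneCCL is not part of the oneAPI Base Toolkit; "
        "please install it separately and source its environment before configuring.")
else()
    message(STATUS "oneCCL found: ${ONECCL_LIBRARY}")
endif()
```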
    return -1;
}

inline int get_rank() { return _rank; }
These new functions have no relationship with DPCT.
It's better to move them to ggml-sycl/src.
I recommend reducing the dependence on the DPCT code.
        _cpu_device = _devs.size() - 1;
    }
}
init_ccl();
Move this init() function to ggml-sycl/src.
enum tensor_parallel_mode {
    TENSOR_NO_CHANGE,
    TENSOR_SPLIT_BY_ROW,
    TENSOR_SPLIT_BY_COLUMN,
    TENSOR_KEEPED_ON_MASTER
};
Changes to the common ggml code should not be made unless absolutely necessary, which is not likely to be the case here. We already have a way to handle this with custom buffer types, such as the existing CUDA and SYCL split buffer types. You can extend this model instead by creating a different buffer type for tensors split by column. "Tensors kept on master" is just the default buffer type.
I don't think it is possible to do #9086 (comment) with only a backend extra buffer.
find_library(ONECCL_LIBRARY ccl HINTS ${ONECCL_LIBRARY_PATH})
# find_package(oneCCL REQUIRED)
message("-- oneCCL found")
set(GGML_EXTRA_LIBS ${GGML_EXTRA_LIBS} ${MPI_LIBRARY_PATH} ${ONECCL_LIBRARY_PATH})
GGML_EXTRA_LIBS was recently split into GGML_EXTRA_LIBS_PUBLIC and GGML_EXTRA_LIBS_PRIVATE, so I think the line above won't work anymore.
Also, why does this variable contain the paths to the lib directories instead of the MPI/oneCCL libraries located by find_library?
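A minimal sketch of what the reviewer is pointing at, assuming the current GGML_EXTRA_LIBS_PRIVATE convention and the find_library results from the snippet above:

```cmake
# Hypothetical sketch: append the libraries found by find_library (not the
# search paths) to the private link list used by ggml's CMake build.
find_library(MPI_LIBRARY mpi HINTS ${MPI_LIBRARY_PATH})
find_library(ONECCL_LIBRARY ccl HINTS ${ONECCL_LIBRARY_PATH})
list(APPEND GGML_EXTRA_LIBS_PRIVATE ${MPI_LIBRARY} ${ONECCL_LIBRARY})
```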
llama_model_loader ml(fname, params.use_mmap, params.check_tensors, params.kv_overrides);

model.hparams.vocab_only = params.vocab_only;
if (params.tensor_split == LLAMA_SPLIT_MODE_TENSOR) {
Shouldn't it be params.split_mode instead of params.tensor_split?
Hello - was this feature completed?
@ClarkChin08, hello - was this feature completed?
Hi, thanks, and I appreciate the work. It would be great to have this feature added/completed; it would bring great performance for multi-GPU setups, similar to what vLLM already has.
This looks really interesting! Having TP support like vLLM does would bring some great speedups!
Looking forward to having this feature.
Just a bump: this feature would be really great for the community.
I suspect the OP has abandoned development and this feature is incomplete. |
Add tensor parallel support to llama.cpp; the code is still a draft.