
Unexpected GPU Memory Usage Spike During Model Loading #36

@shiertier

Description:

I am experiencing an unexpected spike in GPU memory usage when loading the Meta-Llama-3.1-8B-Instruct-AWQ-INT4 model with the vLLM framework. While the model is loading, GPU memory usage sits at around 2-3 GB per GPU, but it jumps to approximately 28 GB per GPU once the model is fully loaded (each GPU has 32 GB in total).

Steps to Reproduce:

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=2,3,4,5 /root/miniconda3/bin/python /root/miniconda3/bin/enova enode run --model Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --host 0.0.0.0 --vllm_mode openai --tensor_parallel_size 4
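For comparison, enova appears to wrap vLLM here, and vLLM itself exposes a --gpu-memory-utilization flag (default 0.9) that controls how much of each GPU it reserves up front for the KV cache. A hypothetical direct launch of vLLM's OpenAI-compatible server with a lower value, to check whether the spike tracks that preallocation, might look like this (paths and the model name are taken from the command above):

```shell
# Hypothetical direct vLLM launch, bypassing enova, to test whether the
# memory spike follows vLLM's KV-cache preallocation. With
# --gpu-memory-utilization 0.5, vLLM should reserve roughly half of each
# 32 GB GPU instead of the default 90%.
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m vllm.entrypoints.openai.api_server \
    --model Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.5
```

If memory usage scales with this flag, the spike is vLLM's deliberate KV-cache reservation rather than a leak in model loading.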

Expected Behavior:

The GPU memory usage should remain stable after the model is loaded, ideally around the initial 2-3 GB per GPU.

Actual Behavior:

After the model is fully loaded, the GPU memory usage spikes to approximately 28 GB per GPU.

Logs:

INFO 10-07 20:28:04 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-07 20:28:04 config.py:729] Defaulting to use mp for distributed inference
WARNING 10-07 20:28:04 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-07 20:28:04 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-07 20:28:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='Meta-Llama-3.1-8B-Instruct-AWQ-INT4', speculative_config=None, tokenizer='Meta-Llama-3.1-8B-Instruct-AWQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=Meta-Llama-3.1-8B-Instruct-AWQ-INT4, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 10-07 20:28:05 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-07 20:28:05 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 1s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 2s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 4s.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[rank1]:[W1007 20:28:12.734201822 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank2]:[W1007 20:28:12.801215263 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank3]:[W1007 20:28:12.830036780 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W1007 20:28:12.836890426 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725714) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1725715) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1725713) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 10-07 20:28:13 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fd18c10a830>, local_subscribe_port=38211, remote_subscribe_port=None)
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.67it/s]

(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 8s.
INFO 10-07 20:28:15 distributed_gpu_executor.py:56] # GPU blocks: 52860, # CPU blocks: 8192
INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 16s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 32s.
INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 29 secs.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
[2024-10-07 Mon 20:28:47.353][INFO][MainProcess][1725177][MainThread][server][/root/miniconda3/lib/python3.10/site-packages/enova/serving/backend/vllm.py:78 - _create_app()] [trace_id: dff43d8c8f14416e93ffad0f2179bfdd]|CONFIG.vllm: {'tensor_parallel_size': 4, 'trust_remote_code': True}
[2024-10-07 Mon 20:28:47.354][INFO][MainProcess][1725177][MainThread][server][/root/miniconda3/lib/python3.10/site-packages/enova/serving/backend/base.py:39 - print_run_info()] [trace_id: dff43d8c8f14416e93ffad0f2179bfdd]|host: 0.0.0.0, port: 9199
INFO:     Started server process [1725177]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9199 (Press CTRL+C to quit)

At this point, each of the four GPUs shows 28.2 GB of VRAM in use.

With Meta-Llama-3.1-70B-Instruct-AWQ-INT4 the behavior is similar, except GPU memory usage reaches around 29-30 GB per GPU.
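As a rough cross-check, the log line "# GPU blocks: 52860" above can be converted into the amount of KV-cache memory vLLM preallocated per GPU. The sketch below assumes Llama-3.1-8B's published architecture (32 layers, 8 KV heads, head dim 128), an fp16 cache, vLLM's default block size of 16 tokens, and tensor_parallel_size=4; these are assumptions, not values read from this log.

```python
# Estimate the per-GPU KV-cache footprint implied by "# GPU blocks: 52860".
# Assumed values (Llama-3.1-8B architecture + vLLM defaults, not from the log):
num_blocks = 52860       # from the vLLM log line above
block_size = 16          # tokens per block (vLLM default)
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2          # fp16
tp_size = 4              # tensor_parallel_size from the repro command

# Each GPU holds num_kv_heads / tp_size heads; factor 2 covers K and V.
kv_heads_per_gpu = num_kv_heads // tp_size
bytes_per_block = 2 * num_layers * block_size * kv_heads_per_gpu * head_dim * dtype_bytes
kv_cache_gib = num_blocks * bytes_per_block / 1024**3
print(f"Implied KV cache per GPU: {kv_cache_gib:.1f} GiB")  # ~25.8 GiB
```

Under these assumptions the block count implies roughly 25.8 GiB of KV cache per GPU, which, together with the 1.39 GB of weights and the 1-3 GiB of CUDA-graph overhead the log warns about, is consistent with the observed ~28 GB.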
