
Unexpected GPU Memory Usage Spike During Model Loading #36

@shiertier

Description:

I am experiencing an unexpected spike in GPU memory usage when loading the Meta-Llama-3.1-8B-Instruct-AWQ-INT4 model with the vLLM framework. While the model is loading, GPU memory usage sits at around 2-3 GB per GPU, but it jumps to approximately 28 GB per GPU once the model is fully loaded (each GPU has 32 GB in total).

Steps to Reproduce:

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=2,3,4,5 /root/miniconda3/bin/python /root/miniconda3/bin/enova enode run --model Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --host 0.0.0.0 --vllm_mode openai --tensor_parallel_size 4
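For comparison, enova appears to wrap vLLM here, and vLLM itself exposes a --gpu-memory-utilization flag (default 0.9) that controls how much of each GPU it reserves up front for the KV cache. A hypothetical direct launch of vLLM's OpenAI-compatible server with a lower value, to check whether the spike tracks that preallocation, might look like this (paths and the model name are taken from the command above):

```shell
# Hypothetical direct vLLM launch, bypassing enova, to test whether the
# memory spike follows vLLM's KV-cache preallocation. With
# --gpu-memory-utilization 0.5, vLLM should reserve roughly half of each
# 32 GB GPU instead of the default 90%.
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m vllm.entrypoints.openai.api_server \
    --model Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.5
```

If memory usage scales with this flag, the spike is vLLM's deliberate KV-cache reservation rather than a leak in model loading.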

Expected Behavior:

The GPU memory usage should remain stable after the model is loaded, ideally around the initial 2-3 GB per GPU.

Actual Behavior:

After the model is fully loaded, the GPU memory usage spikes to approximately 28 GB per GPU.

Logs:

INFO 10-07 20:28:04 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-07 20:28:04 config.py:729] Defaulting to use mp for distributed inference
WARNING 10-07 20:28:04 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-07 20:28:04 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-07 20:28:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='Meta-Llama-3.1-8B-Instruct-AWQ-INT4', speculative_config=None, tokenizer='Meta-Llama-3.1-8B-Instruct-AWQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=Meta-Llama-3.1-8B-Instruct-AWQ-INT4, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 10-07 20:28:05 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-07 20:28:05 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 1s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 2s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 4s.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[rank1]:[W1007 20:28:12.734201822 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank2]:[W1007 20:28:12.801215263 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank3]:[W1007 20:28:12.830036780 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W1007 20:28:12.836890426 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
INFO 10-07 20:28:12 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-07 20:28:12 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1725714) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1725715) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1725713) WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 10-07 20:28:13 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 10-07 20:28:13 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fd18c10a830>, local_subscribe_port=38211, remote_subscribe_port=None)
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:13 model_runner.py:720] Starting to load model Meta-Llama-3.1-8B-Instruct-AWQ-INT4...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.67it/s]

(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
INFO 10-07 20:28:14 model_runner.py:732] Loading model weights took 1.3931 GB
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 8s.
INFO 10-07 20:28:15 distributed_gpu_executor.py:56] # GPU blocks: 52860, # CPU blocks: 8192
INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:18 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:18 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 16s.
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to otel-collector:4317, retrying in 32s.
INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 29 secs.
(VllmWorkerProcess pid=1725714) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
(VllmWorkerProcess pid=1725715) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
(VllmWorkerProcess pid=1725713) INFO 10-07 20:28:47 model_runner.py:1225] Graph capturing finished in 28 secs.
[2024-10-07 Mon 20:28:47.353][INFO][MainProcess][1725177][MainThread][server][/root/miniconda3/lib/python3.10/site-packages/enova/serving/backend/vllm.py:78 - _create_app()] [trace_id: dff43d8c8f14416e93ffad0f2179bfdd]|CONFIG.vllm: {'tensor_parallel_size': 4, 'trust_remote_code': True}
[2024-10-07 Mon 20:28:47.354][INFO][MainProcess][1725177][MainThread][server][/root/miniconda3/lib/python3.10/site-packages/enova/serving/backend/base.py:39 - print_run_info()] [trace_id: dff43d8c8f14416e93ffad0f2179bfdd]|host: 0.0.0.0, port: 9199
INFO:     Started server process [1725177]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9199 (Press CTRL+C to quit)

At this point, each of the four GPUs shows 28.2 GB of VRAM in use.

With Meta-Llama-3.1-70B-Instruct-AWQ-INT4 the behavior is similar, except GPU memory usage reaches around 29-30 GB per GPU.
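As a rough cross-check, the log line "# GPU blocks: 52860" above can be converted into the amount of KV-cache memory vLLM preallocated per GPU. The sketch below assumes Llama-3.1-8B's published architecture (32 layers, 8 KV heads, head dim 128), an fp16 cache, vLLM's default block size of 16 tokens, and tensor_parallel_size=4; these are assumptions, not values read from this log.

```python
# Estimate the per-GPU KV-cache footprint implied by "# GPU blocks: 52860".
# Assumed values (Llama-3.1-8B architecture + vLLM defaults, not from the log):
num_blocks = 52860       # from the vLLM log line above
block_size = 16          # tokens per block (vLLM default)
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2          # fp16
tp_size = 4              # tensor_parallel_size from the repro command

# Each GPU holds num_kv_heads / tp_size heads; factor 2 covers K and V.
kv_heads_per_gpu = num_kv_heads // tp_size
bytes_per_block = 2 * num_layers * block_size * kv_heads_per_gpu * head_dim * dtype_bytes
kv_cache_gib = num_blocks * bytes_per_block / 1024**3
print(f"Implied KV cache per GPU: {kv_cache_gib:.1f} GiB")  # ~25.8 GiB
```

Under these assumptions the block count implies roughly 25.8 GiB of KV cache per GPU, which, together with the 1.39 GB of weights and the 1-3 GiB of CUDA-graph overhead the log warns about, is consistent with the observed ~28 GB.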
