[metrics] update metrics markdown file #4061
Merged
Update the metrics documentation file and change the types of some metrics.

Added the following:
| Metric | Type | Description | Unit |
| --- | --- | --- | --- |
| fastdeploy:prefix_cache_token_num | Counter | Total number of prefix-cached tokens | Count |
| fastdeploy:prefix_gpu_cache_token_num | Counter | Total number of prefix-cached tokens on GPU | Count |
| fastdeploy:prefix_cpu_cache_token_num | Counter | Total number of prefix-cached tokens on CPU | Count |
| fastdeploy:batch_size | Gauge | Actual batch size during inference | Count |
| fastdeploy:max_batch_size | Gauge | Maximum batch size determined at service startup | Count |
| fastdeploy:available_gpu_block_num | Gauge | Number of available GPU blocks in the cache, including prefix-cache blocks not yet officially released | Count |
| fastdeploy:free_gpu_block_num | Gauge | Number of free blocks in the cache | Count |
| fastdeploy:max_gpu_block_num | Gauge | Total number of blocks determined at service startup | Count |
| available_gpu_resource | Gauge | Fraction of blocks available, i.e. available_gpu_block_num / max_gpu_block_num | Count |
| fastdeploy:requests_number | Counter | Total number of requests received | Count |
| fastdeploy:send_cache_failed_num | Counter | Total number of cache-send failures | Count |
| fastdeploy:first_token_latency | Gauge | Time to generate the first token for the most recent request | Seconds |
| fastdeploy:infer_latency | Gauge | Time to generate a single token for the most recent request | Seconds |
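Since the PR changes the types of some metrics, the Counter/Gauge distinction in the table is the key point: a Counter is a cumulative total that may only increase (e.g. requests_number), while a Gauge holds an instantaneous value that can move in either direction (e.g. batch_size). The following is a minimal, stdlib-only sketch of that distinction, not FastDeploy's actual implementation; the class names and sample values here are illustrative assumptions.

```python
# Minimal sketch (NOT FastDeploy's code) of the two metric types in the table.

class Counter:
    """Cumulative total, e.g. fastdeploy:requests_number. May only increase."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("Counters can only increase")
        self.value += amount

class Gauge:
    """Instantaneous value, e.g. fastdeploy:batch_size. May go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

requests_number = Counter()
requests_number.inc()   # one request received
requests_number.inc()   # another request; the total only grows

batch_size = Gauge()
batch_size.set(8)       # current batch holds 8 requests
batch_size.set(3)       # gauges may decrease as the batch shrinks

# available_gpu_resource is derived from the two block gauges
# (640 and 1024 are made-up sample values):
available_gpu_block_num, max_gpu_block_num = 640, 1024
available_gpu_resource = available_gpu_block_num / max_gpu_block_num

print(requests_number.value, batch_size.value, available_gpu_resource)
```

This also makes clear why a quantity like batch_size must be a Gauge rather than a Counter: it shrinks whenever requests finish, which a Counter cannot express.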