End to end acceptance rate regression test #226
base: main
Conversation
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
📦 Build Artifacts Available
Force-pushed 357a855 to caea2c2
shanjiaz left a comment
Looks awesome! Do we want to test regression for different models? Maybe the models in examples?
brian-dellabetta left a comment
nice!
Yes, but I think we need to review our compute budget for this, because the current Llama 3.1 8B on 5k samples test already takes half an hour on a single H100 GPU.
dsikka left a comment
Just a question, otherwise LGTM.
    return parser.parse_args()


def extract_metrics(
Is this different from how Rahul set up extracting metrics from the logs in his benchmarking work? Is there any way we could share that functionality?
Yeah, it's a different approach. I got this approach from the vllm examples/offline_inference/spec_decode.py script.
The challenge is that this system uses vllm metrics and works because we're running vllm through the Python API. Rahul's testing instead uses the CLI to spin up a vllm instance, which guidellm then interacts with. The advantage of the guidellm approach is that it allows us to simulate slightly more "real world" workloads and measure actual server response times. I can look into whether there's a way to use the metrics system to get acceptance rates instead of the current log scraping method, but either way I don't think we can easily combine the implementations.
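For reference, here is a minimal sketch of what reading the acceptance rate through vLLM's metrics system (rather than scraping logs) could look like, loosely following vllm's examples/offline_inference/spec_decode.py. It assumes a vLLM version whose V1 engine exposes `LLM.get_metrics()` and the `vllm:spec_decode_*` counters; the model paths, draft model, prompt, and speculative-decoding arguments are placeholders and may need adjusting for the vLLM release in use.

```python
from vllm import LLM, SamplingParams

# Placeholder base/draft models -- not the ones used in this PR's test.
# The exact speculative-decoding arguments vary between vLLM releases.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "path/to/draft-model",  # hypothetical draft model
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["Explain how speculative decoding works."],
    SamplingParams(temperature=0.0, max_tokens=128),
)

# Assumption: this vLLM version exposes engine counters via get_metrics(),
# as in vllm's examples/offline_inference/spec_decode.py.
num_draft_tokens = 0
num_accepted_tokens = 0
for metric in llm.get_metrics():
    if metric.name == "vllm:spec_decode_num_draft_tokens":
        num_draft_tokens += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        num_accepted_tokens += metric.value

acceptance_rate = num_accepted_tokens / max(num_draft_tokens, 1)
print(f"Draft token acceptance rate: {acceptance_rate:.2%}")
```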
Adds tests/e2e/vllm/run_vllm.py, tests/e2e/vllm/utils.py, and tests/e2e/vllm/test_gen_train_acceptance.py. The test is based on examples/data_generation_and_training/llama3_8b_sharegpt_5k.py, which trains a Llama 3.1 8B model on 5k samples from ShareGPT and then checks the acceptance rate on several test prompts. This test uses the functionality added in the files above.
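As a rough illustration only (not the actual contents of test_gen_train_acceptance.py), an end-to-end acceptance-rate regression check could be structured along these lines. The helper functions, prompts, and threshold below are hypothetical stand-ins for the functionality added in run_vllm.py and utils.py.

```python
import pytest

# Hypothetical helpers standing in for the functionality added in
# tests/e2e/vllm/run_vllm.py and tests/e2e/vllm/utils.py.
from utils import generate_and_train_draft_model, measure_acceptance_rate

TEST_PROMPTS = [
    "Summarize the plot of a famous novel.",
    "Explain how speculative decoding works.",
]

# Hypothetical floor; a real value would come from a known-good run.
MIN_ACCEPTANCE_RATE = 0.55


@pytest.mark.slow  # the full pipeline takes ~30 min on a single H100
def test_gen_train_acceptance(tmp_path):
    # Train a draft model on 5k ShareGPT samples, mirroring
    # examples/data_generation_and_training/llama3_8b_sharegpt_5k.py.
    draft_model_path = generate_and_train_draft_model(
        base_model="meta-llama/Llama-3.1-8B-Instruct",
        num_samples=5000,
        output_dir=tmp_path,
    )

    # Run vLLM through the Python API and read back the acceptance rate.
    acceptance_rate = measure_acceptance_rate(
        base_model="meta-llama/Llama-3.1-8B-Instruct",
        draft_model=str(draft_model_path),
        prompts=TEST_PROMPTS,
    )

    assert acceptance_rate >= MIN_ACCEPTANCE_RATE
```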