
Conversation

@fynnsu (Collaborator) commented on Dec 18, 2025

  • Add acceptance rate metric collection to tests/e2e/vllm/run_vllm.py
  • Add optional acceptance rate asserts to tests/e2e/vllm/utils.py (a rough sketch of the idea is shown below)
  • Add tests/e2e/vllm/test_gen_train_acceptance.py based on examples/data_generation_and_training/llama3_8b_sharegpt_5k.py, which trains a Llama 3.1 8B model on 5k samples from ShareGPT and then checks the acceptance rate on several test prompts. This test uses the functionality added to the above files.
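
A rough sketch of the kind of optional assert this adds (the helper name, metrics dict shape, and threshold handling here are illustrative assumptions, not the actual utils.py code):

```python
# Hypothetical sketch of an optional acceptance-rate assert; names and the
# metrics layout are assumptions, not the actual utils.py implementation.
from typing import Optional


def maybe_assert_acceptance_rate(
    metrics: dict[str, float],
    min_acceptance_rate: Optional[float] = None,
) -> None:
    """Fail if the measured acceptance rate falls below a threshold.

    When ``min_acceptance_rate`` is None the check is skipped, so callers
    that only want the raw metrics are unaffected.
    """
    if min_acceptance_rate is None:
        return
    acceptance_rate = metrics["acceptance_rate"]
    assert acceptance_rate >= min_acceptance_rate, (
        f"Acceptance rate {acceptance_rate:.3f} is below the required "
        f"minimum {min_acceptance_rate:.3f}"
    )
```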

Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
@github-actions (bot) commented on Dec 18, 2025

📦 Build Artifacts Available
The build artifacts (`.whl` and `.tar.gz`) have been successfully generated and are available for download: https://github.com/vllm-project/speculators/actions/runs/20377405084/artifacts/4927042111.
They will be retained for up to 30 days.
Commit: 1f86169

@shanjiaz (Collaborator) left a comment


Looks awesome! Do we want to test regression for different models? Maybe the models in examples?

@brian-dellabetta (Collaborator) left a comment


nice!

@fynnsu (Collaborator, Author) commented on Dec 19, 2025

> Looks awesome! Do we want to test regression for different models? Maybe the models in examples?

Yes, but I think we need to review our compute budget for this, because the current Llama 3.1 8B on 5k samples test already takes half an hour on a single H100 GPU.

@dsikka (Collaborator) left a comment


Just a question, otherwise LGTM.

Inline comment on the extract_metrics hunk:

return parser.parse_args()


def extract_metrics(

Is this different from how Rahul set up extracting metrics from the logs in his benchmarking work? Is there any way we could share that functionality?

@fynnsu (Collaborator, Author) replied:

Yeah, it's a different approach. I got this approach from the vLLM examples/offline_inference/spec_decode.py script.

The challenge is that this system uses vLLM's metrics and works because we're running vLLM through the Python API. Rahul's testing instead uses the CLI to spin up a vLLM instance, which GuideLLM then interacts with. The advantage of the GuideLLM approach is that it lets us simulate slightly more "real world" workloads and measure actual server response times. I can look into whether there's a way to use the metrics system to get acceptance rates instead of the current log-scraping method, but either way I don't think we can easily combine the implementations.
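
For reference, a minimal sketch of the Python-API approach, loosely following the vLLM spec_decode example; the speculative_config keys and counter names below are assumptions about the installed vLLM version, not necessarily the exact code in this PR:

```python
# Minimal sketch: run a speculative-decoding model through vLLM's Python API
# and compute an acceptance rate from the engine's counters. The
# speculative_config keys and the "vllm:spec_decode_*" counter names are
# assumptions and may differ across vLLM versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "path/to/trained/draft",  # hypothetical path to the trained draft model
        "num_speculative_tokens": 3,
    },
)
llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)

accepted = drafted = 0
for metric in llm.get_metrics():  # available on recent (V1 engine) vLLM releases
    if metric.name == "vllm:spec_decode_num_accepted_tokens":
        accepted = metric.value
    elif metric.name == "vllm:spec_decode_num_draft_tokens":
        drafted = metric.value

acceptance_rate = accepted / drafted if drafted else 0.0
print(f"acceptance rate: {acceptance_rate:.3f}")
```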
