[RFC][core][V1] generalize structured output manager and backends #1
This PR proposes moving all structured output logic into `v1/structured_output` and standardising the interfaces that the rest of the code uses to work with structured outputs. Currently, structured output logic is scattered and fragmented across the code base, with pieces in `Scheduler`, `StructuredOutputManager` and `gpu_model_runner`.

Benefits:

- The `init_batch` callback could be used to trigger the re-shuffling of the grammar mask asynchronously, rather than synchronously in the `gpu_model_runner` as is currently implemented.

This requires only a few changes to the existing backend logic for xgrammar and guidance in `vllm/v1/structured_output/backend_guidance.py` and `vllm/v1/structured_output/backend_xgrammar.py`. The largest change is moving the bitmasking logic from the `gpu_model_runner` to `vllm/v1/structured_output/bitmasking_grammar.py`.

I have tested this with xgrammar and guidance.
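To make the `init_batch` benefit concrete, here is a minimal sketch of how a callback could start re-shuffling the grammar mask in the background while the model runner continues with other work. Everything in it is an assumption for illustration: the class name `BitmaskReorderer`, the method signatures, and the convention that a row of `-1` words means "no token is masked out" are hypothetical and are not taken from this PR or from vLLM's actual API.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional, Sequence


class BitmaskReorderer:
    """Hypothetical helper: reorders per-request bitmask rows to batch order."""

    def __init__(self, words_per_row: int) -> None:
        # Number of integer words in one bitmask row (assumed layout).
        self._words_per_row = words_per_row
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def init_batch(
        self,
        batch_req_ids: Sequence[str],
        masks_by_req: Dict[str, List[int]],
    ) -> None:
        """Callback for when the batch order is known; returns immediately."""
        # Copy the inputs so later mutation by the caller cannot race with
        # the background task.
        self._pending = self._executor.submit(
            self._reorder, list(batch_req_ids), dict(masks_by_req)
        )

    def _reorder(
        self,
        batch_req_ids: List[str],
        masks_by_req: Dict[str, List[int]],
    ) -> List[List[int]]:
        # Assumption: a row of -1 words (all bits set) allows every token,
        # which is what requests without a grammar receive here.
        allow_all = [-1] * self._words_per_row
        return [list(masks_by_req.get(req_id, allow_all))
                for req_id in batch_req_ids]

    def result(self) -> List[List[int]]:
        """Blocks until the reordered mask is ready, then returns it."""
        if self._pending is None:
            raise RuntimeError("init_batch() has not been called")
        return self._pending.result()

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


if __name__ == "__main__":
    reorderer = BitmaskReorderer(words_per_row=2)
    # Batch order differs from the order in which masks were produced,
    # and request "c" has no grammar attached.
    reorderer.init_batch(["b", "a", "c"], {"a": [1, 2], "b": [3, 4]})
    # ... the runner could prepare inputs here while reordering proceeds ...
    print(reorderer.result())
    reorderer.shutdown()
```

The point of the design is that `init_batch` only schedules the work, so the runner blocks in `result()` only at the moment the mask is actually needed. A real implementation would operate on tensors rather than Python lists.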
WARNING: I have yet to refactor `tpu_model_runner` with this new logic, but I think this change will help clean up the code duplication between `gpu_model_runner` and `tpu_model_runner`, with any TPU-specific logic moving into `bitmasking_grammar.py`. I wanted to gather feedback before taking the time to make changes to the `tpu_model_runner`.