Adding VLM pipeline #234
Conversation
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
```diff
-SEQ_LEN = 32
-CTX_LEN = 32
+SEQ_LEN = 1024
+CTX_LEN = 1280
```
Changing this here will affect the causal_lm models; specify another set of constants for VLM models.
Since the above seq_len/ctx_len values are not generalizable to other VLMs, I think it is better to keep them inside the model-specific function.
resolved in new patch
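For reference, a minimal sketch of the separation being asked for; the `VLM_*` names are hypothetical, not from this PR:

```python
# QEfficient/utils/constants.py (sketch; VLM_* names are assumptions)

# Existing CausalLM defaults stay untouched:
SEQ_LEN = 32
CTX_LEN = 32

# Separate defaults for the VLM path; Phi-3-vision prompts carry image
# tokens, so they need a longer sequence and context window:
VLM_SEQ_LEN = 1024
VLM_CTX_LEN = 1280
```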
```diff
 ONNX_EXPORT_EXAMPLE_BATCH_SIZE = 1
-ONNX_EXPORT_EXAMPLE_SEQ_LEN = 32
+ONNX_EXPORT_EXAMPLE_SEQ_LEN = 1024
```
Same here; please verify and make sure the existing CausalLM pipeline is not broken.
resolved in new patch
Keep name test_image_text_to_text_models
resolved in new patch
```python
# raise TypeError("missing required argument: 'full_batch_size'")

# if kv_cache_batch_size and not full_batch_size:
#     raise ValueError(
```
Avoid any commented-out lines in code.
resolved in new patch
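If the validation itself is still wanted, one resolution is to make it live code rather than comments; a sketch under that assumption (the helper name, argument names, and messages are illustrative):

```python
def validate_batching_args(full_batch_size, kv_cache_batch_size, continuous_batching):
    # Hypothetical helper mirroring the commented-out checks above.
    if continuous_batching and full_batch_size is None:
        raise TypeError("missing required argument: 'full_batch_size'")
    if kv_cache_batch_size and not full_batch_size:
        raise ValueError(
            "kv_cache_batch_size is only supported with continuous batching; "
            "please also pass full_batch_size."
        )
```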
Please add documentation and an example script.
quic-amitraj left a comment
I observed that code cleaning is yet to be done. Please address all the comments and remove all the commented-out and unnecessary lines. Also update the docstrings accordingly.
```python
from QEfficient import QEFFAutoModelForImageTextToText
from transformers import AutoTokenizer

model_name = "llava"
```
```python
# warnings.warn(
#     "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
```
Remove commented-out code.
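If the deprecation path is meant to stay, the warning should be live code and the `breakpoint()` dropped entirely; a sketch assuming `full_batch_size` is the deprecated keyword:

```python
import warnings

def warn_if_full_batch_size(full_batch_size):
    # Hypothetical helper: emit the warning instead of leaving it commented out.
    if full_batch_size is not None:
        warnings.warn(
            "full_batch_size argument is deprecated. Use continuous_batching=True instead.",
            DeprecationWarning,
            stacklevel=2,
        )
```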
```python
from transformers import AutoTokenizer

# Initialize the model using from_pretrained similar to transformers.AutoModelForCausalLM
model_name = "gpt2"
```
Verify the PR on
```python
from QEfficient.transformers.quantizers.quant_transforms import AwqToMatmulNbitsTransform, GPTQToMatmulNbitsTransform
from QEfficient.utils import constants, get_padding_shape_from_config

# from QEfficient.transformers.models.phi3_vision.modeling_phi3_vision import Phi3VModelWrapper
```
Please remove unused imports
```python
class QEFFAutoModelForImageTextToText(QEFFTransformersBase):
    """
    The QEFF class is designed for manipulating any causal language model from the HuggingFace hub.
```
Please update the docstring; it is currently referring to a causal language model.
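A possible wording for the updated docstring (a sketch, not the merged text):

```python
class QEFFAutoModelForImageTextToText(QEFFTransformersBase):  # base class as imported in the PR
    """
    The QEFF class is designed for manipulating any image-text-to-text
    (vision-language) model from the HuggingFace hub, e.g. Phi-3-vision,
    analogous to ``transformers.AutoModelForImageTextToText``.
    """
```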
```python
from transformers import AutoTokenizer

model_name = "llava"
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_hidden_layers=2)
```
Why is it using QEFFAutoModelForCausalLM? Can you update it with the right script?
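A sketch of the corrected example, assuming `QEFFAutoModelForImageTextToText.from_pretrained` mirrors the CausalLM signature; the `"llava"` name is kept from the diff and likely needs a full hub id:

```python
from QEfficient import QEFFAutoModelForImageTextToText

model_name = "llava"  # kept from the diff; likely needs a full hub id
model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, num_hidden_layers=2)
```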
```python
# warnings.warn(
#     "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
```
Please remove the debugging/commented lines
```python
self.continuous_batching = continuous_batching
self.is_tlm = is_tlm
self.pad_token_id = model.config.pad_token_id
self.ctx_len = 1280
```
Why is ctx_len hardcoded?
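One way to resolve this: take `ctx_len` as a constructor argument and fall back to a shared constant instead of a magic number (a sketch; `VLM_CTX_LEN` is the hypothetical constant from the earlier thread):

```python
from QEfficient.utils import constants

class QEFFAutoModelForImageTextToText(QEFFTransformersBase):  # base class as imported in the PR
    def __init__(self, model, continuous_batching=False, is_tlm=False, ctx_len=None, **kwargs):
        self.continuous_batching = continuous_batching
        self.is_tlm = is_tlm
        self.pad_token_id = model.config.pad_token_id
        # Fall back to a shared default instead of hardcoding 1280:
        self.ctx_len = ctx_len if ctx_len is not None else constants.VLM_CTX_LEN
```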
```python
:ctx_len (int): Maximum context length to compile the model.
:n_layers (int): Number of layers for the Model.
"""
# replace_transformers_quantizers()
```
Please remove unwanted commented lines
```python
streamer = TextStreamer(processor)
# Testing for Phi-3.5 only atm
inputs = _generate_inputs(model_hf, processor)
breakpoint()
```
```python
num_hidden_layers=n_layer,
_attn_implementation="eager",
trust_remote_code=True,
# Check if this works
```
Please remove the unwanted commented lines (and the `breakpoint()`) from this method.
```python
model_hf, _ = load_vlm_model(model_config)
# Load processor instead
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# config = model_hf.config
```
remove unwanted commented line
```python
qeff_model.generate(inputs, streamer, device_ids=[0], runtime_ai100=False)
# cloud_ai_100_tokens = exec_info[0]  # Because we always run for single input and single batch size
# gen_len = ort_tokens.shape[-1]
# assert (
```
Why are all the asserts commented out? The objective of this method is to test output correctness across the native PyTorch model output, the transformed PyTorch output, the ORT output, and the AI 100 output. Please add all the tests back.
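A hedged reconstruction of the comparison chain the reviewer is asking for, modeled on the existing CausalLM tests (the variable names and generate return value are assumptions):

```python
exec_info = qeff_model.generate(inputs, streamer, device_ids=[0])
cloud_ai_100_tokens = exec_info[0]  # single input, single batch size
gen_len = ort_tokens.shape[-1]

assert (pytorch_hf_tokens == pytorch_kv_tokens).all(), \
    "Tokens don't match between HF PyTorch output and KV-transformed output."
assert (pytorch_kv_tokens == ort_tokens).all(), \
    "Tokens don't match between KV-transformed output and ONNXRT output."
assert (ort_tokens == cloud_ai_100_tokens[:, :gen_len]).all(), \
    "Tokens don't match between ONNXRT output and Cloud AI 100 output."
```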
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
```python
from QEfficient.utils import hf_download


def load_vlm_model(model_config):
```
Change name to image_text_to_text
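The rename being requested, with the body left unchanged (sketch):

```python
def load_image_text_to_text_model(model_config):
    # Same body as the former load_vlm_model; only the name changes to
    # match the image_text_to_text naming used elsewhere in the PR.
    ...
```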
```python
)
inputs["attention_mask"] = torch.nn.functional.pad(
    inputs["attention_mask"], (0, 1024 - input_ids_size), "constant", 0
)
```
The hardcoded 1024 values have to be replaced with constants.SEQ_LEN.
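A sketch of the requested change, assuming the constant lives in `QEfficient.utils.constants`; whether the shared `SEQ_LEN` or a VLM-specific value is right is the same open question as in the constants thread above:

```python
import torch
from QEfficient.utils import constants

input_ids_size = inputs["input_ids"].shape[1]  # as in the surrounding test code (assumed)
pad_len = constants.SEQ_LEN - input_ids_size
inputs["attention_mask"] = torch.nn.functional.pad(
    inputs["attention_mask"], (0, pad_len), "constant", 0
)
```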
Already addressed in #267
We were able to create an AutoModel class for VLMs, specifically for Phi-3-vision.

Supported features are:

TODO:
The transformers CLIP model needs to have its attention implementation set to eager manually for test_vlm_model to run.