perceptron: Isaac-0.1 implementation #40962
AkshatSh wants to merge 100 commits into huggingface:main
Conversation

Nice, lmk if you need any help or a quick review 🤗

@zucchini-nlp - we are just closing this out, feel free to give us a quick review while we just finish up testing!

amazing, reviewing on Monday :)
Nice job. I did not review everything, and went over to check whether the PR follows the transformers API. The most important comments are:
- We need to add an image processor class and wrap the image transforms inside it
- Is the "tensorstream" required to get the model running? AFAIU we can discard it, which is prob what we'll need to do. Transformers does not support streaming of multimodals yet
- The model structure has to follow the standards of other VLMs, with correct class names
LMK if you have any questions 🤗
```python
from perceptron.tensorstream.ops import (
    compute_mrope_pos_tensor,
    modality_mask,
    reconstruct_tensor_stream_from_compact_dict,
    slice as ts_slice,
    tensor_stream_token_view,
)
```

hmm is this for streaming video data? I am not sure we can add it as is, we need to either safe-import it or find a way to integrate it in self.video_processor
```python
logger = logging.get_logger(__name__)


class PixelShuffleSiglip2VisionConfig(Siglip2VisionConfig):
```

we need to call it PerceptronVisionConfig with the model's name
```python
# Prepare positional embeddings grid: (1, embed_dim, h, w)
positional_embeddings = (
    self.position_embedding.weight.reshape(self.position_embedding_size, self.position_embedding_size, -1)
    .permute(2, 0, 1)
    .unsqueeze(0)
)
```

we can prepare it once when the model is init, no?
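For illustration, a minimal sketch of that suggestion (the `VisionEmbeddings` class and its sizes are hypothetical; only `position_embedding` and the reshape chain come from the snippet above). Note the snapshot is detached, so this only fits a frozen/inference embedding; a trainable one would still need the reshape in forward:

```python
import torch
from torch import nn


class VisionEmbeddings(nn.Module):
    # Hypothetical module sketching the reviewer's suggestion: build the
    # (1, embed_dim, h, w) positional grid once in __init__ instead of on
    # every forward pass.
    def __init__(self, embed_dim: int = 8, position_embedding_size: int = 4):
        super().__init__()
        self.position_embedding = nn.Embedding(position_embedding_size**2, embed_dim)
        grid = (
            self.position_embedding.weight.detach()
            .reshape(position_embedding_size, position_embedding_size, -1)
            .permute(2, 0, 1)
            .unsqueeze(0)
        )
        # Register as a buffer so it follows .to(device)/.half() with the module.
        self.register_buffer("positional_embeddings", grid, persistent=False)


emb = VisionEmbeddings()
print(tuple(emb.positional_embeddings.shape))  # (1, 8, 4, 4)
```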
```python
mode = "bilinear"
align_corners = False
antialias = True
```

used only once, we don't need to save in a var and can move directly into the interpolate call
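The inlined version the reviewer suggests might look like this (a sketch; the `patches` input and target size are hypothetical, not from the PR):

```python
import torch
import torch.nn.functional as F

patches = torch.randn(1, 3, 8, 8)  # hypothetical (N, C, H, W) input
resized = F.interpolate(
    patches,
    size=(16, 16),
    mode="bilinear",      # constants passed inline instead of stored in locals
    align_corners=False,
    antialias=True,
)
print(tuple(resized.shape))  # (1, 3, 16, 16)
```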
```python
class RopeScaling(TypedDict, total=False):
    rope_type: str
    factor: float
    mrope_section: list[int]
    mrope_interleaved: bool
    low_freq_factor: float
    high_freq_factor: float
    original_max_position_embeddings: int
```

let's have the rope params in config.text_config.rope_scaling. We are planning a huge rope refactor, and until then it will be easier to follow the existing format
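Following the standard format, the same parameters would live as a plain dict on the text config rather than a model-specific TypedDict. A sketch (the values are illustrative only, not taken from the PR):

```python
# Illustrative only: rope parameters carried as a plain dict on the text
# config, matching how other transformers models store rope_scaling.
text_config = {
    "rope_scaling": {
        "rope_type": "default",
        "mrope_section": [16, 24, 24],  # hypothetical t/h/w split
        "mrope_interleaved": False,
    }
}
print(isinstance(text_config["rope_scaling"], dict))  # True
```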
comment still relevant. We will be merging the refactor next week hopefully, and we will add a typing hint with a TypedDict there. For now let's follow the standard way and type hint as a simple dict
```python
def apply_chat_template(
    self,
    messages: list[dict[str, Any]],
    tokenize: bool = False,
    add_generation_prompt: bool = False,
    **kwargs,
) -> Any:
    return self.tokenizer.apply_chat_template(
        messages, tokenize=tokenize, add_generation_prompt=add_generation_prompt, **kwargs
    )
```

```python
def __init__(
    self,
    tokenizer: AutoTokenizer,
    config: IsaacConfig,
```

the processor class does not expect a config and needs to be initialized with tokenizer, image_processor, kwargs
```python
super().__init__(config)
self.layers = torch.nn.ModuleList(
    [Qwen3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.rotary_emb = IsaacRotaryEmbedding(config, device=self.device)
```

if the text model is identical to qwen3, we need to init it with AutoModel.from_config(text_config). Otherwise we need to add an IsaacTextModel by copying with modular
@zucchini-nlp just pushed a round of updates addressing your feedback; give it another look when you get a moment
zucchini-nlp left a comment

Thanks for iterating! I gave a quick look at the first ~1500 lines.
We are currently insisting on following certain standards before merging models, as it causes many issues otherwise. Overall the code looks good, but I feel like the model still needs more standardization. So I left some comments and I will review the rest on Monday. Not sure if we can keep the tensorstream, let me see in detail how it is used
```python
try:
    from torchvision.transforms.v2 import functional as TVF
except ImportError:
    TVF = None
```

this is usually imported with `is_torchvision_available()` from `from ...utils import is_torchvision_available`. If it is used only in the fast image processing file, then we don't even need to safe-import :)
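The availability-guard pattern the reviewer refers to looks roughly like this (sketched with the public `transformers.utils` path rather than the in-repo relative import):

```python
from transformers.utils import is_torchvision_available

# Guard the optional dependency instead of a bare try/except around the import.
if is_torchvision_available():
    from torchvision.transforms.v2 import functional as TVF
else:
    TVF = None

print(isinstance(is_torchvision_available(), bool))  # True
```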
```python
# Vision preprocessing constants
VISION_MEAN = (0.5, 0.5, 0.5)
VISION_STD = (0.5, 0.5, 0.5)
VISION_SCALE = 1 / 255
```

the mean and std are available as IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD in image_utils. The other one can just be assigned to a var when needed. Let's not have global vars
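For reference, the constants the reviewer points to (imported here via the public `transformers.image_utils` module; the rescale factor is shown as a local assignment, as suggested):

```python
from transformers.image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD

# These match the (0.5, 0.5, 0.5) values the PR defined as module-level globals.
print(IMAGENET_STANDARD_MEAN)  # [0.5, 0.5, 0.5]
print(IMAGENET_STANDARD_STD)   # [0.5, 0.5, 0.5]

rescale_factor = 1 / 255  # assigned where needed instead of a global VISION_SCALE
```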
```python
do_rescale: bool | None
rescale_factor: float | None
do_normalize: bool | None
image_mean: float | Sequence[float] | None
image_std: float | Sequence[float] | None
do_convert_rgb: bool | None
```

these are already defined in the base class, no need to duplicate
```python
slow_image_processor_class = "IsaacImageProcessor"

resample = PILImageResampling.BILINEAR
model_input_names = ["patches", "token_grids"]
valid_kwargs = IsaacImageProcessorKwargs
unused_kwargs = ["size", "do_center_crop", "crop_size"]
```

the class attrs here should be all the image args with their default values, i.e. do_resize, size, image_mean etc. The attr slow_image_processor_class is not needed
```python
# Configuration
# ============================================================================

MAX_PIXELS = 60_000_000  # 60-megapixel ceiling ≈ 8200 × 7300 px
```

same, global vars are better assigned where they are used. In this case it's prob a cls attr of the fast image processor
```python
# EventStreamProcessor parameters (for backward compatibility)
self.video_patch_size = vision_patch_size
self.vision_max_num_patches = vision_max_num_patches
self.vision_min_num_patches = vision_min_num_patches
self.pixel_shuffle_scale = pixel_shuffle_scale

# Vision normalization parameters
self.vision_rescale_factor = float(vision_rescale_factor)
self.vision_mean = _normalize_rgb_values(vision_mean, name="vision_mean")
self.vision_std = _normalize_rgb_values(vision_std, name="vision_std")

# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
```

these are processing params. Do we really need them in the model code? I did not see where they are used, and usually modeling does not need to know how processing happened
```python
# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
self.vision_attn_implementation = vision_attn_implementation
```

not needed, the vision config has its own attn implementation attribute
```python
def get_text_config(self, *_, **kwargs) -> Qwen3Config:
    # Accept optional decoder/encoder flags to align with HF composite configs
    kwargs.pop("decoder", None)
    kwargs.pop("encoder", None)
    return self.text_config
```

i think the super get_text_config works well, or is it raising errors?
@zucchini-nlp just pushed another round of refinements based on the latest feedback. Would love for you to give it another pass when you get a chance

Hi @zucchini-nlp, gentle ping on this PR when you get a chance. Thanks for your time!

Hey @zucchini-nlp - just wanted to follow up again on this PR. It's been a little while since the last round, and your review would help us move things forward. Appreciate it!
```
@@ -0,0 +1,2278 @@
# Perceptron, Inc. Non-Production License
```

Missing date and the usual format header
```python
from genesis.public.tensorstream.tensor_stream import (
    Event,
    Stream,
    TensorStream,
    TextType,
    VisionType,
    create_stream,
    group_streams,
)
from genesis.public.tensorstream.tensor_stream_utils import (
    compute_mrope_pos_tensor,
    modality_mask,
    reconstruct_tensor_stream_from_compact_dict,
    tensor_stream_token_view,
)
from genesis.public.tensorstream.tensor_stream_utils import (
    slice as ts_slice,
)
```

This is the biggest hurdle for now: if I understand correctly, this is an if/else path for TensorStream. We don't want to add a new dependency to transformers (and here the import will simply fail).
```python
@property
def attn_implementation(self) -> str | None:
    return self._attn_implementation

@attn_implementation.setter
def attn_implementation(self, value: str | None) -> None:
    self._attn_implementation = value
```
```python
AutoImageProcessor.register(
    IsaacConfig,
    fast_image_processor_class=IsaacImageProcessorFast,
    exist_ok=True,
)
```

This is not needed: the mappings in processing_auto are enough
TensorStream is a core primitive for us; it underpins how we handle multimodal computation, and future open-source releases will include incremental improvements to its ergonomics and performance aligned with model updates. We see it as the right abstraction boundary for multimodal modelling. That said, we'll be guarding the imports here to avoid issues on that front. As a heads up, we'll also be following up with changes here that address feedback around pattern-reuse standards in transformers; feedback there is much appreciated.
Thanks @philippguevorguian, then I will see if we can integrate it. I can't say for now whether we'll be able to, as our principle for modeling files is to not abstract too much and keep the models "hackable", meaning a user should be able to intervene at any given point of a model's forward: adding a module, modifying the processing, etc. But if the rest is more […]. In particular the attention classes that are currently relying on […]. For cross-document masking, combinations of existing utils should be sufficient, like […]
zucchini-nlp left a comment

Huuge work cleaning it up. It already looks much much better 🤩
We need to rebase on main since there were a few big refactors merged recently. I commented below what needs to be changed for them. And the main question I have is about padding, since we need to try and delegate pad/truncate to the tokenizer's __call__ method. The rest is mostly nits about styling.
```
@@ -0,0 +1,128 @@
<!--Copyright 2025 Perceptron, Inc. and The HuggingFace Inc. team. All rights reserved.
```

Oops, missed this on the prior pass; resolved
```
Key implementation notes:

- **Packed vision attention** – `IsaacVisionEncoder` keeps track of per-image patch lengths and uses specialized attention
  kernels with custom `AttentionMaskConverter` utilities so the decoder only applies attention to real patches while supporting
  both FlashAttention and SDPA.
- **TensorStream-first pipeline** – `IsaacProcessor` converts chat templates into multimodal streams where every image gets a
  dedicated event with spatial metadata. `IsaacModel` can embed that stream directly (using `embed_stream`) and automatically
```

maybe the doc is a bit out-dated with all the changes. Let's update it before merging

Updated the doc here with all of the changes
```python
("imagegpt", ("ImageGPTImageProcessor", "ImageGPTImageProcessorFast")),
("instructblip", ("BlipImageProcessor", "BlipImageProcessorFast")),
("internvl", ("GotOcr2ImageProcessor", "GotOcr2ImageProcessorFast")),
("isaac", (None, "IsaacImageProcessorFast")),
```

oops, this needs a rebase on main, we merged a huge refactor yesterday. The mapping is now a `dict[dict[str, str]]`

Made the update with the TorchVisionBackend
```python
if TYPE_CHECKING:
    # This significantly improves completion suggestion performance when
    # the transformers package is used with Microsoft's Pylance language server.
    PROCESSOR_MAPPING_NAMES: OrderedDict[str, str | None] = OrderedDict()
else:
    PROCESSOR_MAPPING_NAMES = OrderedDict(
        [
```

accidentally removed TYPE_CHECKING?

Yes, merge mistake. Reverted
```
@@ -0,0 +1,330 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
```

this file's contents need to be moved to image_processing_isaac.py to align with the refactor. I think rebase and modular_convert will do everything for you

Yep, after updating the processing related code this was changed
```python
else:
    mm_token_type_ids = mm_token_type_ids.to(device=inputs_embeds.device, dtype=torch.long)
```

same comment about moving to devices

Removed device handling logic here
```python
if isinstance(attention_mask, dict):
    attention_mask = attention_mask.get("full_attention", next(iter(attention_mask.values())))
```

do we want to get non-full attention? That would prob not work well to build position ids

Made this more opinionated to be all full attention
```python
vision_patch_attention_mask = (
    image_patch_attention_mask if vision_patch_attention_mask is None else vision_patch_attention_mask
)
```

there are two args that look suspiciously similar, are they the same thing with different names?

Standardized this on image_patch_attention_mask
```python
if position_ids is None or position_ids.ndim == 2:
    position_ids = self._prepare_position_ids_for_generation(
        input_ids,
        {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "past_key_values": past_key_values,
            "mm_token_type_ids": mm_token_type_ids,
            "vision_token_grids": vision_token_grids,
            "vision_token_offsets": vision_token_offsets,
            "vision_token_lengths": vision_token_lengths,
        },
    )
```

this already happens when we call generate(), so we shouldn't need to call it manually each decode step
```python
def get_input_embeddings(self) -> nn.Module:
    return self.model.get_input_embeddings()
```

not needed, the base class can work it out when it is inside model
Also would be great to fix the CI before we request a core maintainer's review (the last step before merging)
…#16) * fix: use torchvisionbackend * fix: import IsaacImageProcessor * fix: resample not interpolation * style: orgranize import * chore: auto processing auto from main * feat: register isaac image processor according to new convention * fix: update to new config style * fix: correct pix2struct import * docs: initial doc update * feat: re-register isaac processor to auto * refactor: move max_posiiton_embeddings to isaac config refactor: move max_position_embeddings to isaac config * TEMP pop! * docs: update date * style: remove removed attr * style: add config attr for completeness * style: drop redundant merge_with_config_defaults * style: remove redundant positions ids handling * refactor: rely on base class for setting embeddings * fix: always use full attention * style: clarify padding logic * chore: remove stale artifcat * fix: kwargs name! * refactor: isolate custom padding to image processor pad method * feat: no device movement * style: align with transformers standard for loading rope params * refactor: drop unneeded arg filter * feat: compile check image presence instead * docs: add clarifying comment for keeping empty tensors * refactor: move broadcasting to forward WIP * style: use new layer validation functionality * feat: update embedding access mixin to support nested paths! * chore: convert artifacts * refactor: inline build batch * style: drop duplicate test * test: polygons * feat: polygon extraction * test: polygon generation test * style: align with new config implementation style * style: add date * chore: remove all isaac image processor fast * test: new image processor test setup * feat: special path for uint8 interp * test: update image processor test for torchvision backend * fix: don't mutate nested outputs; copy image index * style: drop redundant copy * chore: make fix-repo * chore: convert artifacts * chore: fix docstring
```python
if all(len(sample_images) == 0 for sample_images in images):
    tensors = {
        "vision_patches": torch.zeros((batch_size, 0, 0, 0), dtype=torch.float32),
        "vision_patch_attention_mask": torch.zeros((batch_size, 0, 0), dtype=torch.long),
        "vision_token_grids": torch.zeros((batch_size, 0, 2), dtype=torch.long),
    }
    return BatchFeature(data=tensors, tensor_type=return_tensors)
```

weird, inside the processor we shouldn't be routing empty lists. I'll go check the processing code
```python
for key in ("use_cache", "rope_theta", "max_position_embeddings"):
    kwargs.pop(key, None)
```

ig this comes from saved checkpoints which we don't want to re-save?
```python
@auto_docstring
class IsaacModel(Qwen3PreTrainedModel):
```

let's instead copy from qwen-VL so the identical parts can be copied. As mentioned, I am seeing a few similar parts such as get_placeholder_mask and compute3d_position_ids. Other parts like forward need new args, so they can be copied and adapted.
IMO things could be simplified

We inherit from Qwen3VL now after the big refactor. After trying a few approaches here, delegating model-helper logic to the qwen3vl stack generally leads to increased complexity, as the Isaac image truncation logic differs from Qwen's policy
* clean up and rearrange code * fix: allow cutting through any image span * test: cropping middle of arbitrary image * test: drop stale / redundant tests * test: drop flash attn debug test * test: drop stale helpers * fix: restore isaac processor compatibility * test: use public api in isaac integration tests * fix: restore isaac generation outputs * simplif * style: simplify 2 * test: drop redundant tests * test: drop more low level image processing tests * test: no plan to define 4 channel numpy processing * test: drop image processor properties test * test: focus image processing tests * test: drop unneeded input trimming helper, chat template now omits the newline by default * tests: enable Isaac tokenizer defaults coverage * isaac: support assistant mask chat template tests * tests: cover Isaac image placeholder expansion * tests: patch Isaac chat template for assistant masks * tests: use Isaac default assistant mask template * tests: align Isaac image batching coverage * tests: drop unneeded utilities / low-level tests * style: isaacvisionmodel not isaacvisiontransformer * tests: clean up imports * wip 1 * style: drop now unneeded check_argument_for_proper_class override * test: don't skip where assisted decoding works * style: inherit from closer base class * style: lint * chore: convert artifacts --------- Co-authored-by: raushan <raushan@huggingface.co>
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, isaac
zucchini-nlp left a comment

Okay, I have a few questions on processing and 2D RoPE based on the previous suggestions. I left questions below; it would be great to incorporate the earlier suggestions if possible, or add a small comment in the code noting what the main diff is.
Also looked at the test file, and I'm seeing a PR ckpt used for the integration tests. Do you know what the plan will be after merging the model? (do we merge the hub changes or create a new hub repo, so we can clean up the testing ckpt before a final core review)
```python
    size: SizeDict,
    **kwargs,
) -> torch.Tensor:
    if image.dtype == torch.uint8:
```

a small explanation comment on uint8 would be nice
```python
prompt = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
```

the output from apply_chat_template is already a dict of inputs
```python
    f"vision_config must be a dict or an IsaacVisionConfig instance, got {type(self.vision_config).__name__}."
)

self.vision_rescale_factor = float(self.vision_rescale_factor)
```

nit: it will be a float and otherwise error out, as per the type annotation. So either we need to annotate (int | float) or we can just delete this line?
```python
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
```

these two will be auto-filled by decorators, can delete now
```python
if image_metadata is None:
    pixel_shuffle_scale = self.config.vision_config.pixel_shuffle_scale_factor
    downsampled_height = flat_image_grid_thw[:, 1].div(pixel_shuffle_scale, rounding_mode="floor")
    downsampled_width = flat_image_grid_thw[:, 2].div(pixel_shuffle_scale, rounding_mode="floor")
    lengths = downsampled_height * downsampled_width
    offsets = torch.zeros_like(lengths)
else:
    torch_compilable_check(
        image_metadata.shape[:2] == image_grid_thw.shape[:2],
        "IsaacModel.get_image_features expects batch-major metadata aligned with `image_grid_thw`.",
    )
    offsets = image_metadata[active_slot_mask][:, 0]
    lengths = image_metadata[active_slot_mask][:, 1]

image_features = tuple(
    projected_features[image_idx, offset : offset + length]
    for image_idx, (offset, length) in enumerate(zip(offsets.tolist(), lengths.tolist(), strict=True))
)
```

maybe shorter?

```python
if image_metadata is not None:
    offsets = [...]
    projected_features = tuple([... for offset in offsets])
    return pooler_output=projected_features
```
```python
generated_ids = outputs.sequences[:, inputs["input_ids"].shape[1] :]
generated_text = self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
expected_fragment = "The image is a close-up photograph of a red cross symbol."
assert expected_fragment in generated_text
```

self.assertListEqual and self.assertEqual are much better for us when checking CI outputs, can you fix this everywhere
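A sketch of the unittest-style assertions the reviewer asks for (the decoded string is a hypothetical stand-in for real model output; the point is that assertIn/assertEqual report both sides on failure, while a bare `assert` does not):

```python
import unittest


class IsaacIntegrationTest(unittest.TestCase):
    # Hypothetical stand-in for the real integration test.
    def test_generated_text(self):
        generated_text = "The image is a close-up photograph of a red cross symbol."
        # assertIn/assertEqual print expected vs. actual when CI fails,
        # unlike a bare `assert expected in generated_text`.
        self.assertIn("red cross symbol", generated_text)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsaacIntegrationTest)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())  # True
```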
```python
    "sum": 92510.4578057677,
    "l2_norm": 3490.2146142251,
}
assert logit_stats == expected_logit_stats
```

torch.allclose would be better, we always get numerical precision diffs due to hardware or pckg versions
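The tolerance-based comparison could look like this (the stats values here are illustrative, and the rtol/atol choices are assumptions, not from the PR):

```python
import torch

# Hypothetical logit statistics; in the real test these come from the model.
logit_stats = torch.tensor([92510.4578057677, 3490.2146142251])
expected = torch.tensor([92510.4578, 3490.2146])

# torch.allclose tolerates the small numerical drift across hardware and
# package versions that exact dict equality does not.
print(torch.allclose(logit_stats, expected, rtol=1e-5, atol=1e-3))  # True
```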
```python
sample_lengths = [single_input["input_ids"].squeeze(0).shape[0] for single_input in single_inputs]
for i, (single_input, batch_ids, single_len) in enumerate(zip(single_inputs, batch_input_ids, sample_lengths)):
    single_ids = single_input["input_ids"].squeeze(0)
    torch.testing.assert_close(batch_ids[-single_len:], single_ids)

batch_modality_row = batch_inputs["mm_token_type_ids"][i]
```

this is testing processor outputs, which imo belongs in test_processor_isaac.
Let's move it if we need it; not sure if it is already being tested there. If it's a duplicate or already covered by a similar unit test, then we can simply delete 😅
```python
max_new_tokens = 256
dtype = torch.bfloat16

def setUp(self):
```

same comments on this IntegrationTest
```python
BASE_MODEL_ID = os.environ.get("ISAAC_TEST_MODEL_ID", "PerceptronAI/Isaac-0.1-Base")
BASE_MODEL_REVISION = os.environ.get("ISAAC_TEST_MODEL_REVISION", "refs/pr/3") or None
LOCAL_CHECKPOINT = os.environ.get("ISAAC_TEST_MODEL_PATH")
```
Perceptron Isaac Implementation
Perceptron released the open-weight models Isaac-0.1 and Isaac-0.1-Base, 2B dense models for perception.