
feat: Add basic text generation support with native models, initially supporting Gemma3 #12392

Merged
comfyanonymous merged 39 commits into Comfy-Org:master from kijai:gemma3
Feb 19, 2026

Conversation

@kijai (Collaborator) commented Feb 10, 2026

This adds generic text generation support, currently tested and working with:

  • Gemma3 12B
  • Gemma3 4B (needs new model file to support images)

Generation itself also works with at least Qwen VL 2.5, but the model loading part still needs a way to handle the lm_head weight so that it isn't loaded when text generation is unused. This isn't an issue with Gemma3, which doesn't have a separate lm_head.
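One way to handle this (a sketch, not the PR's actual loading code; the function name and the `lm_head.` key prefix are assumptions based on common checkpoint layouts) is to filter lm_head entries out of the state dict unless generation is enabled:

```python
def filter_lm_head(state_dict, enable_generation):
    """Drop the separate lm_head weights when text generation is unused.

    Gemma3 has no separate lm_head, so nothing is dropped for it;
    models like Qwen VL 2.5 carry a distinct lm_head.weight.
    """
    if enable_generation:
        return state_dict
    return {k: v for k, v in state_dict.items() if not k.startswith("lm_head.")}
```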

For example, with LTX2 the same Gemma3 12B model can be used as both text encoder and prompt enhancer:

[image]

Comment thread comfy/sd1_clip.py
        return self.transformer.load_state_dict(sd, strict=False, assign=getattr(self, "can_assign_sd", False))

    def generate(self, tokens, do_sample, max_length, temperature, top_k, top_p, min_p, repetition_penalty, seed, stop_tokens=[]):
        if isinstance(tokens, dict):
Member:
Why do you need to handle dicts?

Comment thread comfy/sd1_clip.py Outdated
        return {}

    def decode(self, token_ids, skip_special_tokens=True):
        if torch.is_tensor(token_ids):
Member:
To make things consistent and easier the token_ids should always be in a single data type. If they can be both lists or tensors it makes things less maintainable.

Collaborator (Author):
True, it can always stay as a list of ints; that's cleaner. Fixed.
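The agreed convention can be illustrated with a toy vocabulary standing in for the real tokenizer (the ids and strings here are made up for the example):

```python
# Toy vocabulary standing in for the real tokenizer (hypothetical ids).
VOCAB = {0: "<pad>", 1: "<eos>", 7: "Hello", 8: ",", 9: " world"}
SPECIAL = {0, 1}

def decode(token_ids, skip_special_tokens=True):
    # token_ids is always a plain list of ints; callers that hold a
    # tensor convert it once with tensor.tolist() before calling decode,
    # so this function never needs to branch on the input type.
    if skip_special_tokens:
        token_ids = [t for t in token_ids if t not in SPECIAL]
    return "".join(VOCAB[t] for t in token_ids)
```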

        comfy.ops.uncast_bias_weight(module, weight, None, offload_stream)
        return x

    def generate(self, embeds=None, do_sample=True, max_length=256, temperature=1.0, top_k=50, top_p=0.9, min_p=0.0, repetition_penalty=1.0, seed=42, stop_tokens=[], initial_tokens=[], execution_dtype=None, min_tokens=0):
Reviewer:
Where did you get these default numbers?

Collaborator (Author):
They are just placeholders; the actual defaults come from the node that calls this.
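For reference, the sampling parameters in this signature chain together roughly as in this single-step sketch (pure Python for clarity; the PR's actual implementation operates on tensors and may apply the filters in a different order):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=50, top_p=0.9, min_p=0.0,
                repetition_penalty=1.0, previous=(), seed=None):
    """One decoding step over raw logits: repetition penalty,
    temperature, then min_p / top_k / top_p filtering and sampling."""
    logits = list(logits)
    # Repetition penalty (CTRL-style): dampen already-generated tokens.
    for t in set(previous):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # min_p: drop tokens below min_p * (probability of the best token).
    keep = [i for i in order if probs[i] >= min_p * probs[order[0]]]
    # top_k: keep at most k candidates.
    if top_k > 0:
        keep = keep[:top_k]
    # top_p (nucleus): smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in keep:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the renormalized surviving candidates.
    total = sum(probs[i] for i in kept)
    r = random.Random(seed).random() * total
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```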

Comment thread comfy/text_encoders/lt.py
            images = []
        else:
            samples = image.movedim(-1, 1)
            total = int(896 * 896)
Reviewer:
Why 896?

Collaborator (Author):
It's the default for Gemma3, as stated on their model page: "Images, normalized to 896 x 896 resolution and encoded to 256 tokens each".

Comment thread comfy/text_encoders/lt.py
        embed_count = 0
        for r in text_tokens:
            for i, token in enumerate(r):
                if token[0] == 262144 and embed_count < len(images):
Reviewer:
Why 262144?

Collaborator (Author):
This is the token id for <image_soft_token>.
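Conceptually, each occurrence of that id is replaced by the embeddings of the next image, in order; a simplified sketch (the real code splices embedding tensors rather than flat lists):

```python
IMAGE_SOFT_TOKEN = 262144  # Gemma3 token id for <image_soft_token>

def splice_image_embeds(token_ids, image_embeds):
    """Replace each <image_soft_token> id with the next image's
    embedding sequence, consuming images in order."""
    out, idx = [], 0
    for t in token_ids:
        if t == IMAGE_SOFT_TOKEN and idx < len(image_embeds):
            out.extend(image_embeds[idx])
            idx += 1
        else:
            out.append(t)
    return out
```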

Comment thread comfy/text_encoders/lt.py Outdated
    def generate(self, tokens, do_sample, max_length, temperature, top_k, top_p, min_p, repetition_penalty, seed):
        tokens_only = [[t[0] for t in b] for b in tokens]
        embeds, _, _, embeds_info = self.process_tokens(tokens_only, self.execution_device)
        embeds = comfy.utils.normalize_image_embeddings(embeds, embeds_info, target_std=0.0156)
Reviewer:
Why 0.0156?

Collaborator (Author):
Hmm, this one could be done better; changed it to the proper calculation.
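0.0156 is close to 1/sqrt(4096), so a plausible "proper calculation" is a target std of 1/sqrt(hidden_dim); a sketch under that assumption (not necessarily the formula the PR landed on, and operating on a flat list instead of a tensor):

```python
import math

def normalize_to_std(values, hidden_dim):
    """Rescale embedding values so their std becomes 1 / sqrt(hidden_dim).

    Assumption: the hard-coded 0.0156 ~ 1 / sqrt(4096) was replaced by
    a value computed from the model's hidden dimension.
    """
    target_std = 1.0 / math.sqrt(hidden_dim)
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [v * (target_std / std) for v in values]
```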

@comfyanonymous comfyanonymous merged commit 6d11cc7 into Comfy-Org:master Feb 19, 2026
12 checks passed
@zwukong commented Feb 19, 2026

Great PR, very useful; we no longer need to load and call a VL model separately. What about Qwen VL 2.5 or 3 taking video as input, to describe a video?
