Description
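import os

import xfastertransformer as xft
from transformers import AutoTokenizer, TextStreamer

# Expand "~" explicitly rather than relying on the loaders to do it.
MODEL_PATH = os.path.expanduser("~/llms/model_xft")  # converted xFasterTransformer weights
TOKEN_PATH = os.path.expanduser("~/llms/model_hf")  # Hugging Face model directory (tokenizer files)
INPUT_PROMPT = "what is apple?"

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=True, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids

# The log below is printed during model loading; the process then dies with SIGILL.
model = xft.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)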
[INFO] SeqLen > FLASH_ATTN_THRESHOLD(8192) will enable FlashAttn.
[INFO] ENABLE_TUNED_COMM is enabled for faster reduceAdd.
[INFO] ENABLE_KV_TRANS is enabled for faster decoding.
[INFO] SINGLE_INSTANCE MODE.
Process finished with exit code 132 (interrupted by signal 4: SIGILL)
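
Exit code 132 is 128 + 4, and signal 4 is SIGILL (illegal instruction), which usually means the binary executed a CPU instruction the host processor does not support. A minimal diagnostic sketch for Linux follows. It decodes the exit code and reports whether a few instruction-set flags appear in `/proc/cpuinfo`. The list `FLAGS_OF_INTEREST` and the helper names are assumptions chosen for illustration, not something taken from the xFasterTransformer documentation; which flags the library actually requires for `dtype="bf16"` should be checked against its own docs.

```python
import signal
from pathlib import Path

# Assumption: flags that seem relevant to AVX-512 / bf16 kernels.
# These are Linux /proc/cpuinfo flag names, not an official requirement list.
FLAGS_OF_INTEREST = ("avx2", "avx512f", "avx512_bf16", "amx_bf16")


def decode_exit_code(code):
    """Return the signal name for a shell-style exit code above 128, else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text):
    """Map each flag of interest to True/False depending on whether the CPU lists it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    print("exit code 132 ->", decode_exit_code(132))
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for name, present in report(cpuinfo.read_text()).items():
            print(f"{name}: {'yes' if present else 'MISSING'}")
    else:
        print("/proc/cpuinfo not found; this check only works on Linux.")
```

If one of the listed flags is reported as missing on the machine that produced the log above, an unsupported instruction set is a plausible cause of the crash, although this check alone does not prove it.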