Hi,
I am trying to run inference with a Llama model in compile mode using transformers.pipeline(). This is the code I am using:
import torch
import transformers
from transformers import AutoTokenizer, LlamaForCausalLM

# Load the model and tokenizer
model = LlamaForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B", use_cache=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B", use_cache=True, truncation=True, padding="max_length", max_length=64, return_tensors="pt")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Compile the model before handing it to the pipeline
model = torch.compile(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    tokenizer=tokenizer,
    device_map="auto",
)

generation_config = {
    "num_beams": 1,
    "max_new_tokens": 32,
    "do_sample": True,
    "use_cache": True,
}

# input_prompt is a plain text prompt string
outputs = pipeline(input_prompt, **generation_config)
I expected torch.compile() to compile the model and print some compilation messages, but I am not seeing any. There is no error message either; the run just completes without any compilation output.
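From what I have read, compile-time logs are not printed by default and have to be enabled explicitly. Is something like the following the intended way to surface them? This is only a sketch based on my understanding of the TORCH_LOGS / torch._logging options, not something I have confirmed (the script name is just a placeholder):

# Option 1: enable dynamo logging via an environment variable before launching
#   TORCH_LOGS="dynamo,graph_breaks,recompiles" python <your_script>.py
# Option 2: enable the same logs programmatically (PyTorch >= 2.1)
import logging
import torch._logging

torch._logging.set_logs(dynamo=logging.INFO, graph_breaks=True, recompiles=True)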
Can you please guide me on what might be wrong in my code? How can I run inference with a Llama model in compile mode using transformers.pipeline()?
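Also, as a sanity check: would something like the following be a reasonable way to confirm that torch.compile() actually wrapped the model the pipeline is using? This is just a sketch based on my understanding that torch.compile() returns an OptimizedModule wrapper; torch._dynamo is a private namespace, so I am not sure the check is reliable across versions:

import torch
import torch._dynamo

# torch.compile() on an nn.Module should return an OptimizedModule wrapper,
# so the type of the compiled object indicates whether compilation is wired up.
print(type(model))
print(isinstance(model, torch._dynamo.eval_frame.OptimizedModule))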
Thanks