Hi,
I am trying to run inference with a Llama model in compile mode using transformers.pipeline(). This is the code I am using:
import torch
import transformers
from transformers import AutoTokenizer, LlamaForCausalLM

# Load the model and tokenizer
model = LlamaForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B", use_cache=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B", use_cache=True, truncation=True, padding="max_length", max_length=64, return_tensors="pt")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Compile the model before handing it to the pipeline
model = torch.compile(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    tokenizer=tokenizer,
    device_map="auto",
)

generation_config = {
    "num_beams": 1,
    "max_new_tokens": 32,
    "do_sample": True,
    "use_cache": True,
}

# input_prompt is a plain text prompt string
outputs = pipeline(input_prompt, **generation_config)
I expected torch.compile() to compile the model and print some compilation messages, but I am not seeing any. There is no error message either; the run just completes without any compilation output.
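From what I have read, compile-time logs are not printed by default and have to be enabled explicitly. Is something like the following the intended way to surface them? This is only a sketch based on my understanding of the TORCH_LOGS / torch._logging options, not something I have confirmed (the script name is just a placeholder):

# Option 1: enable dynamo logging via an environment variable before launching
#   TORCH_LOGS="dynamo,graph_breaks,recompiles" python <your_script>.py
# Option 2: enable the same logs programmatically (PyTorch >= 2.1)
import logging
import torch._logging

torch._logging.set_logs(dynamo=logging.INFO, graph_breaks=True, recompiles=True)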
Can you please guide me on what might be wrong in my code? How can I run inference with a Llama model in compile mode using transformers.pipeline()?
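Also, as a sanity check: would something like the following be a reasonable way to confirm that torch.compile() actually wrapped the model the pipeline is using? This is just a sketch based on my understanding that torch.compile() returns an OptimizedModule wrapper; torch._dynamo is a private namespace, so I am not sure the check is reliable across versions:

import torch
import torch._dynamo

# torch.compile() on an nn.Module should return an OptimizedModule wrapper,
# so the type of the compiled object indicates whether compilation is wired up.
print(type(model))
print(isinstance(model, torch._dynamo.eval_frame.OptimizedModule))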
Thanks