How to Load Llama-3.3-70B-Instruct Model in Float8 Precision?

Hi everyone,

I am currently using the following code snippet to load the Llama-3.3-70B-Instruct model in BF16 precision:

import transformers
import torch

model_id = "meta-llama/Llama-3.3-70B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},  # load the weights in BF16
    device_map="auto",  # shard the model across the available GPUs
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])  # the last message is the assistant's reply

However, I would like to load this model in Float8 (FP8) precision instead of BF16 or FP16, to roughly halve the memory footprint of the weights (about 70 GB in FP8 versus about 140 GB in BF16) and potentially improve inference throughput.

Could someone guide me on:

  1. Is it possible to load Llama-3.3-70B-Instruct in FP8 precision using the Hugging Face Transformers library?
  2. What modifications to the above code, or which additional libraries, are needed to enable FP8? (I have sketched the direction I was considering below; please correct it if it is wrong.)
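
For reference, this is what I had in mind based on the Transformers quantization docs: passing an FbgemmFp8Config through model_kwargs so the weights are quantized to FP8 on the fly at load time. This is only a sketch under my assumptions (a transformers release that ships FbgemmFp8Config, fbgemm-gpu and accelerate installed, and an FP8-capable GPU such as an H100), and I am not sure it is the right or only approach:

import transformers
import torch
from transformers import FbgemmFp8Config

model_id = "meta-llama/Llama-3.3-70B-Instruct"

# Assumption: FbgemmFp8Config quantizes the linear-layer weights to FP8 while
# the checkpoint is being loaded; fbgemm-gpu, accelerate, and a GPU with FP8
# support are required for this to work.
quantization_config = FbgemmFp8Config()

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,               # non-quantized modules stay in BF16
        "quantization_config": quantization_config,  # FP8 weight quantization
    },
    device_map="auto",
)

The chat-message and generation code would stay the same as above; I mainly want to confirm whether this is supported for Llama-3.3-70B-Instruct or whether there is a better-supported FP8 path.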

Thank you in advance for your help!