PyTorch Acceleration GPT-Fast

khalid-sharma · December 2, 2023, 3:59am

Hi,

I’m basically struggling to understand how to be able to use todays blogpost on accelerating LLM throughput.

I want to make sure I’m understanding or doing it properly but what I think I need to do is:

clone the repo
choose a model checkpoint to convert into a faster version using

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO

Then run that model using:

python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"

If this is wrong, I’d the proper way to do this because I get a long download for the weights and then an error on line 3 above regarding model.pth and the checkpoints

Furthermore, I don’t really understand if I can convert any open source llm in this way or just the ones tested etc; I saw some mention of mistral but I guess I dont really know the full process to be confident enough. Also might be a dumb question but I won’t see the results of this efficiency without an A100 too right? Appreciate any guidance.

If anyone could walk me through how I’m supposed to utilize this methodology the right way, I’d appreciate that immensely.

Documentation: Blog Post
Github: GitHub - pytorch-labs/gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

marksaroufim · December 2, 2023, 6:33am

The implementation is meant to be a hacky copy and paste one so right now it mostly supports Llama but it should be easy to support other implementations

What’s the error you run into when you call generate? You didn’t paste any logs

khalid-sharma · December 2, 2023, 6:54am

The error I have currently is related to not yet being approved by Meta via hugging face to use the templates, so was asking specifically about the process for when I am approved.

I understand that these are quantization methods and speed up methods so is the idea that once I have access I can use these files to make small faster models that I can reupload publicly to Huggingface for example or LM Studio?

marksaroufim · December 2, 2023, 7:03am

As long as you use the same kind of state dict as the one for llama2 you should be good