Hi,
I’m struggling to understand how to actually use today’s blog post on accelerating LLM throughput.
I want to make sure I’m understanding and doing it properly. What I think I need to do is (full sequence quoted after the list):
- clone the repo
- choose a model checkpoint to convert into a faster version using:
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO
- Then run that model using:
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
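For context, here is the full sequence I’m running end to end. I’m assuming the prepare script is what downloads the Hugging Face weights and writes the converted checkpoint to checkpoints/$MODEL_REPO/model.pth, and that logging in with huggingface-cli is what’s needed for the gated Llama-2 repo; please correct me if any of that is wrong:

# clone the repo and install its requirements
git clone https://github.com/pytorch-labs/gpt-fast.git
cd gpt-fast
pip install -r requirements.txt

# log in so the gated meta-llama weights can be downloaded (my assumption)
huggingface-cli login

# download the checkpoint and convert it into gpt-fast's format
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO

# run generation with torch.compile enabled
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"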
If this is wrong, I’d like to know the proper way to do this, because I get a long download for the weights and then an error on the third step above about model.pth and the checkpoints directory.
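For what it’s worth, this is how I’m checking whether the conversion actually produced anything (assuming prepare.sh is supposed to leave model.pth under checkpoints/$MODEL_REPO/):

ls -lh checkpoints/$MODEL_REPO/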
Furthermore, I don’t really understand whether I can convert any open-source LLM this way or just the ones that were tested; I saw some mention of Mistral, but I don’t know the full process well enough to be confident. Also, this might be a dumb question, but I won’t see the benefit of this speedup without an A100, right?
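In case it’s relevant, this is how I’m checking what GPU I actually have (just a generic check on my side, not something from the blog post):

nvidia-smi --query-gpu=name,memory.total --format=csv
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_name(0))"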
If anyone could walk me through how I’m supposed to use this methodology correctly, I’d appreciate it immensely.
Documentation: Blog Post
GitHub: pytorch-labs/gpt-fast (Simple and efficient pytorch-native transformer text generation in <1000 LOC of python)