How to use the transformers and accelerate libraries to improve model performance

Hi!

I’m trying to do the same thing as in this video https://www.youtube.com/watch?v=l21rWfwjUbw . You can see the final implementation of the code at minute 9:31 of the video if you want to skip the rest of the conversation.

I’m trying to do the same, but running DeepSeek R1. Here is my code right now:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    #device_map="auto",
    offload_state_dict=True,
)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

user_input = input("Insert your query: ")
inputs = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
encoded_inputs = tokenizer(inputs, return_tensors="pt").to(0)

with torch.no_grad():
    outputs = model(**encoded_inputs)

token_id = outputs.logits[0][-1].argmax()
answer = tokenizer.decode([token_id])

print(answer)

I’m using the 1.5B model for now because it’s more convenient to load and makes my programming easier, but once it works I want to try larger models like 32B or 70B, and see if I can also use speculative decoding with a 3B or 7B draft model in the long run.
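For reference, my understanding is that speculative (assisted) decoding in transformers looks roughly like the sketch below, where the small model drafts tokens and the large one verifies them. The checkpoint names are just the ones I plan to try and I haven’t actually run this yet:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Large target model plus a smaller draft model (checkpoints are placeholders).
target = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

prompt = tokenizer("Hello", return_tensors="pt").to(target.device)

# assistant_model switches generate() into assisted generation.
outputs = target.generate(**prompt, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))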

I made some adjustments to the code so it works with DeepSeek, but I’m getting this error when I run it:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

I’m using the ROCm version of PyTorch, but I also have an NVIDIA card, so I’m not sure whether that could be contributing to the issue. It seems the main problem is moving data between the GPU and the CPU.
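My reading of the index_select complaint is that the model weights (the embedding lookup) are still on the CPU while the input ids are on cuda:0, so I expected something like the sketch below to keep everything on one device (assuming the model actually fits in VRAM, which the 1.5B should):

device = torch.device("cuda")

# Load the model and move it entirely onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
).to(device)

# Put the input tensors on the same device as the model.
encoded_inputs = tokenizer(inputs, return_tensors="pt").to(model.device)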

Any help fixing this, or with understanding PyTorch better, would be much appreciated!

EDIT: By the way, the device_map="auto" line is commented out because the code crashes if I use it. Here is the error log for that:

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Do you see any issues using the auto device on your NVIDIA GPU?

Hi @ptrblck

For that, should I make another Python environment with the CUDA version of PyTorch installed? I downloaded the ROCm package, but I really want to be able to use both GPUs soon. Is there a way to detect multiple GPUs and select a specific one in my code? I’d like to use both in the future, but I’m not sure whether that’s possible with the current state of the art. I imagine it could be done with a worker node and a master node (one with CUDA and the other with ROCm), but I’m not sure whether I can also detect my NVIDIA card with the ROCm package. Is there some way to do that? I could compile PyTorch if needed, but I can’t find any documentation about doing this kind of thing on a single computer without using containers.
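Just to be concrete, this is a sketch of the kind of device selection I mean (I don’t know whether a ROCm build can ever see the NVIDIA card in the same process, so this is only what I’d like to end up with):

import torch

# List whatever GPUs the current PyTorch build (CUDA or ROCm) can see.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# Pick a specific device by index and use it explicitly.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")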

No, I don’t think you can mix and match ROCm with CUDA; at least I’m unaware of how the code could be compiled with both, as the guards in the backends define which code path will be used.

Okay @ptrblck, thanks for the reply! I’m going to make a virtual environment with the CUDA build, run the same code, and let you know if it works :slight_smile:

Hi @ptrblck

The code does work just fine with my NVIDIA card, exactly as it is, but my AMD GPU is giving me the same error. I tried making some adjustments to the code, so now it looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

if torch.cuda.is_available():
  print("You have CUDA  or ROCm support :D")
  device = torch.device("cuda")

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    #device_map="auto",
    device_map=device,
    offload_state_dict=True,
)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

user_input = input("Insert your query: ")
inputs = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
encoded_inputs = tokenizer(inputs, return_tensors="pt").to(0)

with torch.no_grad():
    outputs = model(**encoded_inputs)

token_id = outputs.logits[0][-1].argmax()
answer = tokenizer.decode([token_id])

print(answer)

As you can see, I just try to set the device manually, but it still gives me the same error as before. What should I do? I really want to use my AMD GPU because it’s better than my old NVIDIA one.
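The next thing I was planning to try is to stop hard-coding .to(0) and instead move the inputs to wherever the model actually ended up (untested on the AMD side, so just a guess):

# Send the inputs to the model's own device instead of a hard-coded index.
encoded_inputs = tokenizer(inputs, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**encoded_inputs)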

EDIT: BTW, I made two different Python environments, one with PyTorch with ROCm support and the other with CUDA support. I’m using the latest ROCm and the latest CUDA on my system.

It’s great to hear your NVIDIA GPU runs fine and thanks for confirming! I’m unfortunately not familiar with the AMD stack and thus don’t know why it’s failing or how to debug the issue.

Okay, thanks for letting me know. I’ll do my research and, if I think it’s a bug, I’ll open an issue on GitHub :slight_smile:

Thanks for your help anyway!
