Identifying if GPU cores are used

Hi,

I have got the setup described in the image.

I'm running either on Windows or WSL2. Although CUDA is available, I can't get LLM inference to run on the GPU.
The models appear to be loaded into GPU memory, but inference seems to run on the CPU.

Is there a way to monitor GPU activity?

I installed torch via pip using the extra index URL for the CUDA build.

nvidia-smi should show the activity. As a quick check, move a large tensor to the device, confirm it actually landed there by checking its .device attribute, and then run a matmul in a loop while watching the utilization.
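
Here is a minimal sketch of that check (assuming the CUDA build of torch is installed; the matrix size is arbitrary, just large enough to keep the GPU busy for a few seconds):

import torch

# Confirm the CUDA build of torch can see the GPU at all.
print(torch.__version__, torch.version.cuda)   # cuda version is None on a CPU-only build
print(torch.cuda.is_available())               # should be True
print(torch.cuda.get_device_name(0))

# Move large tensors to the GPU and verify they landed there.
a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")
print(a.device)                                # should print cuda:0

# Run matmuls in a loop; in another terminal, nvidia-smi should show
# GPU-Util jumping close to 100% while this runs.
for _ in range(200):
    c = a @ b
torch.cuda.synchronize()                       # wait for the queued GPU work to finish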

Yeah, it’s possible for a model to be loaded onto the GPU (so memory usage shows up) but still have inference happen on the CPU if the inputs or some parts of the model aren’t on the GPU.

To monitor actual GPU usage, you can use:

  • Windows: Task Manager > Performance tab > GPU section. By default the graphs show the 3D/Copy/Video engines, so CUDA compute may not show up there; switch one of the graph dropdowns to “Cuda” to see compute usage.
  • WSL2 or Linux: Run nvidia-smi in the terminal. It shows GPU memory usage, running processes, and GPU utilization (look at the “Volatile GPU-Util” column).
  • For a more detailed view, try nvidia-smi dmon or a tool like nvtop. You can also poll utilization from Python, as sketched below.
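
Something like this should work for polling from a script (a rough sketch; it assumes the nvidia-ml-py / pynvml package is installed, which exposes the same NVML interface nvidia-smi uses):

import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU

# Poll for ~10 seconds while inference runs in another process/terminal.
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}%  memory used: {mem.used / 1024**2:.0f} MiB")
    time.sleep(1)

pynvml.nvmlShutdown()

If the model is really only sitting in GPU memory, you'd see memory used stay high while GPU util stays near 0% during inference.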

Also, make sure both your model and your inputs are on the same device (cuda). A common issue is forgetting to move the input tensors to the GPU before inference:

model = model.to("cuda")
inputs = inputs.to("cuda")   # inputs must live on the same device as the model
output = model(inputs)
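
You can double-check where everything actually lives right before the forward pass:

print(next(model.parameters()).device)   # should print cuda:0
print(inputs.device)                     # should print cuda:0 as well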

If anything’s still unclear, feel free to share a code snippet; it might help track it down.