BERT Inference fails to run on GPU

I have a setup that looks something like this:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BERT()
model.to(device)
model.eval()

result = []
for data in batch:
    # unpack the batch and move each tensor to the GPU
    input_ids, token_ids, attention_mask = (t.to(device) for t in data)
    x = model(input_ids, token_ids, attention_mask)
    y = x.cpu().detach().numpy()
    result += convert_to_string(y)

I don’t see the GPU being used for computation; I checked with nvidia-smi as well. About 75% of the GPU memory is in use, but the GPU cores show no utilization. What am I doing wrong?

How are you checking whether the GPU is used for computation? To validate its usage you could check the utilization reported by nvidia-smi or look at the launched CUDA kernels in a profiler. In general, if the model and the input and output data are all on the GPU, the computation will also be performed on the GPU, as PyTorch won’t move data to the CPU without explicit to() calls.
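
For example, a quick check could look like the sketch below. It assumes your model and one batch of input_ids, token_ids, attention_mask are already defined as in your snippet, and it uses torch.profiler, which is available since PyTorch 1.8:

import torch
from torch.profiler import profile, ProfilerActivity

# Confirm that the parameters and inputs actually live on the GPU
print(next(model.parameters()).device)  # expected: cuda:0
print(input_ids.device)                 # expected: cuda:0

# Profile one forward pass and list the launched CUDA kernels
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        out = model(input_ids, token_ids, attention_mask)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

If the table shows CUDA kernel times, the forward pass is running on the GPU. Note that the utilization reported by nvidia-smi can look low for small batches, since the kernels finish quickly and the GPU sits idle between launches.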