Text classification unexpectedly slow

The points where your benchmarking indicates slowdowns are CUDA synchronization points. Any host-GPU scalar copy (e.g. accessing individual elements like loss.data[0]) and any transfer of a tensor that isn't in pinned memory (e.g. .cuda()) will make the CPU and GPU synchronize. Your CPU quickly gets through the model definition and then stops at one of these points, waiting for the GPU to finish computing the forward pass.
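
For illustration, here is a minimal sketch of both kinds of implicit synchronization (assuming a recent PyTorch, where loss.item() is the modern spelling of loss.data[0]; the model and sizes are made up):

import torch

model = torch.nn.Linear(300, 4).cuda()    # hypothetical tiny model
x = torch.randn(64, 300)                  # pageable (non-pinned) host memory

x_gpu = x.cuda()                          # blocking copy: CPU waits for the transfer
loss = model(x_gpu).sum()                 # queued on the GPU, returns immediately
value = loss.item()                       # scalar copy to host -> CPU blocks here

x_pinned = x.pin_memory()                 # pinned host memory enables
x_gpu = x_pinned.cuda(non_blocking=True)  # an asynchronous copy instead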

If you want to reliably benchmark the compute time of your model, do this:

import time
import torch

torch.cuda.synchronize()         # make sure all queued GPU work is done
start = time.perf_counter()      # get start time
output = model(input)
torch.cuda.synchronize()         # wait for the forward pass to finish
end = time.perf_counter()        # get end time

Could you explain what “unexpectedly slow” means? What is the runtime, and what did you expect? It seems that you’re using a convolutional kernel size of 3x300, which will be incredibly costly to compute (especially with 200 output channels); a rough sketch of the arithmetic is below.
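
As a rough sense of scale, here is a sketch of such a layer (the batch size and sequence length are assumptions; only the 3x300 kernel and 200 output channels come from your description):

import torch

# Hypothetical layer matching the sizes mentioned above: 1 input channel,
# 200 output channels, 3x300 kernel over a (seq_len, 300) embedding matrix.
conv = torch.nn.Conv2d(1, 200, kernel_size=(3, 300))

x = torch.randn(32, 1, 50, 300)   # batch of 32, seq_len 50 (assumed)
out = conv(x)                     # shape: (32, 200, 48, 1)

# Each output element costs 3*300 = 900 multiply-adds, so one forward
# pass is roughly 32 * 200 * 48 * 900 ≈ 2.8e8 MACs.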
