```python
# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)
```
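One caveat about this benchmark: CUDA operations are launched asynchronously, so host-side `datetime.now()` deltas can under-report the real copy cost. A minimal synchronized variant, as a sketch (the `is_available()` guard is mine, so it also runs on a CPU-only machine):

```python
import torch
from datetime import datetime

def timed_copies(iters=10, shape=(10, 10, 10, 10)):
    """Time host-to-device copies, synchronizing so the timings are honest."""
    durations = []
    for i in range(iters):
        x = torch.randn(*shape)
        if torch.cuda.is_available():
            torch.cuda.synchronize()      # drain any pending GPU work first
        t1 = datetime.now()
        if torch.cuda.is_available():
            x = x.cuda()
            torch.cuda.synchronize()      # wait until the copy really finishes
        durations.append(datetime.now() - t1)
    return durations

durations = timed_copies()
print(len(durations))
```

Without the second `synchronize()`, the timer can stop before the transfer has actually completed.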
For the pre-built PyTorch the result is fast enough, but for a more complicated example (one that uses something like `my_model.cuda()`) I get the no-kernel-image error: `RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device`.

After that I ran the simple test again, and now I get a very slow result on the GPU:
```
Found GPU0 GeForce 920MX which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052
```
So what should I do next? Probably change my laptop?!
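For reference, a hedged sketch of one common fix for the no-kernel-image error: a self-compiled build may simply not include kernels for this GPU's compute capability (5.0 for the GeForce 920MX). `TORCH_CUDA_ARCH_LIST` tells the PyTorch build which architectures to compile for; the install line is assumed to run inside a PyTorch source checkout:

```shell
# "no kernel image is available" usually means the build lacks kernels
# for this GPU's compute capability (5.0 here). Rebuild with it listed:
export TORCH_CUDA_ARCH_LIST="5.0"
echo "building for compute capability: $TORCH_CUDA_ARCH_LIST"
# python setup.py install   # assumed: run inside the pytorch source tree
```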
I didn’t look into the `.cuda()` implementation, but I guess there must be a `cudaMemcpy` somewhere.
A memory copy between CPU and GPU simply takes a long time (it should be a hardware interrupt) and is not likely to be accelerated by a few lines of code.
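To put rough numbers on that, a back-of-the-envelope sketch (the ~12 GB/s figure is an assumed PCIe 3.0 x16 bandwidth, not from this thread): the 10×10×10×10 float32 tensor is only 40 KB, so the raw transfer is a few microseconds, and the ~50 µs timings above are mostly per-call launch and driver overhead rather than copy time.

```python
# Approximate raw cost of copying the test tensor over PCIe.
tensor_bytes = 10 * 10 * 10 * 10 * 4     # float32 tensor: 40,000 bytes
pcie_bandwidth = 12e9                    # bytes/s, assumed PCIe 3.0 x16 figure
transfer_us = tensor_bytes / pcie_bandwidth * 1e6
print(f"{transfer_us:.1f} us")           # prints "3.3 us": overhead dominates
```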
Sorry, but may I ask where you got these numbers from? I’m not questioning your theory. I just personally found the original numbers reasonable, and am wondering if you saw better numbers using a previous version or another framework.
```python
# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)
```
and the result for my self-compiled torch on CUDA 9.1.85 is:
```
Found GPU0 GeForce 920MX which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052
```
but the same code for the CPU:

```python
for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    #x.cuda()
    print(i, datetime.now() - t1)
```
These are tiny fractions of a second (roughly 50 microseconds per copy). Given that you have to do this only once per iteration (or twice if you also have a target array), it is super negligible. In other words, after 20,000 iterations (which probably take minutes to hours depending on your architecture) you lose 1 second.
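The arithmetic behind that claim, as a quick check (assuming ~50 µs per copy, matching the timings above):

```python
per_copy_s = 50e-6        # ~50 microseconds per .cuda() copy
iters = 20_000
lost = per_copy_s * iters
print(f"{lost:.2f} s")    # prints "1.00 s": 1 second lost over 20,000 iterations
```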