Hey,
so I am trying to speed up my code using fp16 computation with autocast.
I wrote two small code examples (with and without autocast).
with:

from torch.cuda.amp import autocast

%%time
for batch, (X, y) in enumerate(dltr):
    with autocast():
        pred = model(X)
print('Memory Allocated: %0.2fMB, Memory Reserved: %0.2fMB \n'
      % (torch.cuda.memory_allocated() / 1e6, torch.cuda.memory_reserved() / 1e6))
and without:

%%time
for batch, (X, y) in enumerate(dltr):
    pred = model(X)
print('Memory Allocated: %0.2fMB, Memory Reserved: %0.2fMB \n'
      % (torch.cuda.memory_allocated() / 1e6, torch.cuda.memory_reserved() / 1e6))
Comparing the two in terms of time and memory yields these outputs:
with autocast:
Memory Allocated: 389.42MB, Memory Reserved: 2466.25MB
Wall time: 26.8 s
without:
Memory Allocated: 689.19MB, Memory Reserved: 2466.25MB
Wall time: 26.9 s
The input shape of X is (1024, 5, 5, 512).
As expected, using autocast yields (very) roughly a halving of the allocated memory, but no speedup. I also tested this using multiple timing loops (%%timeit). In all test runs, running without autocast is approximately the same speed. Am I doing something wrong here, or is it just that autocast does not always yield a considerable speedup?
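One thing worth ruling out on the timing side: CUDA kernels launch asynchronously, so a plain wall-time measurement can be skewed unless the GPU is synchronized around the measured region. A minimal sketch of a more robust timing pattern (with a hypothetical toy model and input standing in for my real ones; it falls back to CPU if no GPU is available):

```python
import time
import torch

# Hypothetical toy stand-ins, just to illustrate the timing pattern.
model = torch.nn.Conv2d(8, 8, kernel_size=2)
X = torch.randn(4, 8, 16, 16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, X = model.to(device), X.to(device)

def timed_forward(n_iters=10):
    # CUDA kernels launch asynchronously, so synchronize before
    # reading the clock on either side of the measured region.
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_iters):
            model(X)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

elapsed = timed_forward()
```

In my case the %%time results were consistent across runs, so I don't think async launch is the whole story, but I wanted to mention it.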
Given these results, it seems the only thing I can expect is to run my model with twice the batch size as before, while training takes the same time.
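For context, the snippets above only time the forward pass; the full training loop I would eventually run follows the standard AMP recipe (forward under autocast, backward through GradScaler). A runnable sketch with hypothetical toy stand-ins for my real model, optimizer, and data (it degrades to plain fp32 on CPU):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

# Hypothetical toy stand-ins, just to make the sketch self-contained.
model = torch.nn.Linear(512, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
X = torch.randn(32, 512, device=device)
y = torch.randint(0, 10, (32,), device=device)

scaler = GradScaler(enabled=use_amp)      # no-op when AMP is disabled
for step in range(3):
    opt.zero_grad()
    with autocast(enabled=use_amp):       # fp16 forward on GPU, fp32 otherwise
        loss = loss_fn(model(X), y)
    scaler.scale(loss).backward()         # scale loss to avoid fp16 gradient underflow
    scaler.step(opt)                      # unscales gradients, then optimizer step
    scaler.update()                       # adjust the scale factor for the next step
```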
My model contains a few dozen convolutions (with kernel size 2 and mostly up to 8 channels). The total number of parameters is about 2000. Maybe the model is just very small, and a speedup only becomes significant for larger kernel sizes or channel counts?
Thanks! Best, JZ