Hi @ptrblck, thanks! Below is the snippet:
```python
import time

import torch
from timm.models import create_model
from torch.cuda.amp import GradScaler, autocast  # GradScaler is unused in this timing loop


def train(model, input, amp_enable=False):
    torch.cuda.synchronize()
    time_start = time.time()
    for i in range(5):
        with autocast(enabled=amp_enable):
            out = model(input)
        out.sum().backward()
    torch.cuda.synchronize()
    print(f'time used: {time.time() - time_start}')


if __name__ == '__main__':
    model = create_model('vit_large_patch16_384',
                         pretrained=False,
                         num_classes=None,
                         drop_rate=0,
                         drop_path_rate=0.3)
    model.cuda().train()
    input = torch.rand(32, 3, 384, 384).cuda()

    # warmup, ignore
    train(model, input)

    print('----train with fp32----')
    train(model, input)

    print('----train with autocast----')
    train(model, input, amp_enable=True)

    print('----train with tf32----')
    # enable TF32 for matmuls and cuDNN convolutions
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    train(model, input)
```
You will have to install timm (https://github.com/rwightman/pytorch-image-models) first.
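Note that `GradScaler` is imported but never used above, since the benchmark only times forward + backward. For reference, a full AMP step would look roughly like this sketch (the SGD optimizer here is just an illustrative stand-in, not part of the benchmark):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Assumes `model` and `input` as in the snippet above; the optimizer is a
# hypothetical stand-in, not something the benchmark actually runs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()

for i in range(5):
    optimizer.zero_grad()
    with autocast():
        loss = model(input).sum()
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan grads
    scaler.update()                # adjusts the loss scale for the next iteration
```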
I ran this code with PyTorch 1.12 on an A100-80GB GPU and got:
time used: 17.40752673149109
----train with fp32----
time used: 11.999516010284424
----train with autocast----
time used: 4.917150497436523
----train with tf32----
time used: 3.436387538909912
The results show that TF32 is faster than autocast. Are there any mistakes in my code?
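In case the wall-clock timing itself is the problem, here is a CUDA-event variant of the timing function I could swap in (a sketch; same five iterations as above):

```python
import torch
from torch.cuda.amp import autocast

def train_events(model, input, amp_enable=False):
    # Same loop as train(), but timed with CUDA events instead of time.time().
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for i in range(5):
        with autocast(enabled=amp_enable):
            out = model(input)
        out.sum().backward()
    end.record()
    torch.cuda.synchronize()
    print(f'time used: {start.elapsed_time(end) / 1000:.3f} s')  # elapsed_time is in ms
```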