Mixed Precision

I am trying mixed precision for a tensor operation and find that the runtime is not improved; it is actually worse. Is there anything I missed?

torch.cuda.get_device_name()

'Tesla T4'

import torch
import time

a = torch.zeros(60000, 200, dtype=torch.float16, device='cuda')
b = torch.tensor([0.01, 0.03, 0.05, 0.07], dtype=torch.float32, device='cuda')

for i in range(5):
    start = time.time()
    for j in range(1000):
        a.normal_()
        d = a * b.view(-1, 1, 1)
    print(time.time() - start)

0.9754595756530762
1.9229979515075684
1.913642168045044
1.9154479503631592
1.9257619380950928

import torch
import time

a = torch.zeros(60000, 200, dtype=torch.float32, device='cuda')
b = torch.tensor([0.01, 0.03, 0.05, 0.07], dtype=torch.float32, device='cuda')
for i in range(5):
    start = time.time()
    for j in range(1000):
        a.normal_()
        d = a * b.view(-1, 1, 1)
    print(time.time() - start)

0.910468339920044
1.8464961051940918
1.8373432159423828
1.841585397720337
1.8456008434295654

Since CUDA operations are executed asynchronously, you would have to synchronize the code via torch.cuda.synchronize() before starting and before stopping the timer; otherwise you are mostly measuring how long it takes to launch the kernels, not how long they take to run, so the numbers are not comparable.
Also, do you intend to mix FP16 and FP32 in your first example, or should both tensors be FP16? As written, a is float16 and b is float32, so type promotion upcasts the product to float32, which adds a conversion on top of the multiply.
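A minimal sketch of a corrected timing loop, assuming the same shapes as in your snippets (the benchmark helper name is mine, not from your code; b is given the same dtype as a so no upcast happens, and the sketch falls back to CPU when no GPU is present just so it runs anywhere):

```python
import time
import torch

def benchmark(dtype, iters=1000):
    # Hypothetical helper reproducing the loop from the question,
    # with synchronization around the timed region.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    a = torch.zeros(60000, 200, dtype=dtype, device=device)
    # b matches a's dtype here, so the multiply stays in one precision
    b = torch.tensor([0.01, 0.03, 0.05, 0.07], dtype=dtype, device=device)

    if device == 'cuda':
        torch.cuda.synchronize()  # drain setup kernels before timing
    start = time.time()
    for _ in range(iters):
        a.normal_()
        d = a * b.view(-1, 1, 1)
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return time.time() - start

print(benchmark(torch.float32, iters=10))
```

For finer-grained GPU measurements you can also use torch.cuda.Event with record() and elapsed_time(), which times on the device itself instead of the host clock.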