.to('cuda') taking a super long time

device: GTX 1070
tensor info: variable shape, approximately [150, 6]
dtype: torch.int16 (I've also tested float32 and got the same results)

import time

from tqdm import tqdm

tq = tqdm(trainset)
for i in tq:
    optimizer.zero_grad()

    tic = time.time()
    pred = model(i[0].to(device))
    toc = time.time()

    # This transfer is what is taking so long
    i[1] = i[1].cuda()  # .to(device) behaves the same
    tooc = time.time()

    loss = test(pred, i[1])
    loss[0].backward()
    optimizer.step()

    times = loss[1]
    times['inference'] = int((toc - tic) * 1000)
    times['to cuda!?!?'] = int((tooc - toc) * 1000)
    tq.set_postfix(times)

output (all times in ms):
build anchors=40, big loop=114, inference=90, to cuda!?!?=573

See if

i[1] = i[1].cuda(non_blocking=True)

makes a difference.

I think that for non_blocking=True to actually help, the source tensor has to live in pinned (page-locked) host memory, though I am not entirely sure. So if merely setting non_blocking=True gives no improvement, and you do use a DataLoader object to construct trainset, then try passing pin_memory=True to its constructor as well.
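A minimal sketch of the combination (assuming trainset is built from a Dataset of (input, target) pairs; the my_dataset name and batch size are placeholders):

from torch.utils.data import DataLoader

# pin_memory=True makes the loader return batches in page-locked host memory,
# which is what allows non_blocking=True to overlap the host-to-device copy
# with GPU compute instead of blocking on it.
loader = DataLoader(my_dataset, batch_size=1, shuffle=True, pin_memory=True)

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward/backward as before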

CUDA operations are executed asynchronously, so you would need to synchronize the code before starting and stopping each timer via torch.cuda.synchronize(). Otherwise the time spent in earlier, still-running kernels gets accumulated into the next blocking operation (here, the .cuda() copy), so the reported times are wrong: your 573 ms is most likely the forward pass finishing, not the transfer itself.
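For example, to measure just the copy (a sketch reusing the variables from your loop):

torch.cuda.synchronize()  # wait for the queued forward pass to finish
tic = time.time()
i[1] = i[1].cuda()
torch.cuda.synchronize()  # wait for the copy itself to complete
toc = time.time()
times['to cuda'] = int((toc - tic) * 1000)

Alternatively, torch.cuda.Event(enable_timing=True) with record() and elapsed_time() gives GPU-side timings without stalling the host at every step.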