PyTorch copying to GPU is slow

Transfers to the GPU seem to be slow - I'm wondering whether I can do anything with the snippet of code below to speed things up.

Gpu.IDS is just an array. Example: IDS = [4, 5, 6]

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader

device = torch.device('cuda:' + str(Gpu.IDS[0]) if torch.cuda.is_available() else 'cpu')
print('running on', device)
model = model.to(device)

if torch.cuda.device_count() > 1:
    print("Let's use multiple GPUs:", len(Gpu.IDS), "out of", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model, device_ids=Gpu.IDS)

tokens = train

dataset = NuscenesDataset(tokens, helper)
dataloader = DataLoader(dataset, batch_size=16*len(Gpu.IDS), num_workers=4 * len(Gpu.IDS))

n_iter = 0
minimum_loss = 0
loss_function = MTPLoss(NUM_MODES, 1, 5)
current_loss = 10000
model_dir = make_model_dir()
learning_rates = [0.1]


for lr in learning_rates:
    # optimizer = optim.SGD(model.parameters(), lr=lr, momentum = 0.9)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    print('--------- LEARNING RATE ', lr, '--------------')
    for epoch in range(1, 10001): 
        # print below - only for large dataset
        # print('-> epoch:', epoch)
        for img, agent_state_vector, ground_truth, _, _ in dataloader:
            # imshow(torchvision.utils.make_grid(img))
            img = img.to(device)
            agent_state_vector = agent_state_vector.to(device)
            ground_truth = ground_truth.to(device)
            ground_truth = ground_truth.to(torch.float32)
            optimizer.zero_grad()

            prediction = model(img, agent_state_vector)
            # print('prediction.shape', prediction.shape, 'ground_truth.shape', ground_truth.shape)
            loss = loss_function(prediction, ground_truth)
            loss.backward()
            optimizer.step()

            current_loss = loss.cpu().detach().numpy()

Hi,

Why do you think the transfer is slow?
Keep in mind that the CUDA API is asynchronous, except when it needs to deal with CPU values. So if you measure without manually synchronizing via torch.cuda.synchronize(), the copies will only appear slow because, since they deal with CPU values, they have to wait for the rest of the queued computation to finish before they can execute.
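
For example, a rough way to time just the host-to-device copy looks like this (a minimal sketch; the tensor shape and the 'cuda:0' device index are placeholders, not taken from your setup):

import time
import torch

x = torch.randn(16, 3, 500, 500)   # stand-in for one CPU batch
device = torch.device('cuda:0')    # placeholder device index

torch.cuda.synchronize(device)     # wait for any previously queued GPU work first
start = time.time()
x_gpu = x.to(device)               # the copy being measured
torch.cuda.synchronize(device)     # make sure the copy has actually finished
print('copy took', time.time() - start, 'seconds')

The first synchronize() is the important one here: it makes sure that GPU work queued earlier (the forward/backward pass, for example) is not silently attributed to the copy.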

Hi,

I guess my question is more along the lines of whether there is an obvious red flag in my setup.

Hi,

No, the code looks OK.
The one thing is that you can collapse this:

            ground_truth = ground_truth.to(device)
            ground_truth = ground_truth.to(torch.float32)

into

            ground_truth = ground_truth.to(device=device, dtype=torch.float32)
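
Doing the device move and the dtype conversion in a single call should also avoid materializing a separate intermediate tensor for each .to().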

Thanks - much appreciated