Speeding up GPU training on a CNN

Hi
I'm training a semantic segmentation model and finding that training is very slow. I'm using a custom dataloader, and my images are about 350×400 pixels. Each training instance takes about 1.3 s, and I'm running on a Tesla P100 on Google Cloud.
I profiled my training code with time.time and found that transferring the image data to the GPU was taking most of the time, almost 1 s.
Can someone give me ideas on how to speed this up? Should I use PyTorch's DataLoader with pin_memory? (Rough sketch of what I mean below, after the training function.)
I've attached the training function below; the lines taking a long time are marked.

import math

import numpy as np
from torch.autograd import Variable
from tqdm import tqdm

import settings  # project config holding opt['batch_size']


def train(trainable_model, train_data, optimizer, epoch, criterion):
    total_train_data = len(train_data)
    batch_indices = np.array_split(np.random.permutation(total_train_data),
                                   math.ceil(total_train_data / settings.opt['batch_size']))

    trainable_model.train()
    total_train_loss = 0

    trainable_model = trainable_model.cuda()
    criterion = criterion.cuda()

    for batch_index, indices in enumerate(tqdm(batch_indices)):
        optimizer.zero_grad()
        for idx, index in enumerate(tqdm(indices)):
            rgb, mask, filename, humanImg = train_data[index]
            var_rgb = Variable(rgb.unsqueeze(0).float())
            var_mask = Variable(mask.float())
            var_mask = var_mask.cuda()   # <--- taking 5 ms
            var_rgb = var_rgb.cuda()     # <--- taking 1 sec

            output = trainable_model(var_rgb)
            # divide by the batch size since gradients are accumulated one sample at a time
            loss = criterion(output, var_mask.unsqueeze(0).long()) / len(indices)
            total_train_loss += loss.data[0]
            loss.backward()

        del var_mask, var_rgb  # after every batch, delete references to the GPU tensors
        optimizer.step()
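
This is roughly what I had in mind with the DataLoader / pin_memory idea (only a sketch, not something I've tested; the worker count is just a guess):

from torch.utils.data import DataLoader

# Sketch: let the built-in DataLoader do the shuffling/batching instead of
# np.array_split, and load images in background worker processes.
train_loader = DataLoader(train_data,
                          batch_size=settings.opt['batch_size'],
                          shuffle=True,
                          num_workers=4,     # decode/preprocess images off the main process
                          pin_memory=True)   # page-locked host memory -> faster copies to the GPU

for rgb, mask, filename, humanImg in train_loader:
    var_rgb = rgb.float().cuda()    # with pin_memory=True the host->device copy is faster;
    var_mask = mask.float().cuda()  # on newer PyTorch, .cuda(non_blocking=True) can overlap it with compute
    # ... rest of the training step (forward, loss, backward) on the whole batch at once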

Hi,

Have you used torch.cuda.synchronize() properly when measuring times? The CUDA API is asynchronous, so the timings are not accurate unless you synchronize explicitly.
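
For example, something along these lines (model / inp are just placeholders): synchronize before starting the timer and again before reading it, so the queued kernels have actually finished.

 torch.cuda.synchronize()
 t0 = time.perf_counter()
 out = model(inp)            # whatever op you want to time
 torch.cuda.synchronize()    # wait for the CUDA work to complete
 print(time.perf_counter() - t0)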

Hi albanD
I used torch.cuda.synchronize() and found that what was actually taking the most time was the model call itself, not the transfers to the GPU.
I did

 torch.cuda.synchronize()
 a = time.perf_counter()
 var_mask = var_mask.cuda()
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

 torch.cuda.synchronize()
 a = time.perf_counter()
 var_rgb = var_rgb.cuda()
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

 torch.cuda.synchronize()
 a = time.perf_counter()
 output = trainable_model(var_rgb)
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

and got (for the mask transfer, the rgb transfer, and the model forward pass respectively):
0.00036394898779690266
0.0005174019897822291
0.8291985339892562

So it looks like the forward pass through the model is what's taking most of the time. Is it possible to speed this up? I have 4 GPUs on this machine, so running it in parallel should make it faster, right?
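
This is what I was thinking of for the multi-GPU part (only a sketch, I haven't tried it; the batch shape is made up):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = trainable_model.cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model on each visible GPU

# DataParallel splits the input along dim 0 across the GPUs, so it only helps
# if the model gets a real batch rather than one image at a time.
dummy_batch = Variable(torch.randn(8, 3, 350, 400).cuda())  # made-up shape, roughly my image size
output = model(dummy_batch)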

thanks