Speeding up GPU training on a CNN

Hi
I'm training a semantic segmentation model and finding that training is very slow. I'm using a custom dataloader, and my images are about 350×400 pixels. Each training instance takes about 1.3 s, and I'm running on a Tesla P100 on Google Cloud.
I profiled my training code with time.time and found that transferring the image data to the GPU was taking most of the time, almost 1 s.
Can someone give me ideas on how to speed this up? Should I use PyTorch's DataLoader with pin_memory? (Rough sketch of what I mean below, after the training function.)
I've attached the training function below; the lines taking a long time are marked.

import math

import numpy as np
from torch.autograd import Variable
from tqdm import tqdm

import settings  # project config holding opt['batch_size']


def train(trainable_model, train_data, optimizer, epoch, criterion):
    total_train_data = len(train_data)
    batch_indices = np.array_split(np.random.permutation(total_train_data),
                                   math.ceil(total_train_data / settings.opt['batch_size']))

    trainable_model.train()
    total_train_loss = 0

    trainable_model = trainable_model.cuda()
    criterion = criterion.cuda()

    for batch_index, indices in enumerate(tqdm(batch_indices)):
        optimizer.zero_grad()
        for idx, index in enumerate(tqdm(indices)):
            rgb, mask, filename, humanImg = train_data[index]
            var_rgb = Variable(rgb.unsqueeze(0).float())
            var_mask = Variable(mask.float())
            var_mask = var_mask.cuda()   # <--- taking 5 ms
            var_rgb = var_rgb.cuda()     # <--- taking 1 sec

            output = trainable_model(var_rgb)
            # divide by the batch size since gradients are accumulated one sample at a time
            loss = criterion(output, var_mask.unsqueeze(0).long()) / len(indices)
            total_train_loss += loss.data[0]
            loss.backward()

        del var_mask, var_rgb  # after every batch, delete references to the GPU tensors
        optimizer.step()
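
This is roughly what I had in mind with the DataLoader / pin_memory idea (only a sketch, not something I've tested; the worker count is just a guess):

from torch.utils.data import DataLoader

# Sketch: let the built-in DataLoader do the shuffling/batching instead of
# np.array_split, and load images in background worker processes.
train_loader = DataLoader(train_data,
                          batch_size=settings.opt['batch_size'],
                          shuffle=True,
                          num_workers=4,     # decode/preprocess images off the main process
                          pin_memory=True)   # page-locked host memory -> faster copies to the GPU

for rgb, mask, filename, humanImg in train_loader:
    var_rgb = rgb.float().cuda()    # with pin_memory=True the host->device copy is faster;
    var_mask = mask.float().cuda()  # on newer PyTorch, .cuda(non_blocking=True) can overlap it with compute
    # ... rest of the training step (forward, loss, backward) on the whole batch at once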

Hi,

Have you used torch.cuda.synchronize() properly when measuring times? The CUDA API is asynchronous, so the timings are not accurate unless you synchronize explicitly.
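
For example, something along these lines (model / inp are just placeholders): synchronize before starting the timer and again before reading it, so the queued kernels have actually finished.

 torch.cuda.synchronize()
 t0 = time.perf_counter()
 out = model(inp)            # whatever op you want to time
 torch.cuda.synchronize()    # wait for the CUDA work to complete
 print(time.perf_counter() - t0)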

Hi albanD
I used torch.cuda.synchronize() and found that what was actually taking the most time was the model call itself, not the transfers to the GPU.
I did

 torch.cuda.synchronize()
 a = time.perf_counter()
 var_mask = var_mask.cuda()
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

 torch.cuda.synchronize()
 a = time.perf_counter()
 var_rgb = var_rgb.cuda()
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

 torch.cuda.synchronize()
 a = time.perf_counter()
 output = trainable_model(var_rgb)
 torch.cuda.synchronize()
 b = time.perf_counter()
 print(b - a)

and got (for the mask transfer, the rgb transfer, and the model forward pass respectively):
0.00036394898779690266
0.0005174019897822291
0.8291985339892562

So it looks like the forward pass through the model is what's taking most of the time. Is it possible to speed this up? I have 4 GPUs on this machine, so running it in parallel should make it faster, right?
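
This is what I was thinking of for the multi-GPU part (only a sketch, I haven't tried it; the batch shape is made up):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = trainable_model.cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model on each visible GPU

# DataParallel splits the input along dim 0 across the GPUs, so it only helps
# if the model gets a real batch rather than one image at a time.
dummy_batch = Variable(torch.randn(8, 3, 350, 400).cuda())  # made-up shape, roughly my image size
output = model(dummy_batch)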

thanks