Estimated time of an epoch

How can I calculate the estimated time of an epoch?


You could start a timer before training for one epoch and calculate the elapsed time afterwards.

import time

def train(epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        # your training code ...
        pass


t0 = time.time()
train(1)
print('{} seconds'.format(time.time() - t0))
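
If you don't want to wait for a full epoch, you could also time the first few batches and extrapolate. A minimal sketch, assuming your existing train_loader and a hypothetical train_step helper that runs one forward/backward/optimizer step:

import time

n_batches = 10  # estimate from the first 10 batches

t0 = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    train_step(data, target)  # hypothetical helper: one forward/backward/optimizer step
    if batch_idx + 1 == n_batches:
        break

time_per_batch = (time.time() - t0) / n_batches
print('~{:.1f} seconds per epoch (estimated)'.format(time_per_batch * len(train_loader)))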

Thank you,
I learned Keras first. I have seen a progress bar like the following:

[============================================================] 100%, 0 seconds.

It is shown immediately once an epoch starts.

I will use your solution, but I wonder if what I have described can be done using PyTorch.

I took this from some PyTorch examples and I like this style:

def train(epoch):
    '''
    Main training loop
    '''
    # Set model to train mode
    model.train()
    # Iterate training set
    for batch_idx, (data, mask) in enumerate(train_loader):
        if use_cuda:
            data = data.cuda()
            mask = mask.cuda()
        data = Variable(data)
        mask = Variable(mask.squeeze())
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output.squeeze(), mask)
        loss.backward()
        optimizer.step()
        
        if batch_idx % 10 == 0:
            loss_data = loss.item()  # .item() replaces the deprecated loss.data[0] indexing
            train_losses.append(loss_data)
            print(
                'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.
                format(epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss_data))


@ptrblck Is it essential to put model.train() before training the model, for all models?

@ptrblck Would calculating the elapsed time be the same when we are running the model on CUDA?

Generally yes. After calling model.train() some layers like nn.BatchNorm will change their behavior, e.g. by updating the running estimates and using the batch statistics to normalize your activations.
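
As a small self-contained sketch of that behavior with a BatchNorm layer:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
x = torch.randn(8, 10)

model.train()                  # batch statistics are used and the running estimates are updated
model(x)
print(model[1].running_mean)   # changed by the forward pass

model.eval()                   # running estimates are used, nothing is updated
model(x)
print(model[1].running_mean)   # unchanged by this forward pass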

Depending on your workload your training procedure should be faster running on the GPU. However, if your workload is quite small or if you have some bottlenecks in your code, the CPU time might be close to the GPU one.


Even when running WGAN-GP, should we call netG.train() and netD.train()?
I have a problem while running WGAN-GP: the first 10 epochs go quickly (1 day), but after that it takes 1 day per epoch. My data is 8,500 HDF5 images (3D) and I am running my model on only 1 GPU. I can't really tell whether it comes from the dataloader, from the network architecture, or whether this is normal.

It seems you are seeing a slowdown in training, which shouldn't depend on the training/eval mode of the model.

Could you adapt the data loading measurement from the ImageNet example?
data_time should go towards zero if you don’t have a data loading bottleneck.
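
A minimal sketch of that measurement, assuming the AverageMeter helper from the ImageNet example and your existing train_loader (the data loading time is the gap between the end of one iteration and the next batch being ready):

import time

data_time = AverageMeter('Data', ':6.3f')
batch_time = AverageMeter('Time', ':6.3f')

end = time.time()
for i, (data, target) in enumerate(train_loader):
    data_time.update(time.time() - end)   # time spent waiting for the next batch

    # ... forward / backward / optimizer step ...

    batch_time.update(time.time() - end)  # total time of this iteration
    end = time.time()

    if i % 10 == 0:
        print(data_time, batch_time)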

Also, are you the only user on this machine or could the machine be blocked by other processes during your training?
Do you see any increase in GPU memory usage during training?

So yes, I have adapted it and it is equal to zero (data_time: Data 0.000 (0.000)).
Nobody else is using the machine, and no, the GPU memory does not increase during the training process.

Does nvidia-smi show the same GPU utilization throughout your training?

Is the data loading time also zero after the training slows down and is the data stored locally on your machine?

Yes, nvidia-smi doesn't show the same utilization throughout training.
The data loading time is still zero, BUT the GPU memory usage increases from one epoch to the next. Yes, the data is stored locally.

That’s good to know.
Are you storing the loss or some other output of your model somewhere?
For debugging or printing purposes you should detach() the outputs or store loss.item(), e.g. in a list.
Otherwise, you will store the whole computation graph, which might eventually yield an out of memory error and might also slow down your code.
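
As an illustration (a minimal sketch, reusing the model, criterion, optimizer and train_loader names from the code above, not your actual training loop):

stored_with_graph = []   # each entry keeps the whole graph of its iteration alive
stored_values = []       # plain Python numbers, nothing to free later

for data, target in train_loader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    stored_with_graph.append(loss)     # don't do this
    stored_values.append(loss.item())  # do this (or output.detach() for non-scalar tensors)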

Could this be the case in your code?

OK, so I am storing loss.item(), yes, but I don't use detach() anywhere in my code.

Could you please show me a snippet of code indicating where I should put detach()?
Actually, I have even seen people on GitHub use detach() after the initialization of the noise.

If you are using item(), you won't need to call detach() additionally.
However, you would need to call detach() on other tensors you are storing and which are still connected to the computation graph.
Could you post a code snippet to reproduce this slow down and how long would it take to reach the slow down?
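
To see whether a tensor you are storing is still attached to the computation graph, you can look at its grad_fn attribute (a minimal self-contained sketch):

import torch

x = torch.randn(4, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)           # <SumBackward0 object ...> -> still attached to the graph
print(y.detach().grad_fn)  # None -> detached copy, safe to store
print(y.item())            # plain Python float, no graph at all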

How could I detect which tensors are still connected to the computation graph?
So I have printed time.time() after each step and I have found the following:

  • In the first epoch it takes 1 second to do the discriminator backpropagation and almost zero for the generator backpropagation.
  • After 28 epochs it takes 3 seconds for the discriminator backpropagation and 4 seconds for the generator backpropagation.
    Actually, I use loss.item() while writing to TensorBoard; does this make a difference?
for epoch in range(opt.niter):
    print ("epoch||||||||||||||||", time.time())
    for i, batch in enumerate(dataloader,0):
        data_time = AverageMeter('Data', ':6.3f')
        print("data_time",data_time)
        print ("iteration============", time.time())
        for iter in range(3):
            print ("discriminator_iteration", time.time())
            f = open(work_dir+"training_curve.csv", "a")
            autograd.detect_anomaly()       
            ############################
            # (1) Update D network
            ############################
            ### PART ---1--- training the discriminator on real data ####
            optimizerD.zero_grad()
            x = Variable(batch.type(Tensor))
            noise = Variable(Tensor(batch.size(0), nz, 1, 1, 1).normal_(0, 1))
            x_tilde = Variable(netG(noise), requires_grad=True)
            epsilon = Variable(Tensor(batch.size(0), 1, 1, 1, 1).normal_(0, 1))
            x_hat = epsilon*x + (1 - epsilon)*x_tilde
            x_hat = torch.autograd.Variable(x_hat, requires_grad=True)
            dw_x = netD(x_hat)
        
            grad_x = torch.autograd.grad(outputs=dw_x, inputs=x_hat,
                                        grad_outputs=Variable(Tensor(dw_x.size()).fill_(1.0), requires_grad=False),
                                        create_graph=True, retain_graph=True, only_inputs=True)

            grad_x = grad_x[0].view(batch.size(0), -1)
            grad_x = grad_x.norm(p=2, dim=1)
            d_loss = torch.mean(netD(x_tilde)) - torch.mean(netD(x)) + LAMBDA*torch.mean((grad_x - 1)**2) 
            gradient_penal=LAMBDA*torch.mean((grad_x - 1)**2)
            wgan_loss=torch.mean(netD(x_tilde)) - torch.mean(netD(x))
            print ("before back discriminator|||||||||||||||", time.time())
            d_loss.backward()
            print ("after back discriminator==============", time.time())
            for p in netD.parameters():
                p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
            print ("after gradient clipping|||||||||||", time.time())
            for p in list(filter(lambda p: p.grad is not None, netD.parameters())):
                print("Discriminator",p.grad.data.norm(2).item())
            optimizerD.step()
        ###########################
        # (2) Update G network
        ###########################
        optimizerG.zero_grad()
        print ("before generator iter||||||||||||", time.time())
        noise = Variable(Tensor(batch.size(0), nz, 1, 1,1).normal_(0, 1))
        imgs_fake = netG(noise)
        g_loss = -torch.mean(netD(imgs_fake)) ##minimiser
        print ("before back generator==========", time.time())
        g_loss.backward()
        print ("after back generator||||||||||||", time.time())
        for p in netG.parameters():
            p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
        for p in list(filter(lambda p: p.grad is not None, netG.parameters())):
            print("Generator",p.grad.data.norm(2).item())
        print ("befor weight update generateur===========", time.time())
        optimizerG.step()
       ## TensorBoard
        writer.add_scalar('G_loss', g_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('D_loss', d_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('gradient_penal', gradient_penal.item(), i + epoch * len(dataloader))
        writer.add_scalar('wgan_loss', wgan_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('fake_loss', fake_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('real_loss', real_loss.item(), i + epoch * len(dataloader))
        f.write('[%d/%d][%d/%d] Loss_D: %.4f Gradient_p: %.4f WGAN_loss: %.4f Loss_G: %.4f'
            % (epoch, opt.niter, i, len(dataloader),
                d_loss.data, gradient_penal.data, wgan_loss.data, g_loss.data))
        f.write('\n')
        f.close()

Try tqdm.

Example

from tqdm import tqdm

for i, (images, targets) in tqdm(
        enumerate(self._train_loader),
        total=len(self._train_loader),
        leave=False):
    ...  # your training step

It will show something like this:
89%|######################################################### | 534/599 [07:56<00:57, 1.13it/s]
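
If you want the current loss in the bar as well, you could wrap the loader directly and update the postfix. A minimal sketch, assuming your existing train_loader and a hypothetical train_step helper that returns the loss tensor:

from tqdm import tqdm

pbar = tqdm(train_loader, desc='Epoch {}'.format(epoch), leave=False)
for data, target in pbar:
    loss = train_step(data, target)  # hypothetical helper: forward/backward/step
    pbar.set_postfix(loss='{:.4f}'.format(loss.item()))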

Thanks for the code!

  • autograd.detect_anomaly() should be quite expensive, so if your code runs fine, I would recommend removing this call.
  • Variable(batch.type(Tensor)): Variables are deprecated since 0.4.0, so you can just use tensors in newer versions. Also, if you want to convert the dtype, you should use to() or directly call the type as e.g. float().
  • I haven’t debugged your code, but do you really need to retain the graph in grad_x? I’m a bit worried it might be stored unnecessarily.
  • Since you are running the code on a GPU and CUDA calls are asynchronous, you should call torch.cuda.synchronize() before starting and stopping the timer (see the sketch below). Otherwise your profiling might yield wrong results, e.g. by only measuring the kernel launch times.
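
A minimal timing sketch with the synchronization calls, assuming your model, criterion and the data/target tensors are already on the GPU:

import time
import torch

torch.cuda.synchronize()   # make sure all pending GPU work is done before starting the timer
t0 = time.time()

output = model(data)
loss = criterion(output, target)
loss.backward()

torch.cuda.synchronize()   # wait for the GPU to finish before stopping the timer
print('{:.3f} seconds'.format(time.time() - t0))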

Thanks for your response!
This has accelerated the training process.