Estimated time of an epoch

How can I calculate the estimated time of an epoch?


You could start a timer before training for one epoch and calculate the elapsed time afterwards.

import time

def train(epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        # your training code ...
        pass


t0 = time.time()
train(1)
print('{} seconds'.format(time.time() - t0))
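
If you don't want to wait for a full epoch, you could also time the first few batches and extrapolate. A minimal sketch, assuming your existing train_loader and a hypothetical train_step helper that runs one forward/backward/optimizer step:

import time

n_batches = 10  # estimate from the first 10 batches

t0 = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    train_step(data, target)  # hypothetical helper: one forward/backward/optimizer step
    if batch_idx + 1 == n_batches:
        break

time_per_batch = (time.time() - t0) / n_batches
print('~{:.1f} seconds per epoch (estimated)'.format(time_per_batch * len(train_loader)))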

Thank you,
I learned Keras first. I have seen a progress bar like the following:

[============================================================] 100%, 0 seconds.

It is shown immediately once an epoch starts.

I will use your solution, but I wonder if what I have described can be done using PyTorch.

I took this from some PyTorch examples and I like this style:

def train(epoch):
    '''
    Main training loop
    '''
    # Set model to train mode
    model.train()
    # Iterate training set
    for batch_idx, (data, mask) in enumerate(train_loader):
        if use_cuda:
            data = data.cuda()
            mask = mask.cuda()
        data = Variable(data)
        mask = Variable(mask.squeeze())
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output.squeeze(), mask)
        loss.backward()
        optimizer.step()
        
        if batch_idx % 10 == 0:
            loss_data = loss.item()  # .item() replaces the deprecated loss.data[0] indexing
            train_losses.append(loss_data)
            print(
                'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.
                format(epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss_data))


@ptrblck Is it essential to put model.train() before training the model, for all models?

@ptrblck Would calculating the elapsed time be the same when we are running the model on CUDA?

Generally yes. After calling model.train() some layers like nn.BatchNorm will change their behavior, e.g. by updating the running estimates and using the batch statistics to normalize your activations.
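
As a small self-contained sketch of that behavior with a BatchNorm layer:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
x = torch.randn(8, 10)

model.train()                  # batch statistics are used and the running estimates are updated
model(x)
print(model[1].running_mean)   # changed by the forward pass

model.eval()                   # running estimates are used, nothing is updated
model(x)
print(model[1].running_mean)   # unchanged by this forward pass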

Depending on your workload your training procedure should be faster running on the GPU. However, if your workload is quite small or if you have some bottlenecks in your code, the CPU time might be close to the GPU one.


Even when running WGAN-GP, should we call netG.train() and netD.train()?
I have a problem while running WGAN-GP: the first 10 epochs go quickly (1 day), but after that it takes 1 day per epoch. My data is 8,500 HDF5 images (3D) and I am running my model on only 1 GPU. I can't really tell whether it comes from the dataloader, from the network architecture, or whether this is normal.

It seems you are seeing a slowdown in training, which shouldn't depend on the training/eval mode of the model.

Could you adapt the data loading measurement from the ImageNet example?
data_time should go towards zero if you don’t have a data loading bottleneck.
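
A minimal sketch of that measurement, assuming the AverageMeter helper from the ImageNet example and your existing train_loader (the data loading time is the gap between the end of one iteration and the next batch being ready):

import time

data_time = AverageMeter('Data', ':6.3f')
batch_time = AverageMeter('Time', ':6.3f')

end = time.time()
for i, (data, target) in enumerate(train_loader):
    data_time.update(time.time() - end)   # time spent waiting for the next batch

    # ... forward / backward / optimizer step ...

    batch_time.update(time.time() - end)  # total time of this iteration
    end = time.time()

    if i % 10 == 0:
        print(data_time, batch_time)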

Also, are you the only user on this machine or could the machine be blocked by other processes during your training?
Do you see any increase in GPU memory usage during training?

So yes, I have adapted it and it is equal to zero (data_time: Data 0.000 (0.000)).
Nobody else is using the machine, and no, the GPU memory does not increase during the training process.

Does nvidia-smi show the same GPU utilization throughout your training?

Is the data loading time also zero after the training slows down and is the data stored locally on your machine?

Yes, nvidia-smi doesn't show the same utilization throughout training.
The data loading time is still zero, BUT the GPU memory usage increases from one epoch to the next. Yes, the data is stored locally.

That’s good to know.
Are you storing the loss or some other output of your model somewhere?
For debugging or printing purposes you should detach() the outputs or store loss.item(), e.g. in a list.
Otherwise, you will store the whole computation graph, which might eventually yield an out of memory error and might also slow down your code.
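
As an illustration (a minimal sketch, reusing the model, criterion, optimizer and train_loader names from the code above, not your actual training loop):

stored_with_graph = []   # each entry keeps the whole graph of its iteration alive
stored_values = []       # plain Python numbers, nothing to free later

for data, target in train_loader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    stored_with_graph.append(loss)     # don't do this
    stored_values.append(loss.item())  # do this (or output.detach() for non-scalar tensors)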

Could this be the case in your code?

OK, so I am storing loss.item(), yes, but I don't use detach() anywhere in my code.

Could you please show me a snippet of code indicating where I should put detach()?
Actually, I have even seen people on GitHub use detach() after the initialization of the noise.

If you are using item(), you won't need to call detach() additionally.
However, you would need to call detach() on other tensors you are storing and which are still connected to the computation graph.
Could you post a code snippet to reproduce this slow down and how long would it take to reach the slow down?
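
To see whether a tensor you are storing is still attached to the computation graph, you can look at its grad_fn attribute (a minimal self-contained sketch):

import torch

x = torch.randn(4, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)           # <SumBackward0 object ...> -> still attached to the graph
print(y.detach().grad_fn)  # None -> detached copy, safe to store
print(y.item())            # plain Python float, no graph at all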

How could I detect which tensors are still connected to the computation graph?
So I have printed time.time() after each step and I have found the following:

  • In the first epoch it takes 1 second to do the discriminator backpropagation and almost zero for the generator backpropagation.
  • After 28 epochs it takes 3 seconds for the discriminator backpropagation and 4 seconds for the generator backpropagation.
    Actually, I use loss.item() while writing to TensorBoard; does this make a difference?
for epoch in range(opt.niter):
    print ("epoch||||||||||||||||", time.time())
    for i, batch in enumerate(dataloader,0):
        data_time = AverageMeter('Data', ':6.3f')
        print("data_time",data_time)
        print ("iteration============", time.time())
        for iter in range(3):
            print ("discriminator_iteration", time.time())
            f = open(work_dir+"training_curve.csv", "a")
            autograd.detect_anomaly()       
            ############################
            # (1) Update D network
            ############################
            ### PART ---1--- training the discriminator on real data ####
            optimizerD.zero_grad()
            x = Variable(batch.type(Tensor))
            noise = Variable(Tensor(batch.size(0), nz, 1, 1, 1).normal_(0, 1))
            x_tilde = Variable(netG(noise), requires_grad=True)
            epsilon = Variable(Tensor(batch.size(0), 1, 1, 1, 1).normal_(0, 1))
            x_hat = epsilon*x + (1 - epsilon)*x_tilde
            x_hat = torch.autograd.Variable(x_hat, requires_grad=True)
            dw_x = netD(x_hat)
        
            grad_x = torch.autograd.grad(outputs=dw_x, inputs=x_hat,
                                        grad_outputs=Variable(Tensor(dw_x.size()).fill_(1.0), requires_grad=False),
                                        create_graph=True, retain_graph=True, only_inputs=True)

            grad_x = grad_x[0].view(batch.size(0), -1)
            grad_x = grad_x.norm(p=2, dim=1)
            d_loss = torch.mean(netD(x_tilde)) - torch.mean(netD(x)) + LAMBDA*torch.mean((grad_x - 1)**2) 
            gradient_penal=LAMBDA*torch.mean((grad_x - 1)**2)
            wgan_loss=torch.mean(netD(x_tilde)) - torch.mean(netD(x))
            print ("before back discriminator|||||||||||||||", time.time())
            d_loss.backward()
            print ("after back discriminator==============", time.time())
            for p in netD.parameters():
                p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
            print ("after gradient clipping|||||||||||", time.time())
            for p in list(filter(lambda p: p.grad is not None, netD.parameters())):
                print("Discriminator",p.grad.data.norm(2).item())
            optimizerD.step()
        ###########################
        # (2) Update G network
        ###########################
        optimizerG.zero_grad()
        print ("before generator iter||||||||||||", time.time())
        noise = Variable(Tensor(batch.size(0), nz, 1, 1,1).normal_(0, 1))
        imgs_fake = netG(noise)
        g_loss = -torch.mean(netD(imgs_fake)) ##minimiser
        print ("before back generator==========", time.time())
        g_loss.backward()
        print ("after back generator||||||||||||", time.time())
        for p in netG.parameters():
            p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
        for p in list(filter(lambda p: p.grad is not None, netG.parameters())):
            print("Generator",p.grad.data.norm(2).item())
        print ("befor weight update generateur===========", time.time())
        optimizerG.step()
       ## TensorBoard
        writer.add_scalar('G_loss', g_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('D_loss', d_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('gradient_penal', gradient_penal.item(), i + epoch * len(dataloader))
        writer.add_scalar('wgan_loss', wgan_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('fake_loss', fake_loss.item(), i + epoch * len(dataloader))
        writer.add_scalar('real_loss', real_loss.item(), i + epoch * len(dataloader))
        f.write('[%d/%d][%d/%d] Loss_D: %.4f Gradient_p: %.4f WGAN_loss: %.4f Loss_G: %.4f'
            % (epoch, opt.niter, i, len(dataloader),
                d_loss.data, gradient_penal.data, wgan_loss.data, g_loss.data))
        f.write('\n')
        f.close()

Try tqdm.

Example

from tqdm import tqdm

for i, (images, targets) in tqdm(
        enumerate(self._train_loader),
        total=len(self._train_loader),
        leave=False):
    ...  # your training step

It will show something like this:
89%|######################################################### | 534/599 [07:56<00:57, 1.13it/s]
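
If you want the current loss in the bar as well, you could wrap the loader directly and update the postfix. A minimal sketch, assuming your existing train_loader and a hypothetical train_step helper that returns the loss tensor:

from tqdm import tqdm

pbar = tqdm(train_loader, desc='Epoch {}'.format(epoch), leave=False)
for data, target in pbar:
    loss = train_step(data, target)  # hypothetical helper: forward/backward/step
    pbar.set_postfix(loss='{:.4f}'.format(loss.item()))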

Thanks for the code!

  • autograd.detect_anomaly() should be quite expensive, so if your code runs fine, I would recommend removing this call.
  • Variable(batch.type(Tensor)): Variables are deprecated since 0.4.0, so you can just use tensors in newer versions. Also, if you want to convert the dtype, you should use to() or directly call the type as e.g. float().
  • I haven’t debugged your code, but do you really need to retain the graph in grad_x? I’m a bit worried it might be stored unnecessarily.
  • Since you are running the code on a GPU and CUDA calls are asynchronous, you should call torch.cuda.synchronize() before starting and stopping the timer (see the sketch below). Otherwise your profiling might yield wrong results, e.g. by only measuring the kernel launch times.
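
A minimal timing sketch with the synchronization calls, assuming your model, criterion and the data/target tensors are already on the GPU:

import time
import torch

torch.cuda.synchronize()   # make sure all pending GPU work is done before starting the timer
t0 = time.time()

output = model(data)
loss = criterion(output, target)
loss.backward()

torch.cuda.synchronize()   # wait for the GPU to finish before stopping the timer
print('{:.3f} seconds'.format(time.time() - t0))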

Thanks for your response!
This has accelerated the training process.