When I train a CNN on my dataset, it takes about 7 GB of GPU memory during the (epoch 0, training) phase. However, it jumps to 11 GB from the (epoch 0, testing) phase and stays at 11 GB from then on. It seems that some memory is not released when switching between training and testing.
Since my GPU memory is limited, can I release this unused memory so that I can use a larger batch_size?
Here is my training loop (adapted from the official example):
for epoch in range(args.start_epoch, args.epochs):
    if args.distributed:
        train_sampler.set_epoch(epoch)
    adjust_learning_rate(optimizer, epoch)

    # train for one epoch
    train(train_loader, model, criterion, optimizer, epoch)

    # evaluate on validation set
    prec1 = validate(val_loader, model, criterion)

    # remember best prec@1 and save checkpoint
    is_best = False  # prec1 > best_prec1
    best_prec1 = max(prec1, best_prec1)
    if epoch % args.save_freq == 0:
        save_checkpoint({
            'epoch': epoch + 1,
            'arch': args.arch,
            'state_dict': model.state_dict(),
            'best_prec1': best_prec1,
            'optimizer': optimizer.state_dict(),
            'loss_acc1': plot_statistic,
        }, is_best)
I suspect this is a PyTorch issue. The version I am using at present is 0.4.0. When I downgrade to 0.3.0.post4, training uses about 7 GB and testing only about 2 GB. Does anyone know what changed between these two versions?
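For reference, my validate() still builds volatile Variables the way the 0.3-era example did, roughly like this (a simplified sketch, not my exact code):

# Inner loop of validate() in the 0.3-era example (simplified).
# In 0.3, volatile=True prevented autograd from recording a graph,
# so validation needed very little memory.
for input, target in val_loader:
    input_var = torch.autograd.Variable(input.cuda(), volatile=True)
    target_var = torch.autograd.Variable(target.cuda(), volatile=True)

    output = model(input_var)
    loss = criterion(output, target_var)

As far as I can tell from the migration notes, 0.4.0 deprecates volatile and ignores it, so this code would build the full autograd graph during validation.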
I replaced volatile=True with torch.no_grad(), and loss.data[0] with loss.item(), but testing still uses about the same amount of memory.
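For clarity, here is roughly the pattern I applied (a simplified sketch, not my exact code):

# Validation loop after the 0.4 changes (simplified sketch).
# The whole loop runs under torch.no_grad() so no autograd graph
# is recorded, and losses are read out as Python numbers with
# .item() instead of being kept as tensors that reference the graph.
model.eval()
total_loss = 0.0
with torch.no_grad():
    for input, target in val_loader:
        input = input.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        output = model(input)
        loss = criterion(output, target)

        total_loss += loss.item()  # .item(), not loss.data[0]

If any tensor from inside the loop were stored without .item() or .detach(), it would keep the graph and its memory alive, but as far as I can see I am not doing that.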
Are you running out of memory? Maybe the memory is just cached and looks like it’s used in nvidia-smi.
You can find more information about the memory management here.
If you want to release the cached memory to the OS, you could call torch.cuda.empty_cache().
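For example, something like this lets you see how much of the number reported by nvidia-smi is just cache (a minimal sketch; memory_allocated() and memory_cached() are available since 0.4, and memory_cached() was renamed memory_reserved() in later releases):

import torch

# Memory actually held by tensors vs. memory kept in PyTorch's cache.
# nvidia-smi reports the sum of both, so a large cached number can
# look like a leak even though the memory is reusable by PyTorch.
print('allocated: %.1f MB' % (torch.cuda.memory_allocated() / 1024**2))
print('cached:    %.1f MB' % (torch.cuda.memory_cached() / 1024**2))

# Release unoccupied cached memory so other GPU applications can use it.
torch.cuda.empty_cache()
print('cached after empty_cache: %.1f MB' % (torch.cuda.memory_cached() / 1024**2))

Note that empty_cache() does not reduce the memory PyTorch actually needs; it only returns the unused cached blocks, so it won't by itself allow a larger batch size.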