Network uses more GPU memory when you reload it compared to starting from scratch

Hi,

I have found some strange behavior when trying to continue the training of a network.
I have a Titan Xp with 12 GB of RAM, and one of the networks I train on it takes around 6 GB.
When I stop training and later want to resume, I reload the network and suddenly it takes up 8 GB on the GPU. Because of this I can't train two networks on my GPU, and if I want to train both I have to restart training from scratch.

Here is my saving code:

def save_checkpoint(state, filename):
    torch.save(state, filename)

save_checkpoint({
    'epoch': epoch,
    'state_dict': net.net.state_dict(),
    'optimizer': optimizer.state_dict(),
}, OUTPUT_DIR + 'siamese_epoch{}.pth.tar'.format(epoch))

My loading code is:

def reload_network(reload_model, gpu):
    save_point = torch.load(reload_model, map_location=lambda storage, loc: storage.cuda(gpu))
    return save_point

save_point = reload_network(args.reload, args.gpu)
start_epoch = save_point["epoch"] + 1

net.net.load_state_dict(save_point["state_dict"])
net.cuda(args.gpu)
optimizer.load_state_dict(save_point['optimizer'])

Did you delete the reference to save_point after loading?

Thanks, that solved it! It's strange that this isn't mentioned anywhere in the docs, as far as I've seen.

Hmm, if you don't delete it, you keep a separate copy of the weights… If you hold a reference to a Python object, it won't get deallocated… I don't think this belongs in the PyTorch docs, though…
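
For anyone else hitting this, here is a minimal sketch of the fix discussed above, reusing the net, optimizer, and args names from the earlier snippets. The del is the actual fix; the torch.cuda.empty_cache() call is an extra, optional step I've added (not part of the original code), and another option is to load the checkpoint with map_location='cpu' so the temporary copy never lands on the GPU at all.

# Same loading flow as in the original post.
save_point = torch.load(args.reload,
                        map_location=lambda storage, loc: storage.cuda(args.gpu))
start_epoch = save_point['epoch'] + 1

net.net.load_state_dict(save_point['state_dict'])
net.cuda(args.gpu)
optimizer.load_state_dict(save_point['optimizer'])

# Drop the checkpoint dict so its copy of the weights can be garbage-collected.
del save_point
# Optionally hand cached blocks back to the driver so nvidia-smi reflects the drop.
torch.cuda.empty_cache()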