Memory issues while training ResNet50


I am probably doing something wrong here, but I am finding it hard to work out what exactly is going on. I am trying to train a ResNet50 on CPU to recognize the whale dataset (Kaggle). My machine has 16 GB of RAM and runs Linux (Ubuntu, torch version 0.3.0.post4).

The main training loop looks like this:

import torch
import torchvision
import tqdm

model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(in_features=2048, out_features=len(distinct_labels))
model = model.train()

def train(model, loss, optimizer, X_train, Y_train, n_epochs=200, batch_size=10):
    ## these lines were originally inside the loop
    x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
    x_t = torch.autograd.Variable(x_t)
    y_t = torch.autograd.Variable(y_t)
    for i in tqdm.tqdm(range(0, n_epochs)):
        for j in tqdm.tqdm(range(0, len(Y_train))):
            #x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
            #x_t = torch.autograd.Variable(x_t)
            #y_t = torch.autograd.Variable(y_t)
            # evaluate and optimize
            outputs = model(x_t)
            l_compute = loss(outputs, y_t)
    return model

optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)
loss = torch.nn.CrossEntropyLoss()

for param in model.parameters():
    param.requires_grad = True

model = train(model, loss, optimizer, images, im_labels, n_epochs=1, batch_size=1)

Even if I do the sampling outside the loop (selecting x_t and y_t once), the process normally runs out of RAM and is killed by the OS after 5 iterations or so.

Am I doing anything wrong in the inner loop?
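
A quick way to tell a genuine leak from a large but steady footprint is to log the process's peak resident memory every few iterations. The sketch below uses only the standard library; `peak_rss_gb` and `log_memory` are hypothetical helper names, and it assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_gb():
    """Return the peak resident set size of this process in GB.

    Assumes Linux, where ru_maxrss is reported in kilobytes.
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / (1024.0 ** 2)


def log_memory(step):
    """Print and return the peak memory seen so far, tagged with the step."""
    peak = peak_rss_gb()
    print("step %d: peak RSS %.2f GB" % (step, peak))
    return peak
```

Calling `log_memory(j)` at the end of the inner loop would show whether the peak keeps climbing with every iteration (something is accumulating) or jumps once and then stays flat (the footprint is simply large).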


Well, it turns out this is not a memory leak. I am simply running out of memory: the model takes about 13 GB of RAM while training (on CPU), and the process sometimes gets killed depending on what else is running on the machine.
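
As a rough sanity check on that number, the weights themselves account for only a small part of it. The sketch below estimates the memory for parameters, gradients, and Adagrad's per-parameter accumulator, assuming 32-bit floats and roughly 25.6 million parameters for a stock ResNet50 (the replaced `fc` layer changes this slightly). The function name is hypothetical. Whatever remains is presumably activations and intermediate buffers, which grow with image size and batch size.

```python
def training_state_mb(n_params, bytes_per_value=4, copies=3):
    """Estimate memory (MB) for weights plus per-parameter training state.

    copies=3 assumes one copy each for weights, gradients, and the
    Adagrad accumulator; activations are not included.
    """
    return n_params * bytes_per_value * copies / (1024.0 ** 2)


# Approximate parameter count of a stock ResNet50 (assumption).
RESNET50_PARAMS = 25.6e6

estimate = training_state_mb(RESNET50_PARAMS)
print("weights + grads + optimizer state: about %.0f MB" % estimate)
```

If the estimate is far below the observed usage, shrinking the input images is likely to save more memory than changing the optimizer.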

I was considering getting an Nvidia GTX 1080 to speed things up. Does anyone have experience training ResNets on that GPU, and will the 8 GB version of this board be enough?