I am probably doing something wrong here, but I am finding it hard to pin down what exactly is going on. I am trying to train a ResNet50 on CPU to recognize the whale dataset (Kaggle). My machine has 16 GB of RAM and is running Linux (Ubuntu, torch version 0.3.0.post4).
The main training loop looks like the following:
```python
import gc

import torch
import torchvision
import tqdm

model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(in_features=2048, out_features=len(distinct_labels))
model = model.train()

def train(model, loss, optimizer, X_train, Y_train, n_epochs=200, batch_size=10):
    ## these lines were originally inside the loop
    x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
    x_t = torch.autograd.Variable(x_t)
    y_t = torch.autograd.Variable(y_t)
    ########
    for i in tqdm.tqdm(range(0, n_epochs)):
        for j in tqdm.tqdm(range(0, len(Y_train))):
            #x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
            #x_t = torch.autograd.Variable(x_t)
            #y_t = torch.autograd.Variable(y_t)
            gc.collect()
            optimizer.zero_grad()
            # evaluate and optimize
            outputs = model(x_t)
            l_compute = loss(outputs, y_t)
            l_compute.backward()
            optimizer.step()
    return model

optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)
loss = torch.nn.CrossEntropyLoss()
for param in model.parameters():
    param.requires_grad = True

model = train(model, loss, optimizer, images, im_labels, n_epochs=1, batch_size=1)
```
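For reference, `getBatch` is just a small helper that draws a random mini-batch from the pre-loaded image and label tensors (`images`, `im_labels`). Its exact implementation is not shown above; a minimal sketch of the assumed behaviour looks roughly like this:

```python
import torch

def getBatch(X, Y, size=10):
    # Hypothetical sketch: X is an N x 3 x H x W FloatTensor, Y an N-element LongTensor.
    # Pick `size` random indices and return the corresponding inputs and targets.
    idx = torch.randperm(X.size(0))[:size]
    return X.index_select(0, idx), Y.index_select(0, idx)
```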
Even when I do the sampling outside the loop (selecting `x_t` and `y_t` only once, as in the code above), the process keeps growing in RAM and is killed by the OS after about 5 iterations.
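To confirm that it really is this Python process that is growing (rather than something else on the machine), one way I could log memory usage each iteration is something like the following sketch; it assumes the optional `psutil` package is installed and is not part of the training code above:

```python
import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(tag=""):
    # Print the resident set size of the current process in megabytes.
    rss_mb = _process.memory_info().rss / (1024 ** 2)
    print("{} RSS: {:.1f} MB".format(tag, rss_mb))
```

Calling `log_rss("after step")` right after `optimizer.step()` would show whether the resident memory climbs on every inner-loop iteration.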
Am I doing anything wrong in the inner loop?