I am probably doing something wrong here, but I am finding it hard to pin down what exactly is going on. I am trying to train a ResNet50 on CPU to recognize the whale dataset (Kaggle). My machine has 16 GB of RAM and is running Linux (Ubuntu, torch version 0.3.0.post4).
The main training loop looks like the following:
```python
import gc

import torch
import torchvision
import tqdm

model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(in_features=2048, out_features=len(distinct_labels))
model = model.train()

def train(model, loss, optimizer, X_train, Y_train, n_epochs=200, batch_size=10):
    ## these lines were originally inside the loop
    x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
    x_t = torch.autograd.Variable(x_t)
    y_t = torch.autograd.Variable(y_t)
    ########
    for i in tqdm.tqdm(range(0, n_epochs)):
        for j in tqdm.tqdm(range(0, len(Y_train))):
            #x_t, y_t = getBatch(X_train, Y_train, size=batch_size)
            #x_t = torch.autograd.Variable(x_t)
            #y_t = torch.autograd.Variable(y_t)
            gc.collect()
            optimizer.zero_grad()
            # evaluate and optimize
            outputs = model(x_t)
            l_compute = loss(outputs, y_t)
            l_compute.backward()
            optimizer.step()
    return model

optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)
loss = torch.nn.CrossEntropyLoss()
for param in model.parameters():
    param.requires_grad = True

model = train(model, loss, optimizer, images, im_labels, n_epochs=1, batch_size=1)
```
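For reference, `getBatch` is just a small helper that draws a random mini-batch from the pre-loaded image and label tensors (`images`, `im_labels`). Its exact implementation is not shown above; a minimal sketch of the assumed behaviour looks roughly like this:

```python
import torch

def getBatch(X, Y, size=10):
    # Hypothetical sketch: X is an N x 3 x H x W FloatTensor, Y an N-element LongTensor.
    # Pick `size` random indices and return the corresponding inputs and targets.
    idx = torch.randperm(X.size(0))[:size]
    return X.index_select(0, idx), Y.index_select(0, idx)
```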
Even when I do the sampling outside the loop (selecting `x_t` and `y_t` only once, as in the code above), the process keeps growing in RAM and is killed by the OS after about 5 iterations.
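To confirm that it really is this Python process that is growing (rather than something else on the machine), one way I could log memory usage each iteration is something like the following sketch; it assumes the optional `psutil` package is installed and is not part of the training code above:

```python
import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(tag=""):
    # Print the resident set size of the current process in megabytes.
    rss_mb = _process.memory_info().rss / (1024 ** 2)
    print("{} RSS: {:.1f} MB".format(tag, rss_mb))
```

Calling `log_rss("after step")` right after `optimizer.step()` would show whether the resident memory climbs on every inner-loop iteration.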
Am I doing anything wrong in the inner loop?