Training Killed During Epoch

In the middle of training, during the first epoch, my process was killed. Checking /var/log/syslog, I found "Out of memory: Killed process 70837 (python)", so the SIGKILL appears to have come from the kernel's OOM killer. I have also verified that if I reduce my batch size from 16 to 8, training runs to completion. This leads me to believe that my RAM usage grows with each iteration but resets each epoch. What could be causing this, and how can I fix it?
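For reference, this is the kind of check I am thinking of adding inside the training loop to confirm the per-iteration growth in host RAM (just a sketch; psutil and log_rss are not part of my current code):

import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size (host RAM) of this Python process, in MiB
    rss_mib = _process.memory_info().rss / (1024 ** 2)
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")

Calling log_rss(step) at the end of every iteration should show whether the growth really is per-iteration.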

Here are notable snippets of my code:

model.train()
for t in range(num_of_epochs):
    for step in range(num_of_batches):
        length_data, input, target = loader.get_a_batch()  # Get a batch of data from my custom dataloader class
        if torch.cuda.is_available():
            input = input.cuda()
            target = target.cuda()
        
        model.zero_grad()
        output = model(input)

        loss = loss_func(target, output, length_data)  # Read below for specific code snippet
        
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_val)
        optimizer.step()
        
        loss = loss.item()
        if loss < best_loss:
            torch.save(model.state_dict(), ...)
            best_loss = loss

where

def loss_func(target, output, length_data):
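    # Sum over the batch of each sequence's mean squared error,
    # computed only over its valid (unpadded) timesteps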
    loss_fn = torch.nn.MSELoss()

    sum_loss = 0.0
    for i in range(len(length_data)):
        length = length_data[i]
        sum_loss += loss_fn(output[i, 0:length, :], target[i, 0:length, :])
        
    return sum_loss

Reading previous discussions, I feel my problem might come from loss_func, where I am summing the losses. However, I think the way I am currently doing it is necessary for backpropagation. Would anyone have any advice? Thank you.
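For context, here is a masked, loop-free version of the same loss that I could try instead (only a sketch: it assumes output and target are padded to shape (batch, max_len, feat), that the padded positions hold finite values, and that length_data gives each sequence's valid length; I am not certain it behaves identically under backpropagation, nor whether it would change the memory behaviour):

import torch

def masked_loss_func(target, output, length_data):
    # output/target: (batch, max_len, feat), padded past each sequence's length
    _, max_len, feat = output.shape
    lengths = torch.as_tensor(length_data, device=output.device)
    # mask[i, t] is True for the valid (non-padded) timesteps of sequence i
    mask = torch.arange(max_len, device=output.device)[None, :] < lengths[:, None]
    # zero out padded positions (assumes they contain finite values)
    sq_err = (target - output) ** 2 * mask[:, :, None]
    # per-sequence mean over valid elements, then summed over the batch,
    # which is what loss_func above computes
    per_seq = sq_err.sum(dim=(1, 2)) / (lengths * feat)
    return per_seq.sum()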

Note: My model is an RNN that deals with variable-length input, which is why the code above is formatted the way it is.