The CPU memory usage keeps building up during training

Hi community!

I am trying to use a neural network to learn a black-box dynamics model that predicts the dynamics of a system from the current state and input.

When I train the network, the CPU memory usage keeps building up, even though I am doing all the training on the GPU (I move the model, datasets and all parameters to ‘cuda’), until at some point the process is killed with ‘out of memory’. It only happens on one of my machines and not on another, although both run the exact same code and the same versions of Ubuntu (20.04), pytorch, torchvision and torchaudio. Every time I return something from train_loop(), I detach it first.
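
One way to double-check that the two environments really match is to print the installed versions on each machine and diff the output. This is only a minimal sketch using the standard library; the helper name `environment_report` is hypothetical, and it only covers the packages named above, so differences in drivers or other dependencies would not show up here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "torchvision", "torchaudio")):
    """Collect interpreter, OS and package versions into a dict (hypothetical helper)."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the two outputs stay comparable.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print(f"{key}: {value}")
```

Running this on both machines and comparing the printed lines should show quickly whether anything in this short list differs.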

Here is the forward function in my NN module. I use the network to make a 100-step forward prediction and return the predictions. self.customNN() is an nn.Sequential module that defines the network structure.

def forward(self, X):
        # Input a batch of samples x
        # Initialize an empty tensor for predictions
        predictions = torch.empty(size=(len(X), self.state_num * self.horizon))
        if self.device == 'gpu':
            predictions = predictions.to('cuda')
        for i in range(self.horizon):
            if i == 0:
                # Get the input for the first prediction
                input = X[:, :self.state_num + self.input_num]
            else:
                # Get the newest prediction, and the newest action
                last_prediction = predictions[:, self.state_num*(i - 1) : self.state_num*i]
                new_action = X[:, self.state_num + i*self.input_num : self.state_num + i*self.input_num + 2]
                input = torch.cat([last_prediction, new_action], dim=1)

            # Predict acceleration
            acceleration = self.customNN(input)

            # Predict one step forward with the acceleration
            # alpha[k+1] = alpha[k] + T * (norm[alpha_dot]/norm[alpha]) * alpha_dot[k]
            # beta[k+1] = beta[k] + T * (norm[beta_dot]/norm[beta]) * beta_dot[k]
            # alpha_dot[k+1] = alpha_dot[k] + T * (1/norm[alpha_dot]) * alpha_dotdot_prediction
            # beta_dot[k+1] = beta_dot[k] + T * (1/norm[beta_dot]) * beta_dotdot_prediction

            alpha_k_plus_1 = torch.reshape((input[:, 0] + input[:, 2] * self.sampling_time * \
                                            (self.normalization_factor[2]/self.normalization_factor[0])), (len(input), 1))
            beta_k_plus_1 = torch.reshape((input[:, 1] + input[:, 3] * self.sampling_time * \
                                           (self.normalization_factor[3]/self.normalization_factor[1])), (len(input), 1))
            alpha_dot_k_plus_1 = torch.reshape((input[:, 2] + acceleration[:, 0] * self.sampling_time * \
                                                (1/self.normalization_factor[2])), (len(input), 1))
            beta_dot_k_plus_1 = torch.reshape((input[:, 3] + acceleration[:, 1] * self.sampling_time * \
                                               (1/self.normalization_factor[3])), (len(input), 1))

            # Update the predictions tensor with the new predictions
            predictions[:, self.state_num * i : self.state_num * (i + 1)] = torch.cat(
                [alpha_k_plus_1, beta_k_plus_1, alpha_dot_k_plus_1, beta_dot_k_plus_1], dim=1)

        return predictions
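
To make the update equations in the comments above easier to check, here is a single step written out in plain Python for one sample. It is only a sketch: the function name `euler_step` and the numbers are made up for illustration, and it assumes each state is stored divided by its normalization factor while the predicted acceleration is not normalized, which is how I read the formulas.

```python
def euler_step(state, accel, T, norm):
    """One forward-Euler step on normalized states (illustrative helper).

    state: (alpha, beta, alpha_dot, beta_dot), each assumed divided by its entry in norm
    accel: (alpha_dotdot, beta_dotdot), assumed to be unnormalized
    T:     sampling time
    norm:  normalization factors in the same order as state
    """
    alpha, beta, alpha_dot, beta_dot = state
    # Positions integrate the velocities, rescaled between the two normalizations.
    alpha_next = alpha + T * (norm[2] / norm[0]) * alpha_dot
    beta_next = beta + T * (norm[3] / norm[1]) * beta_dot
    # Velocities integrate the predicted accelerations, scaled into normalized units.
    alpha_dot_next = alpha_dot + T * (1.0 / norm[2]) * accel[0]
    beta_dot_next = beta_dot + T * (1.0 / norm[3]) * accel[1]
    return (alpha_next, beta_next, alpha_dot_next, beta_dot_next)


# Made-up values, chosen so the arithmetic is easy to follow by hand.
norm = (2.0, 2.0, 4.0, 4.0)
state = (0.5, 0.25, 0.5, -0.5)
accel = (8.0, 0.0)
next_state = euler_step(state, accel, T=0.5, norm=norm)
print(next_state)
```

With these values the unnormalized alpha is 1.0 and the unnormalized alpha_dot is 2.0, so after T = 0.5 the unnormalized alpha is 2.0, which is 1.0 again after dividing by its factor of 2.0.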

Here is the train_loop() that is called every epoch. In this function, I call the forward function of the NN module to get the 100-step predictions, compare them with the 100-step ground-truth trajectory to calculate the MSE loss, and then do the gradient descent step. I also delete the losses after they have been used.

def train_loop(dataloader, model, loss_fn, optimizer, device, epoch, penalization_option):
    # Choose whether to print the offset during training
    print_steady_state = False

    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    training_loss = torch.tensor(0.0)

    if device == 'gpu':
        training_loss = training_loss.to('cuda')

    for batch, (X, y) in enumerate(dataloader):
        if device == 'gpu':
            X, y = X.to('cuda'), y.to('cuda')
        X = X.float()
        y = y.float()

        predictions = model(X)

        alpha_loss = loss_fn(predictions[:,::4], y[:,::4])
        beta_loss = loss_fn(predictions[:,1::4], y[:,1::4])
        alpha_dot_loss = loss_fn(predictions[:,2::4], y[:,2::4])
        beta_dot_loss = loss_fn(predictions[:,3::4], y[:,3::4])

        loss = loss_fn(predictions, y)

        training_loss += loss.detach()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 25 == 0:
            loss, current = loss.detach(), (batch + 1) * len(X)
            alpha_loss, beta_loss, alpha_dot_loss, beta_dot_loss = \
                alpha_loss.detach(), beta_loss.detach(), alpha_dot_loss.detach(), beta_dot_loss.detach()
            print(f"Prediction Loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
            print(f'Alpha Loss: {alpha_loss:>7f}')
            print(f'Beta Loss: {beta_loss:>7f}')
            print(f'Alpha Dot Loss: {alpha_dot_loss:>7f}')
            print(f'Beta Dot Loss: {beta_dot_loss:>7f}')

        del predictions, alpha_loss, beta_loss, alpha_dot_loss, beta_dot_loss

    training_loss /= num_batches
    print(f"Avg training loss: {training_loss:>8f} \n")
    return training_loss.detach()

It’s weird that the memory leak happens on one of the machines but not on the other. It would be nice if I could get any idea of why this happens.

Thank you so much for your help!

This is indeed weird, since based on your description every library version is the same on both machines. In that case I doubt we will be able to narrow it down, as the code apparently doesn’t reproduce the issue on the other machine.
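
If you want to dig further on the affected machine, it might help to log the host memory of the process at a few points and see where it grows. Below is a minimal sketch using only the standard library; the helper names `peak_rss_mb` and `log_memory` are hypothetical, and the list of byte arrays is just a stand-in for a training step. Note that it reports the peak resident set size, which never decreases, so it shows growth but not memory being freed.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MiB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag, history):
    """Append (tag, peak RSS) to history and print the growth since the last entry."""
    current = peak_rss_mb()
    previous = history[-1][1] if history else current
    history.append((tag, current))
    print(f"{tag}: peak RSS {current:.1f} MiB (+{current - previous:.1f} MiB)")
    return current


history = []
log_memory("start", history)
# Stand-in for a training step; in the real script this would be one epoch.
buffers = [bytearray(10 * 1024 * 1024) for _ in range(3)]
log_memory("after step", history)
```

Calling `log_memory` once per epoch, and then around smaller pieces such as iterating the dataloader, the forward pass and the printing, could show which part is responsible for the growth.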