Hi community!

I am trying to use a neural network to learn a black-box dynamics model that predicts the dynamics of a system from the current state and input.

When I train the network, CPU memory usage keeps building up even though I am doing all the training on the GPU (I move the model, datasets and all parameters to 'cuda'), until at some point the process is killed with an 'out of memory' error. It only happens on one of my machines and not on the other. Both run the exact same code and the same versions of Ubuntu (20.04), PyTorch, torchvision and torchaudio. Every time I return something from train_loop(), I detach it first.

Here is the forward function of my NN module. I use the network to roll out a forward prediction over 100 steps and return the predictions. self.customNN is an nn.Sequential module that defines the network structure.

```
def forward(self, X):
    # Input a batch of samples X
    # Initialize an empty tensor for predictions
    predictions = torch.empty(size=(len(X), self.state_num * self.horizon))
    if self.device == 'gpu':
        predictions = predictions.to('cuda')
    for i in range(self.horizon):
        if i == 0:
            # Get the input for the first prediction
            input = X[:, :self.state_num + self.input_num]
        else:
            # Get the newest prediction, and the newest action
            last_prediction = predictions[:, self.state_num*(i - 1) : self.state_num*i]
            new_action = X[:, self.state_num + i*self.input_num : self.state_num + i*self.input_num + 2]
            input = torch.cat([last_prediction, new_action], dim=1)
        # Predict acceleration
        acceleration = self.customNN(input)
        # Predict one step forward with the acceleration
        # alpha[k+1] = alpha[k] + T * (norm[alpha_dot]/norm[alpha]) * alpha_dot[k]
        # beta[k+1] = beta[k] + T * (norm[beta_dot]/norm[beta]) * beta_dot[k]
        # alpha_dot[k+1] = alpha_dot[k] + T * (1/norm[alpha_dot]) * alpha_dotdot_prediction
        # beta_dot[k+1] = beta_dot[k] + T * (1/norm[beta_dot]) * beta_dotdot_prediction
        alpha_k_plus_1 = torch.reshape((input[:, 0] + input[:, 2] * self.sampling_time * \
            (self.normalization_factor[2]/self.normalization_factor[0])), (len(input), 1))
        beta_k_plus_1 = torch.reshape((input[:, 1] + input[:, 3] * self.sampling_time * \
            (self.normalization_factor[3]/self.normalization_factor[1])), (len(input), 1))
        alpha_dot_k_plus_1 = torch.reshape((input[:, 2] + acceleration[:, 0] * self.sampling_time * \
            (1/self.normalization_factor[2])), (len(input), 1))
        beta_dot_k_plus_1 = torch.reshape((input[:, 3] + acceleration[:, 1] * self.sampling_time * \
            (1/self.normalization_factor[3])), (len(input), 1))
        # Update the predictions tensor with the new predictions
        predictions[:, self.state_num * i : self.state_num * (i + 1)] = torch.cat(
            [alpha_k_plus_1, beta_k_plus_1, alpha_dot_k_plus_1, beta_dot_k_plus_1], dim=1
        )
    return predictions
```
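To make the update rule in the comments easier to check by hand, here is a minimal plain-Python sketch of a single Euler step on normalized states. The function name `euler_step` and the numbers are made up for illustration; they are not taken from my model or dataset.

```python
def euler_step(state, acceleration, T, norm):
    """One forward-Euler step on normalized states.

    state:        (alpha, beta, alpha_dot, beta_dot), all normalized
    acceleration: (alpha_ddot, beta_ddot), e.g. the network output
    T:            sampling time
    norm:         normalization factors for (alpha, beta, alpha_dot, beta_dot)
    """
    alpha, beta, alpha_dot, beta_dot = state
    alpha_ddot, beta_ddot = acceleration
    # Positions advance with the (re-scaled) velocities
    alpha_next = alpha + T * (norm[2] / norm[0]) * alpha_dot
    beta_next = beta + T * (norm[3] / norm[1]) * beta_dot
    # Velocities advance with the (re-scaled) predicted accelerations
    alpha_dot_next = alpha_dot + T * (1.0 / norm[2]) * alpha_ddot
    beta_dot_next = beta_dot + T * (1.0 / norm[3]) * beta_ddot
    return (alpha_next, beta_next, alpha_dot_next, beta_dot_next)


# Illustrative values only
state = (0.0, 1.0, 2.0, -2.0)
acc = (4.0, 8.0)
next_state = euler_step(state, acc, T=0.5, norm=(1.0, 2.0, 2.0, 4.0))
print(next_state)  # (2.0, -1.0, 3.0, -1.0)
```

The forward function applies this same step to a whole batch at once and feeds each result back in as the next input.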

Here is the train_loop() that is called every epoch. In this function, I call the forward function of the NN module to get the 100-step predictions, compare them with the 100-step ground-truth trajectory to compute the MSE loss, and then take a gradient descent step. I also delete the losses after they have been used.

```
def train_loop(dataloader, model, loss_fn, optimizer, device, epoch, penalization_option):
    # Choose whether to print the offset during training
    print_steady_state = False
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    training_loss = torch.tensor(0.0)
    if device == 'gpu':
        training_loss = training_loss.to('cuda')
    for batch, (X, y) in enumerate(dataloader):
        if device == 'gpu':
            X, y = X.to('cuda'), y.to('cuda')
        X = X.float()
        y = y.float()
        model.train()
        predictions = model(X)
        alpha_loss = loss_fn(predictions[:,::4], y[:,::4])
        beta_loss = loss_fn(predictions[:,1::4], y[:,1::4])
        alpha_dot_loss = loss_fn(predictions[:,2::4], y[:,2::4])
        beta_dot_loss = loss_fn(predictions[:,3::4], y[:,3::4])
        loss = loss_fn(predictions, y)
        training_loss += loss.detach()
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch % 25 == 0:
            loss, current = loss.detach(), (batch + 1) * len(X)
            alpha_loss, beta_loss, alpha_dot_loss, beta_dot_loss = \
                alpha_loss.detach(), beta_loss.detach(), alpha_dot_loss.detach(), beta_dot_loss.detach()
            print(f"Prediction Loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
            print(f'Alpha Loss: {alpha_loss:>7f}')
            print(f'Beta Loss: {beta_loss:>7f}')
            print(f'Alpha Dot Loss: {alpha_dot_loss:>7f}')
            print(f'Beta Dot Loss: {beta_dot_loss:>7f}')
            print('')
            log_cpu_ram_usage()
            log_gpu_ram_usage()
            print('')
        del predictions, alpha_loss, beta_loss, alpha_dot_loss, beta_dot_loss
    training_loss /= num_batches
    print(f"Avg training loss: {training_loss:>8f} \n")
    return training_loss.detach()
```
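The memory logging helpers called in the loop are not shown above. As a rough stand-in for the CPU side, here is a minimal sketch that reads the resident set size of the current process and reports how much it grew between calls. It assumes Linux (it reads `/proc/self/statm`), and the names `rss_mib` and `RssTracker` are hypothetical, not my actual helpers.

```python
import os


def rss_mib():
    """Return the resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


class RssTracker:
    """Log current RSS plus the change since the previous call and since start."""

    def __init__(self):
        self.baseline = rss_mib()
        self.last = self.baseline

    def log(self, tag):
        now = rss_mib()
        print(f"{tag}: RSS {now:.1f} MiB "
              f"({now - self.last:+.1f} since last, {now - self.baseline:+.1f} total)")
        self.last = now
        return now


# Toy usage: hold on to a new 10 MiB buffer each step and watch the RSS
tracker = RssTracker()
buffers = []
for step in range(3):
    buffers.append(bytearray(10 * 2**20))
    current = tracker.log(f"step {step}")
```

Calling `tracker.log(...)` once per batch (or every 25 batches, as in the loop above) should show whether the growth is steady per batch or happens in jumps, which may help narrow down where the memory goes.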

It's weird that the memory leak happens on one of the machines but not on the other. It would be great if I could get any idea of why this might happen.
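To double-check that the two environments really match, something like the following sketch could be run on both machines and the outputs diffed. It only uses the standard library; the function name `environment_report` and the default package list are my own choices for illustration, and it does not cover everything that might differ (drivers and system libraries, for example).

```python
import json
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "torchvision", "torchaudio", "numpy")):
    """Collect interpreter, OS, and installed package versions into a dict."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


report = environment_report()
# Sorted keys make the output easy to diff between machines
print(json.dumps(report, indent=2, sort_keys=True))
```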

Thank you so much for your help!

William