Hi! I’m new to PyTorch and convolutional neural networks. I’m working on a physics-informed NN (PINN) where my CNN predicts stress values at mesh nodes, and I use those to compute nodal forces in a custom (finite element–style) function as part of my loss.
The Problem
- My model and data are all on the CPU, not the GPU.
- Every epoch, RAM usage of my training process grows by several GB, even though my tensor and Python object counts (checked with the garbage collector) stay flat.
- After several epochs, my script crashes because it runs out of memory.
Here is my simplified training loop:

```python
for epoch in range(num_epochs):
    optimizer.zero_grad()
    predicted_stress = model(dlX)  # CNN output, shape [batch, 6, 1, nNodes]
    loss = forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order)
    loss.backward()
    optimizer.step()
```
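For context, this is roughly how I watch memory per epoch (a minimal sketch using `psutil`; the logging is just for monitoring and not part of the model):

```python
import os
import psutil

process = psutil.Process(os.getpid())

for epoch in range(num_epochs):
    optimizer.zero_grad()
    predicted_stress = model(dlX)
    loss = forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order)
    loss.backward()
    optimizer.step()

    # Resident set size in GB after each epoch -- this is the number that keeps climbing.
    rss_gb = process.memory_info().rss / 1024**3
    print(f"epoch {epoch}: loss={loss.item():.4e}, RSS={rss_gb:.2f} GB")
```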
Here is my simplified forward loss:

```python
def forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order):
    batch_size = predicted_stress.shape[0]
    loss = torch.tensor(0.0, dtype=predicted_stress.dtype, device=predicted_stress.device)
    for i in range(batch_size):
        y = predicted_stress[i].squeeze(1)  # [6, nNodes]
        nodal_forces = compute_nodal_internal_forces_torch(y, model, ecoords, gauss_order)
        af = applied_forces.squeeze(0)[i]  # [nNodes, 3]
        # Huber loss on the force residual (simplified here, just as an example)
        diff = nodal_forces - af
        huber_vec = torch.nn.functional.huber_loss(diff, torch.zeros_like(diff), reduction='none')
        loss = loss + huber_vec.sum()
    return loss
```
Here is the function `compute_nodal_internal_forces_torch` used inside the loss function:

```python
def compute_nodal_internal_forces_torch(predicted_stress, model, ecoords, gauss_order):
    device = predicted_stress.device
    elements = model['elements']        # element connectivity array
    nElements = elements.shape[0]
    nGauss = gauss_order ** 3           # assuming full 3D Gauss quadrature on hex elements
    nNodes = ecoords.shape[0]
    nDim = 3

    force_indices = []
    force_values = []
    for elem in range(nElements):
        node_ids = elements[elem, 1:] - 1       # [8], 1-based to 0-based node ids
        elem_coords = ecoords[node_ids, :]
        for gp in range(nGauss):
            # ... shape functions, Jacobian, etc. ...
            for local_node in range(8):
                global_node = node_ids[local_node]
                # Build B, compute f_internal as usual
                force_indices.append(global_node)
                force_values.append(f_internal)

    indices = torch.tensor(force_indices, dtype=torch.long, device=device)
    values = torch.stack(force_values)
    nodal_forces = torch.zeros((nNodes, nDim), dtype=predicted_stress.dtype, device=device)
    nodal_forces = nodal_forces.index_add(0, indices, values)
    return nodal_forces
```
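In case a runnable snippet helps, here is a stripped-down, self-contained version of just the assembly pattern (random data, made-up sizes, and a placeholder in place of the real B-matrix/Gauss-point work), showing the same list-append + `torch.stack` + `index_add` structure under autograd:

```python
import torch

def assemble_pattern(stress, connectivity, nNodes):
    # Same structure as my assembly: Python loops append per-node force
    # tensors (still attached to the autograd graph), then one stack +
    # index_add at the end.
    force_indices = []
    force_values = []
    for elem_nodes in connectivity.tolist():      # one list of 8 node ids per element
        for global_node in elem_nodes:
            # Placeholder for the real B-matrix / Gauss-point computation:
            f_internal = stress[:, global_node].sum() * torch.ones(3)
            force_indices.append(global_node)
            force_values.append(f_internal)
    indices = torch.tensor(force_indices, dtype=torch.long)
    values = torch.stack(force_values)            # [nEntries, 3]
    nodal_forces = torch.zeros((nNodes, 3), dtype=stress.dtype)
    return nodal_forces.index_add(0, indices, values)

# Made-up sizes, just for illustration
nNodes, nElements = 200, 100
stress = torch.randn(6, nNodes, requires_grad=True)
connectivity = torch.randint(0, nNodes, (nElements, 8))

nodal_forces = assemble_pattern(stress, connectivity, nNodes)
nodal_forces.sum().backward()
```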
What I Have Checked/Tried
- No in-place (`+=`) tensor modifications; only `index_add` at the end.
- No accumulation/storing of tensors or outputs across epochs.
- RAM monitored using `psutil`; tensor/object counts monitored using `gc.get_objects()` (see the snippet after this list).
- If I wrap my force assembly in `with torch.no_grad():`, RAM does not grow (but then my network can't train).
- If I replace the force assembly with a dummy zero tensor, RAM does not grow.
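For reference, this is roughly how I count live tensors between epochs (a sketch; `count_live_tensors` is just a helper name I made up):

```python
import gc
import torch

def count_live_tensors():
    # Count tensors currently tracked by the Python garbage collector.
    # In my runs this count stays flat across epochs even while RSS grows.
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
        except Exception:
            pass
    return n
```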
My Main Questions:
- Is there a PyTorch-safe, memory-leak-proof way to assemble nodal forces from model output for use in a PINN loss on CPU?
- Why does even the out-of-place `index_add` pattern cause RAM to climb steadily with autograd active?
- Is there a “best practice” for custom finite element/PINN force/tensor assembly that supports backpropagation without blowing up memory on CPU?
Any clear explanation or example code is welcome. I’m new to PyTorch and neural networks, so I’d appreciate detailed tips!