Hi! I’m new to PyTorch and convolutional neural networks. I’m working on a physics-informed NN (PINN) where my CNN predicts stress values at mesh nodes, and I use those to compute nodal forces in a custom (finite element–style) function as part of my loss.
The Problem
- My model and data are all on the CPU, not the GPU.
- Every epoch, RAM usage of my training process grows by several GB, even though my tensor and Python object counts (checked with the garbage collector) stay flat.
- After several epochs, my script crashes because it runs out of memory.
Here is my simplified training loop:

```python
for epoch in range(num_epochs):
    optimizer.zero_grad()
    predicted_stress = model(dlX)  # CNN output, shape [batch, 6, 1, nNodes]
    loss = forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order)
    loss.backward()
    optimizer.step()
```
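For context, this is roughly how I watch memory per epoch (a minimal sketch using `psutil`; the logging is just for monitoring and not part of the model):

```python
import os
import psutil

process = psutil.Process(os.getpid())

for epoch in range(num_epochs):
    optimizer.zero_grad()
    predicted_stress = model(dlX)
    loss = forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order)
    loss.backward()
    optimizer.step()

    # Resident set size in GB after each epoch -- this is the number that keeps climbing.
    rss_gb = process.memory_info().rss / 1024**3
    print(f"epoch {epoch}: loss={loss.item():.4e}, RSS={rss_gb:.2f} GB")
```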
Here is my simplified forward loss:

```python
def forward_loss(predicted_stress, applied_forces, model, ecoords, gauss_order):
    batch_size = predicted_stress.shape[0]
    loss = torch.tensor(0.0, dtype=predicted_stress.dtype, device=predicted_stress.device)
    for i in range(batch_size):
        y = predicted_stress[i].squeeze(1)  # [6, nNodes]
        nodal_forces = compute_nodal_internal_forces_torch(y, model, ecoords, gauss_order)
        af = applied_forces.squeeze(0)[i]  # [nNodes, 3]
        # Huber loss on the force residual (simplified here, just as an example)
        diff = nodal_forces - af
        huber_vec = torch.nn.functional.huber_loss(diff, torch.zeros_like(diff), reduction='none')
        loss = loss + huber_vec.sum()
    return loss
```
Here is the function `compute_nodal_internal_forces_torch` used inside the loss function:

```python
def compute_nodal_internal_forces_torch(predicted_stress, model, ecoords, gauss_order):
    device = predicted_stress.device
    elements = model['elements']        # element connectivity array
    nElements = elements.shape[0]
    nGauss = gauss_order ** 3           # assuming full 3D Gauss quadrature on hex elements
    nNodes = ecoords.shape[0]
    nDim = 3

    force_indices = []
    force_values = []
    for elem in range(nElements):
        node_ids = elements[elem, 1:] - 1       # [8], 1-based to 0-based node ids
        elem_coords = ecoords[node_ids, :]
        for gp in range(nGauss):
            # ... shape functions, Jacobian, etc. ...
            for local_node in range(8):
                global_node = node_ids[local_node]
                # Build B, compute f_internal as usual
                force_indices.append(global_node)
                force_values.append(f_internal)

    indices = torch.tensor(force_indices, dtype=torch.long, device=device)
    values = torch.stack(force_values)
    nodal_forces = torch.zeros((nNodes, nDim), dtype=predicted_stress.dtype, device=device)
    nodal_forces = nodal_forces.index_add(0, indices, values)
    return nodal_forces
```
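In case a runnable snippet helps, here is a stripped-down, self-contained version of just the assembly pattern (random data, made-up sizes, and a placeholder in place of the real B-matrix/Gauss-point work), showing the same list-append + `torch.stack` + `index_add` structure under autograd:

```python
import torch

def assemble_pattern(stress, connectivity, nNodes):
    # Same structure as my assembly: Python loops append per-node force
    # tensors (still attached to the autograd graph), then one stack +
    # index_add at the end.
    force_indices = []
    force_values = []
    for elem_nodes in connectivity.tolist():      # one list of 8 node ids per element
        for global_node in elem_nodes:
            # Placeholder for the real B-matrix / Gauss-point computation:
            f_internal = stress[:, global_node].sum() * torch.ones(3)
            force_indices.append(global_node)
            force_values.append(f_internal)
    indices = torch.tensor(force_indices, dtype=torch.long)
    values = torch.stack(force_values)            # [nEntries, 3]
    nodal_forces = torch.zeros((nNodes, 3), dtype=stress.dtype)
    return nodal_forces.index_add(0, indices, values)

# Made-up sizes, just for illustration
nNodes, nElements = 200, 100
stress = torch.randn(6, nNodes, requires_grad=True)
connectivity = torch.randint(0, nNodes, (nElements, 8))

nodal_forces = assemble_pattern(stress, connectivity, nNodes)
nodal_forces.sum().backward()
```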
What I Have Checked/Tried
- No in-place (`+=`) tensor modifications; only `index_add` at the end.
- No accumulation/storing of tensors or outputs across epochs.
- RAM monitored using `psutil`; tensor/object counts monitored using `gc.get_objects()` (see the snippet after this list).
- If I wrap my force assembly in `with torch.no_grad():`, RAM does not grow (but then my network can't train).
- If I replace the force assembly with a dummy zero tensor, RAM does not grow.
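For reference, this is roughly how I count live tensors between epochs (a sketch; `count_live_tensors` is just a helper name I made up):

```python
import gc
import torch

def count_live_tensors():
    # Count tensors currently tracked by the Python garbage collector.
    # In my runs this count stays flat across epochs even while RSS grows.
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
        except Exception:
            pass
    return n
```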
My Main Questions:
- Is there a PyTorch-safe, memory-leak-proof way to assemble nodal forces from model output for use in a PINN loss on CPU?
- Why does even the out-of-place `index_add` pattern cause RAM to climb steadily with autograd active?
- Is there a “best practice” for custom finite element/PINN force/tensor assembly that supports backpropagation without blowing up memory on CPU?
Any clear explanation or example code is welcome. I’m new to PyTorch and neural networks, so I’d appreciate detailed tips!