Hi everyone,
I am investigating a persistent CPU RAM growth issue during neural network training and would appreciate any ideas or suggestions.
The behavior has been reproduced:
- on two different machines,
- with three different model architectures,
- including a minimal LSTM → Linear model.
Environment
- PyTorch: 2.7.1
- CUDA: 12.8
- cuDNN: 9.10.0.2 (reported by torch.backends.cudnn.version())
- Python: 3.13.3
The issue has been reproduced on two different systems:
System 1
- Windows VM running on a Linux server
- NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- 128 GB GPU memory
System 2
- Native Windows 11 machine
- NVIDIA RTX 6000 Ada Generation
- 111 GB GPU memory
The behavior is visible on both systems.
Observed behavior
CPU RAM usage continuously increases during training.
Important observations:
- RAM growth occurs during a single training run.
- RAM grows continuously rather than in large jumps.
- Available system RAM decreases accordingly (this is not only an RSS reporting artifact).
- GPU utilization remains high.
- GPU memory usage remains essentially constant.
- Occasionally several GB of RAM are released, but these releases do not correlate with:
  - epoch boundaries
  - seed boundaries
  - trial boundaries.
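To compare process RSS against system-wide available RAM, both values can be logged at the same point in every epoch. The following is a minimal sketch, assuming the third-party psutil package is installed; `sample_ram_gb` and `format_ram_line` are hypothetical helper names and not part of the original training code.

```python
import os


def sample_ram_gb():
    """Return (process RSS, system-wide available RAM) in GB.

    Assumes the third-party psutil package is installed.
    """
    import psutil  # imported lazily so the formatter below works without it

    gb = 1024 ** 3
    rss = psutil.Process(os.getpid()).memory_info().rss / gb
    available = psutil.virtual_memory().available / gb
    return rss, available


def format_ram_line(epoch, rss_gb, available_gb):
    """Format one log line so RSS and available RAM can be compared over time."""
    return f"Epoch {epoch} end: RSS {rss_gb:.2f} GB, available {available_gb:.2f} GB"
```

Calling `format_ram_line(epoch, *sample_ram_gb())` at the end of each epoch gives one line per epoch. If RSS rises while available RAM falls by a similar amount, the growth is unlikely to be a reporting artifact.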
Training setup
The issue was originally observed during Optuna hyperparameter optimization studies.
The workflow is structured as follows:
- One Optuna trial corresponds to one hyperparameter configuration.
- Each trial is evaluated using multiple random seeds.
- For each seed, a new model instance and optimizer are created and trained.
- After each seed, the model and optimizer are deleted.
- After each trial, the corresponding DataLoaders are deleted.
The RAM growth is observed both:
- during individual model training runs, and
- across longer Optuna studies.
However, large RAM releases do not appear to correlate with the end of a seed or the end of a trial.
Models tested
Original encoder-decoder LSTM
Stacked LSTM encoder with a small decoder.
Seq2Seq model
Tested in both modes:
- autoregressive decoder
- non-autoregressive decoder
Minimal reproduction
A simple LSTM → Linear model without any encoder-decoder logic.
The minimal model still shows the issue.
What has already been ruled out
Backward pass
The RAM growth still occurs when training is replaced by:
with torch.no_grad():
    out = model(x_batch)
So no gradients are being computed.
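For anyone who wants to try this, a forward-only loop around an LSTM → Linear model could look roughly like the sketch below. It is not my exact code: the class name `MinimalLSTM`, the function `forward_only_loop`, all layer sizes, and the random input data are placeholders chosen for illustration.

```python
import torch
from torch import nn


class MinimalLSTM(nn.Module):
    """LSTM -> Linear with no encoder-decoder logic (sizes are placeholders)."""

    def __init__(self, n_features=8, hidden_size=32, n_outputs=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_outputs)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1])  # predict from the last time step


def forward_only_loop(n_steps=100, batch_size=64, seq_len=50, n_features=8):
    """Run forward passes only, so no autograd graph is built."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = MinimalLSTM(n_features=n_features).to(device)
    out = None
    for _ in range(n_steps):
        # Random data stands in for a real DataLoader batch.
        x_batch = torch.randn(batch_size, seq_len, n_features, device=device)
        with torch.no_grad():
            out = model(x_batch)
    return out
```

Setting `torch.backends.cudnn.enabled` before calling `forward_only_loop()` allows the same loop to be run with and without cuDNN while RAM is logged externally.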
Activations
Tested:
- ReLU
- LeakyReLU
- Tanh
No significant effect on the underlying issue.
Dropout
Removing dropout does not eliminate the problem.
Python object leak
I monitored:
len(gc.get_objects())
Values remain essentially constant:
1145931
1146441
1145933
1146439
1146949
...
No monotonic growth.
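A constant total can in principle hide one type growing while another shrinks, so a per-type breakdown is a slightly stricter version of the same check. The sketch below is an illustration only; `count_objects_by_type` and `largest_increases` are hypothetical helper names. Note that `gc.get_objects()` only sees objects tracked by the Python garbage collector, not memory allocated inside native libraries.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Count live, gc-tracked Python objects grouped by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def largest_increases(before, after, top=10):
    """Return the types whose instance count grew the most between two snapshots."""
    growth = {name: after[name] - before.get(name, 0) for name in after}
    grown = [(name, delta) for name, delta in growth.items() if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]
```

Taking one snapshot before an epoch and one after, then passing both to `largest_increases`, lists any type whose count keeps rising.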
tracemalloc
The largest allocations reported are only in the KB range.
No Python-side allocation explains the observed GB-scale RAM growth.
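For reference, a snapshot comparison of this kind can be sketched as follows. This is an illustration rather than my exact script; `top_allocation_growth` and its `run_step` argument are hypothetical names. tracemalloc only traces memory allocated through Python's allocators, so allocations made inside native CUDA or cuDNN code would not appear here.

```python
import tracemalloc


def top_allocation_growth(run_step, n_steps=10, limit=5):
    """Run `run_step` n_steps times and report where Python-side memory grew."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(n_steps):
        run_step()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Each entry carries size_diff (bytes) and the source line that allocated it.
    return after.compare_to(before, "lineno")[:limit]
```

Here `run_step` would be a callable that processes one batch; each returned entry can be printed directly to see the file, line, and size difference.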
Most interesting finding
Disabling cuDNN almost eliminates the issue:
torch.backends.cudnn.enabled = False
CPU RAM growth
Minimal LSTM:
- cuDNN=True: ~0.017 GB / epoch
- cuDNN=False: ~0.00065 GB / epoch
Seq2Seq models show a similar reduction.
Performance impact
Average epoch time:
Minimal LSTM:
- cuDNN=True: ~28 s / epoch
- cuDNN=False: ~61 s / epoch
Speed penalty factor ≈ 2.2
For larger models:
- non-autoregressive Seq2Seq: ~2.0× slower
- autoregressive Seq2Seq: ~1.4× slower
Because of the performance loss, disabling cuDNN is not a practical solution.
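To put the trade-off in one place, the minimal-LSTM figures above can be combined into a growth reduction factor, a slowdown factor, and a growth rate per hour of wall-clock time. This is plain arithmetic on the approximate numbers already reported; the variable names are made up for illustration.

```python
# Approximate figures reported above for the minimal LSTM.
growth_gb_per_epoch = {"cudnn_on": 0.017, "cudnn_off": 0.00065}
seconds_per_epoch = {"cudnn_on": 28.0, "cudnn_off": 61.0}

# How much smaller the per-epoch growth is with cuDNN disabled.
growth_reduction = growth_gb_per_epoch["cudnn_on"] / growth_gb_per_epoch["cudnn_off"]

# How much longer an epoch takes with cuDNN disabled.
slowdown = seconds_per_epoch["cudnn_off"] / seconds_per_epoch["cudnn_on"]

# Growth per hour of wall-clock time, which accounts for the slower epochs.
growth_gb_per_hour = {
    key: growth_gb_per_epoch[key] * 3600.0 / seconds_per_epoch[key]
    for key in growth_gb_per_epoch
}

print(f"growth reduction: {growth_reduction:.1f}x")
print(f"slowdown: {slowdown:.2f}x")
for key, value in growth_gb_per_hour.items():
    print(f"{key}: {value:.3f} GB/hour")
```

With these inputs the per-epoch growth is roughly 26 times smaller without cuDNN, at the cost of epochs that take about 2.2 times as long. Per hour of training, that corresponds to roughly 2.2 GB with cuDNN versus about 0.04 GB without.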
Example RAM log
Autoregressive Seq2Seq:
Epoch 1 end: RAM 17.07 GB
Epoch 2 end: RAM 17.17 GB
Epoch 3 end: RAM 17.28 GB
Epoch 4 end: RAM 17.38 GB
Epoch 5 end: RAM 17.48 GB
Epoch 6 end: RAM 17.58 GB
Epoch 7 end: RAM 17.69 GB
Epoch 8 end: RAM 17.79 GB
Epoch 9 end: RAM 17.89 GB
Available system RAM decreases accordingly.
GPU allocated and reserved memory remain almost unchanged.
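The growth rate in this log can be estimated with a least-squares fit over the nine logged values. The sketch below only uses the numbers shown above; `least_squares_slope` is a hypothetical helper name.

```python
# RAM in GB at the end of epochs 1..9, copied from the log above.
ram_gb = [17.07, 17.17, 17.28, 17.38, 17.48, 17.58, 17.69, 17.79, 17.89]
epochs = list(range(1, len(ram_gb) + 1))


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


slope_gb_per_epoch = least_squares_slope(epochs, ram_gb)

# Per-epoch differences, to check that growth is steady rather than in jumps.
steps = [round(b - a, 2) for a, b in zip(ram_gb, ram_gb[1:])]

print(f"fitted growth: {slope_gb_per_epoch:.3f} GB/epoch")
print(f"per-epoch steps: {steps}")
```

The fitted slope is about 0.10 GB per epoch, and every individual step is 0.10 or 0.11 GB, which matches the observation that RAM grows continuously rather than in large jumps.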
Cleanup attempts
After each seed:
del model, optimizer
torch.cuda.synchronize()
torch.cuda.empty_cache()
gc.collect()
After each trial:
del train_loader, val_loader
torch.cuda.synchronize()
torch.cuda.empty_cache()
gc.collect()
These cleanup steps do not noticeably affect the RAM growth.
Questions
- Has anyone observed similar behavior with:
  - PyTorch 2.7.x
  - CUDA 12.8
  - cuDNN 9.x
  - LSTM/RNN layers?
- Is this a known issue in the cuDNN RNN backend?
- Are there recommended version combinations (PyTorch / CUDA / cuDNN) that avoid this behavior?
- Is there a way to keep cuDNN enabled while avoiding the continuous CPU RAM growth?
Any suggestions or ideas would be greatly appreciated.