Continuous CPU RAM growth during LSTM training with cuDNN enabled (PyTorch 2.7.1 / CUDA 12.8 / cuDNN 9.10.0.2)

Hi everyone,

I am investigating persistent growth in CPU (host) RAM during neural network training and would appreciate any ideas or suggestions.

The behavior has been reproduced:

  • on two different machines,

  • with three different model architectures,

  • including a minimal LSTM → Linear model.

Environment

  • PyTorch: 2.7.1

  • CUDA: 12.8

  • cuDNN: 9.10.0.2 (reported by torch.backends.cudnn.version())

  • Python: 3.13.3

The issue has been reproduced on two different systems:

System 1

  • Windows VM running on a Linux server

  • NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

  • 128 GB GPU memory

System 2

  • Native Windows 11 machine

  • NVIDIA RTX 6000 Ada Generation

  • 111 GB GPU memory

The behavior is visible on both systems.


Observed behavior

CPU RAM usage continuously increases during training.

Important observations:

  1. RAM growth occurs during a single training run.

  2. RAM grows continuously rather than in large jumps.

  3. Available system RAM decreases by a matching amount, so the growth is real and not just an artifact of how RSS is reported.

  4. GPU utilization remains high.

  5. GPU memory usage remains essentially constant.

  6. Occasionally several GB of RAM are released, but these releases do not correlate with:

    • epoch boundaries

    • seed boundaries

    • trial boundaries.


Training setup

The issue was originally observed during Optuna hyperparameter optimization studies.

The workflow is structured as follows:

  • One Optuna trial corresponds to one hyperparameter configuration.

  • Each trial is evaluated using multiple random seeds.

  • For each seed, a new model instance and optimizer are created and trained.

  • After each seed, the model and optimizer are deleted.

  • After each trial, the corresponding DataLoaders are deleted.

The RAM growth is observed both:

  • during individual model training runs, and

  • across longer Optuna studies.

As noted above, the occasional large RAM releases do not correlate with the end of a seed or the end of a trial.


Models tested

Original encoder-decoder LSTM

Stacked LSTM encoder with a small decoder.

Seq2Seq model

Tested in both modes:

  • autoregressive decoder

  • non-autoregressive decoder

Minimal reproduction

A simple LSTM → Linear model without any encoder-decoder logic.

The minimal model still shows the issue.


What has already been ruled out

Backward pass

The RAM growth still occurs when training is replaced by:

with torch.no_grad():
    out = model(x_batch)

No gradients are computed in this mode, so the growth does not depend on autograd or the backward pass.
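
A forward-only reproduction along these lines might look like the sketch below. The layer sizes, batch shape, step counts, and the helper names `memory_gb`, `growth_per_step`, and `run_forward_only` are assumptions made for illustration, not the configuration used in the experiments described here. Memory readings rely on the third-party psutil package, and the loop needs a CUDA device.

```python
import os


def memory_gb():
    """Return (process RSS, available system RAM) in GB, read via psutil."""
    import psutil  # third-party; imported lazily so the pure helpers stay importable

    rss = psutil.Process(os.getpid()).memory_info().rss
    available = psutil.virtual_memory().available
    return rss / 1024**3, available / 1024**3


def growth_per_step(log):
    """Average RSS change per step between the first and last log entries."""
    (first_step, first_rss), (last_step, last_rss) = log[0][:2], log[-1][:2]
    return (last_rss - first_rss) / (last_step - first_step)


def run_forward_only(steps=2000, log_every=200, cudnn_enabled=True):
    """Run an LSTM -> Linear forward pass in a loop and log host memory."""
    import torch
    import torch.nn as nn

    torch.backends.cudnn.enabled = cudnn_enabled
    device = torch.device("cuda")

    # Hypothetical sizes, chosen only to keep the sketch small.
    lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=2,
                   batch_first=True).to(device)
    head = nn.Linear(128, 1).to(device)
    x_batch = torch.randn(64, 100, 32, device=device)

    log = []
    with torch.no_grad():  # forward pass only, no autograd graph is built
        for step in range(steps + 1):
            out, _ = lstm(x_batch)
            _ = head(out[:, -1])  # prediction from the last time step
            if step % log_every == 0:
                torch.cuda.synchronize()  # wait until queued GPU work has finished
                rss, available = memory_gb()
                log.append((step, rss, available))
                print(f"step {step}: RSS {rss:.3f} GB, available {available:.3f} GB")
    return log
```

Running `run_forward_only(cudnn_enabled=True)` and `run_forward_only(cudnn_enabled=False)` in separate processes and comparing `growth_per_step` of the two logs would show whether the cuDNN effect survives outside the full training pipeline.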

Activations

Tested:

  • ReLU

  • LeakyReLU

  • Tanh

Changing the activation function has no significant effect on the RAM growth.

Dropout

Removing dropout does not eliminate the problem.

Python object leak

I monitored:

len(gc.get_objects())

Values remain essentially constant:

1145931
1146441
1145933
1146439
1146949
...

No monotonic growth.
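
A flat total can in principle hide one type growing while another shrinks, so a per-type breakdown is a useful cross-check. The sketch below uses only the standard library; the helper names `count_objects_by_type` and `top_growth` are hypothetical, and the comment marks where an epoch would run.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Count gc-tracked objects grouped by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def top_growth(before, after, n=5):
    """Return up to n (type name, increase) pairs with the largest positive growth."""
    diff = Counter({name: after[name] - before[name] for name in after})
    return [(name, delta) for name, delta in diff.most_common(n) if delta > 0]


before = count_objects_by_type()
# ... run one training epoch here ...
after = count_objects_by_type()
print(top_growth(before, after))
```

If no type shows sustained growth across epochs, that supports the conclusion that the leak is not made of Python objects.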

tracemalloc

The largest allocations reported are only in the KB range.

No Python-side allocation explains the observed GB-scale RAM growth.
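
For reference, a snapshot comparison of this kind can be sketched as follows. `tracemalloc` only traces memory allocated through Python's own allocators, so host memory that a native library obtains directly from the operating system would not appear in its statistics. The helper name `top_allocation_diffs` and the dummy workload are assumptions for illustration.

```python
import tracemalloc


def top_allocation_diffs(workload, limit=5):
    """Run workload() and return the largest per-line changes in traced memory."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = workload()  # keep the result alive until the second snapshot
    after = tracemalloc.take_snapshot()
    if started_here:
        tracemalloc.stop()
    del result
    # compare_to() sorts by absolute size difference, largest first
    return after.compare_to(before, "lineno")[:limit]


# Dummy workload: about 1 MB of Python-level allocations.
diffs = top_allocation_diffs(lambda: [bytes(1024) for _ in range(1000)])
for stat in diffs:
    print(f"{stat.size_diff / 1024:.1f} KiB  {stat.traceback}")
```

In a real run, `workload` would be one training epoch. Diffs in the KB range next to GB-scale RSS growth point to allocations made outside the Python allocator.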


Most interesting finding

Disabling cuDNN almost eliminates the issue:

torch.backends.cudnn.enabled = False

CPU RAM growth

Minimal LSTM:

  • cuDNN=True: ~0.017 GB / epoch

  • cuDNN=False: ~0.00065 GB / epoch

Seq2Seq models show a similar reduction.

Performance impact

Average epoch time:

Minimal LSTM:

  • cuDNN=True: ~28 s / epoch

  • cuDNN=False: ~61 s / epoch

Slowdown factor: ≈ 2.2× (61 s / 28 s)

For larger models:

  • non-autoregressive Seq2Seq: ~2.0× slower

  • autoregressive Seq2Seq: ~1.4× slower

Because of the performance loss, disabling cuDNN is not a practical solution.
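
Because the two configurations run at different speeds, it can help to normalize the growth by wall-clock time as well as by epoch. The short calculation below uses only the minimal-LSTM figures reported in this post; the variable and function names are illustrative.

```python
SECONDS_PER_HOUR = 3600

# Figures reported in this post for the minimal LSTM.
runs = {
    "cuDNN=True": {"gb_per_epoch": 0.017, "s_per_epoch": 28},
    "cuDNN=False": {"gb_per_epoch": 0.00065, "s_per_epoch": 61},
}


def gb_per_hour(run):
    """Convert per-epoch RAM growth into growth per hour of training."""
    epochs_per_hour = SECONDS_PER_HOUR / run["s_per_epoch"]
    return run["gb_per_epoch"] * epochs_per_hour


on, off = runs["cuDNN=True"], runs["cuDNN=False"]
growth_ratio = on["gb_per_epoch"] / off["gb_per_epoch"]  # per-epoch growth ratio
slowdown = off["s_per_epoch"] / on["s_per_epoch"]        # epoch time ratio

print(f"growth ratio per epoch: {growth_ratio:.1f}x")
print(f"slowdown without cuDNN: {slowdown:.2f}x")
for name, run in runs.items():
    print(f"{name}: {gb_per_hour(run):.3f} GB per hour")
```

With these numbers the per-epoch growth is about 26 times larger with cuDNN enabled, which corresponds to roughly 2.2 GB per hour versus about 0.04 GB per hour.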


Example RAM log

Autoregressive Seq2Seq:

Epoch 1 end: RAM 17.07 GB
Epoch 2 end: RAM 17.17 GB
Epoch 3 end: RAM 17.28 GB
Epoch 4 end: RAM 17.38 GB
Epoch 5 end: RAM 17.48 GB
Epoch 6 end: RAM 17.58 GB
Epoch 7 end: RAM 17.69 GB
Epoch 8 end: RAM 17.79 GB
Epoch 9 end: RAM 17.89 GB

That is roughly 0.1 GB per epoch. Available system RAM decreases accordingly.

GPU allocated and reserved memory remain almost unchanged.
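
The growth rate in this log can be estimated with a least-squares fit. The sketch below parses the nine log lines shown above and uses only the standard library; the helper names are hypothetical.

```python
import re

RAM_LOG = """\
Epoch 1 end: RAM 17.07 GB
Epoch 2 end: RAM 17.17 GB
Epoch 3 end: RAM 17.28 GB
Epoch 4 end: RAM 17.38 GB
Epoch 5 end: RAM 17.48 GB
Epoch 6 end: RAM 17.58 GB
Epoch 7 end: RAM 17.69 GB
Epoch 8 end: RAM 17.79 GB
Epoch 9 end: RAM 17.89 GB
"""


def parse_ram_log(text):
    """Extract (epoch, RAM in GB) pairs from log lines."""
    pattern = re.compile(r"Epoch (\d+) end: RAM ([\d.]+) GB")
    return [(int(epoch), float(gb)) for epoch, gb in pattern.findall(text)]


def slope_gb_per_epoch(points):
    """Ordinary least-squares slope of RAM over epoch number."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


points = parse_ram_log(RAM_LOG)
print(f"{slope_gb_per_epoch(points):.3f} GB per epoch")  # prints 0.103 GB per epoch
```

The increments are nearly identical from epoch to epoch, which matches the observation that the growth is continuous rather than stepwise.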


Cleanup attempts

After each seed:

del model, optimizer

torch.cuda.synchronize()
torch.cuda.empty_cache()
gc.collect()

After each trial:

del train_loader, val_loader

torch.cuda.synchronize()
torch.cuda.empty_cache()
gc.collect()

These cleanup steps do not noticeably affect the RAM growth.


Questions

  1. Has anyone observed similar behavior with:

    • PyTorch 2.7.x

    • CUDA 12.8

    • cuDNN 9.x

    • LSTM/RNN layers?

  2. Is this a known issue in the cuDNN RNN backend?

  3. Are there recommended version combinations (PyTorch / CUDA / cuDNN) that avoid this behavior?

  4. Is there a way to keep cuDNN enabled while avoiding the continuous CPU RAM growth?

Any suggestions or ideas would be greatly appreciated.