Why does torch.set_deterministic(True) reduce backward memory usage?

Hi all,

I recently tried torch.set_deterministic(True) and observed that it can reduce the GPU memory usage of the backward pass!
However, if I use mixed precision, the memory is not reduced by torch.set_deterministic(True).
Can anyone tell me why this happens?
(My torch version: 1.7.0)

*** The code to reproduce my result is as follows.
You can turn use_amp and deterministic on/off ***

import os
import torch
import torchvision
device = 'cuda'
use_amp = True
deterministic = False

# Set deterministic
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8' # Deterministic behavior of torch.addmm. Please refer to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
torch.set_deterministic(deterministic)

# Initialize model
model = torchvision.models.resnet50().to(device)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# Forward
x = torch.rand(16, 3, 800, 800).to(device)
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = model(x).sum()

# Backward (route the loss through the GradScaler so the amp path is complete;
# with enabled=False the scaler calls are no-ops)
optim.zero_grad()
scaler.scale(loss).backward()
scaler.step(optim)
scaler.update()

print(torch.cuda.max_memory_allocated() / 1024 ** 2)

Thx!

The memory usage can differ depending on, e.g., the cuDNN algorithms that are picked: a deterministic algorithm could use less memory, but might also be slower.
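
If you want to see this effect in isolation, here is a minimal sketch that compares peak memory and wall-clock time for a single conv backward with and without torch.set_deterministic. The measure_backward helper name, the conv layer, and the input shape are just placeholders for illustration (not the resnet50 setup from your code), and the actual numbers will depend on your GPU and cuDNN version:

import os
import time
import torch

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

def measure_backward(deterministic):
    # Restrict (or not) the algorithm selection to deterministic implementations
    torch.set_deterministic(deterministic)
    torch.cuda.reset_peak_memory_stats()
    conv = torch.nn.Conv2d(64, 64, 3, padding=1).to('cuda')
    x = torch.rand(16, 64, 224, 224, device='cuda', requires_grad=True)
    torch.cuda.synchronize()
    t0 = time.time()
    conv(x).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2, time.time() - t0

for det in (False, True):
    mem, t = measure_backward(det)
    print('deterministic={}: peak {:.1f} MB, {:.3f} s'.format(det, mem, t))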

Thanks!
But why does set_deterministic seem to have no effect on speed or memory when I use amp?

The kernel selection might pick the same algorithm in both cases if you are using amp.
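
You could double check this on your side with a similar sketch that measures only the peak memory under autocast for both deterministic settings; if the same algorithm is picked, the numbers should be roughly identical. Again, the peak_memory_amp helper and the shapes are just assumptions for illustration, and gradient scaling is omitted because only memory is measured:

import os
import torch

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

def peak_memory_amp(deterministic):
    torch.set_deterministic(deterministic)
    torch.cuda.reset_peak_memory_stats()
    conv = torch.nn.Conv2d(64, 64, 3, padding=1).to('cuda')
    x = torch.rand(16, 64, 224, 224, device='cuda', requires_grad=True)
    # Run forward under autocast so the mixed-precision kernels are used
    with torch.cuda.amp.autocast():
        loss = conv(x).sum()
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

for det in (False, True):
    print('deterministic={}: peak {:.1f} MB'.format(det, peak_memory_amp(det)))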
