Increased memory usage with AMP

Hi. I'm benchmarking automatic mixed precision against the default float32 mode. I'm getting a speed-up, but the memory usage is the same, if not higher. I'm running PyTorch's AMP tutorial with minimal changes:

import torch, time, gc
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Switch between the two modes here
use_amp = True
# use_amp = False

start_time = None

def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print(f'Memory allocated {torch.cuda.memory_allocated() // (1024**2)} MB')
    print(f'Max memory allocated {torch.cuda.max_memory_allocated() // (1024**2)} MB')
    print(f'Memory reserved {torch.cuda.memory_reserved() // (1024**2)} MB')
    print(f'Max memory reserved {torch.cuda.max_memory_reserved() // (1024**2)} MB')

def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()

epochs = 1
num_batches = 50
batch_size = 512 # Try, for example, 128, 256, 513.

in_size = 4096
out_size = 4096
num_layers = 16

# in_size = 8192
# out_size = 8192
# num_layers = 32

data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
message = "Mixed precision:" if use_amp else "Default precision:"
end_timer_and_print(message)

Outputs:

Default precision:
Total execution time = 3.553 sec
Memory allocated 2856 MB
Max memory allocated 3176 MB
Memory reserved 3454 MB
Max memory reserved 3454 MB
# nvidia-smi shows 4900 MB

Mixed precision:
Total execution time = 1.652 sec
Memory allocated 2852 MB
Max memory allocated 3520 MB
Memory reserved 3646 MB
Max memory reserved 3646 MB
# nvidia-smi shows 5092 MB

When I try to saturate the GPU (RTX 6000 with 24 GB memory) using different hyperparameters, default mode works, but AMP goes out of memory:

in_size = 8192
out_size = 8192
num_layers = 32

Outputs:

Default precision:
Total execution time = 29.503 sec
Memory allocated 18002 MB
Max memory allocated 19282 MB
Memory reserved 19284 MB
Max memory reserved 19284 MB
# nvidia-smi shows 20730 MB

# Mixed precision goes out of memory:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.65 GiB total capacity; 22.08 GiB already allocated; 161.44 MiB free; 22.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Torch version: 1.10.0.dev20210630+cu113
CUDA version: 11.3
GPU model: NVIDIA Quadro RTX 6000

Thanks in advance!


Hi, I just tried AMP with PyTorch yesterday on a Pascal GTX 1070.
I just wish to "extend the GPU VRAM" using mixed precision.
Following the tutorial and increasing different parameters, I saw that mixed precision is slower (which seems normal for a Pascal GPU), but the memory usage is also higher on that GPU.
To verify that the torch.cuda.max_memory_allocated() value really represents the memory usage, I looked for parameters for which fp32 finishes the job but mixed precision gives a RuntimeError: CUDA out of memory.
I tried different values for the batch size and found a value of 3860 that gives the memory error only for mixed precision.
In that case the script shows:

Default precision:
Total execution time = 42.469 sec
Max memory used by tensors = 7092684288 bytes

and mixed precision fails with:

File "C:\Users\33661\miniconda3\envs\torch\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 8.00 GiB total capacity; 6.59 GiB already allocated; 0 bytes free; 6.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This was PyTorch 1.10 with CUDA 11.3 and cuDNN 8.2.
I also tried PyTorch 1.7.1 with CUDA 10.1 and cuDNN 7.6, with analogous results.

I would like to know which versions of PyTorch, CUDA and cuDNN (maybe not important?) show a reduction in memory usage with the tutorial script, which is otherwise well done.

Maybe this problem is due to the CUDA version?

I tried some basic convolutional neural networks, and mixed precision always used more GPU VRAM.
I wonder if mixed precision currently does its principal job: allowing bigger networks and batch sizes.
I will try older PyTorch versions and apex.amp to see if these problems remain.

I tested tensorflow.keras on small convolutional nets with MNIST data and always found a memory usage reduction with mixed precision. The speed can decrease for small networks on the GTX 1070, which has no tensor cores.

I find the PyTorch tutorial marvelous and love PyTorch in general, but if I need mixed precision I will surely use another framework.

I got a response on my open issue on GitHub: Increased memory usage with AMP · Issue #61173 · pytorch/pytorch · GitHub
I guess, basically, because the model parameters are still kept in fp32, we don't see a reduction in memory usage with large models. The reduction comes from not keeping the activations in fp32, so maybe try a smaller model with a larger batch size.
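For illustration (my own values, not from the issue), a configuration along these lines makes the activations, rather than the fp32 parameters and gradients, the dominant term, so the AMP run should then peak noticeably lower:

# Hypothetical "activation-heavy" settings for the tutorial script above
batch_size = 8192       # was 512: much more activation memory per layer
in_size = 2048          # was 4096: far fewer fp32 weights and gradients
out_size = 2048
num_layers = 4          # was 16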

As you've pointed out, the memory savings depend on the model architecture and e.g. the ratio of parameters to activations.
This post breaks it down for some common use cases.
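For the tutorial model above, a rough back-of-envelope estimate (my numbers; it ignores biases, optimizer state, workspaces and allocator overhead) shows why the savings are small here:

# Back-of-envelope memory estimate for the 4096-wide, 16-layer tutorial MLP
batch_size, width, num_layers = 512, 4096, 16

params = num_layers * width * width              # ~2.7e8 weights
param_and_grad_bytes = 2 * params * 4            # fp32 weights + fp32 grads, ~2 GB
act_fp32 = num_layers * batch_size * width * 4   # saved activations in fp32, ~128 MB
act_fp16 = act_fp32 // 2                         # ~64 MB under autocast

print(f"params + grads: {param_and_grad_bytes / 2**20:.0f} MB")
print(f"activations fp32: {act_fp32 / 2**20:.0f} MB, fp16: {act_fp16 / 2**20:.0f} MB")

If I'm not mistaken, autocast also creates fp16 copies of the weights for the matmuls, which is part of why the peak can even end up slightly higher in such a parameter-heavy model.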

Hi ptrblck, I've searched many posts about the PyTorch AMP memory problem; as this post says, the memory with AMP doesn't decrease compared to without AMP. You have suggested several APIs and tools, like forward hooks, to check the memory stats. However, I'm still confused: I still can't understand why AMP increases, or at least doesn't decrease, memory. As some people said, AMP still keeps the model parameters in fp32, and that's fine, but as you said the activations often use many times more memory than the model parameters and gradients, and I believe the activation memory should benefit from AMP, yet both in the PyTorch AMP recipe and in my case it doesn't.
In my case, a Transformer model, I tested three AMP-like methods, PyTorch AMP, PyTorch FSDP AMP and DeepSpeed AMP, and the result is that the latter two both decreased memory to nearly 50%; only PyTorch AMP doesn't. Also, I used the PyTorch activation checkpointing API and the memory decreased by nearly 60%, which means in this case the model parameters and gradients take less memory than the activations.
Finally, I think the total memory with AMP should actually decrease by nearly 50%, right?
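For reference, a minimal sketch of the forward-hook memory check mentioned above (net is just a placeholder model; attach the hooks to whatever model is being profiled):

import torch

net = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

def log_mem(name):
    # Print allocated and peak CUDA memory right after this module's forward pass.
    def hook(module, inputs, output):
        alloc = torch.cuda.memory_allocated() / 2**20
        peak = torch.cuda.max_memory_allocated() / 2**20
        print(f"{name}: allocated {alloc:.0f} MB, peak {peak:.0f} MB")
    return hook

for name, module in net.named_modules():
    if name:  # skip the top-level Sequential itself
        module.register_forward_hook(log_mem(name))

with torch.cuda.amp.autocast():
    out = net(torch.randn(512, 4096, device="cuda"))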

The latter ones use bfloat16 parameters directly, if I'm not mistaken, so did you check their implementation and where the savings come from?

Oh yes, DeepSpeed provides three methods to enable mixed precision: fp16, bf16 and amp. It seems that the former two act like NVIDIA Apex O2 mode and memory use is cut to nearly 50%, but the latter one acts the same as torch amp; that is to say, the memory used by both DeepSpeed amp and torch amp doesn't decrease.
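To illustrate the difference being discussed, a minimal sketch with a single Linear layer (this is not the DeepSpeed or Apex implementation, just where each approach stores its weights):

import torch

x = torch.randn(512, 4096, device="cuda")

# torch.cuda.amp style: weights stay in fp32; only the ops inside the
# autocast region run in half precision, so mainly the activations shrink.
model = torch.nn.Linear(4096, 4096).cuda()
with torch.cuda.amp.autocast():
    out = model(x)

# O2-style / "pure" low precision (roughly what the fp16/bf16 modes above do,
# details aside): the parameters themselves are stored in half precision,
# so parameter and gradient memory is roughly halved as well.
model_half = torch.nn.Linear(4096, 4096).cuda().half()
out_half = model_half(x.half())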