Automatic Mixed Precision increases max memory used by tensors

I followed the Automatic Mixed Precision tutorial and found that it does not reduce the network’s memory footprint.

Here are the hyperparams used and the results:

batch_size = 512
in_size = 4096
out_size = 4096
num_layers = 20
num_batches = 100
epochs = 3

=== results ===
Default precision:
Total execution time = 15.974 sec
Max memory used by tensors = 4773790720 bytes

Mixed precision:
Total execution time = 13.834 sec
Max memory used by tensors = 5268703744 bytes

The calculations are on GPU (RTX 3080).
I can’t understand why mixed precision uses more memory than default precision.

Thanks for trying it!
I think we organized the code differently. The Python scripts I tested are:

1. utils.py

import gc
import time

import torch

start_time = None


def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.synchronize()
    start_time = time.time()


def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))


def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()


batch_size = 512
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3

data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()
net = make_model(in_size, out_size, num_layers).to('cuda')
opt = torch.optim.SGD(net.parameters(), lr=0.001)

2. precision_default.py

from utils import *

# ====== Default Precision ======
start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()  
end_timer_and_print("Default precision:")

3. precision_auto_mix.py

from utils import *

use_amp = True

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
end_timer_and_print("Mixed precision:")

I ran the two training scripts, precision_default.py and precision_auto_mix.py, separately and got:

Default precision:
Total execution time = 1.527 sec
Max memory used by tensors = 1367458816 bytes
Mixed precision:
Total execution time = 1.299 sec
Max memory used by tensors = 1434552832 bytes

In my code, there are no intermediate variables, right? I am curious why the max memory of your default-precision run is 3775580672 bytes, which is much larger than mine.

I don’t know whether this warning is the cause of the behavior:

FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Sorry, my earlier message was wrong: I mixed up your parameters with the ones from the tutorial.

I just ran exactly the same scripts as you, and I can at least confirm your results (I got the same warning too). Here are mine:

Default precision:
Total execution time = 1.328 sec
Max memory used by tensors = 1367458816 bytes

Mixed precision:
Total execution time = 1.236 sec
Max memory used by tensors = 1434552832 bytes
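
For reference, the gap between the peaks reported in this thread can be worked out directly from the printed numbers. This is only arithmetic on the figures above (the helper name `peak_delta` is made up for illustration), not a new measurement:

```python
MIB = 1024 ** 2


def peak_delta(default_bytes, amp_bytes):
    """Return (extra bytes, extra MiB, percent increase) of the AMP peak."""
    extra = amp_bytes - default_bytes
    return extra, extra / MIB, 100.0 * extra / default_bytes


# (default precision peak, mixed precision peak) as printed in this thread
runs = {
    "num_layers=20": (4773790720, 5268703744),
    "num_layers=3": (1367458816, 1434552832),
}

for name, (default_bytes, amp_bytes) in runs.items():
    extra, mib, pct = peak_delta(default_bytes, amp_bytes)
    print(f"{name}: +{extra} bytes ({mib:.1f} MiB, {pct:.1f}% higher peak with AMP)")
```

In both configurations the mixed-precision run is faster but its peak allocation is higher, so the effect is not specific to one set of hyperparameters.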

Thank you for your kindness, my friend!

The allocated memory should decrease when using amp, but the max. allocated memory might show a higher peak, e.g. if the transformed parameters are additionally stored in the cache.
It depends on the model, and you should not run OOM using amp while the vanilla training is working.
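
To put a rough number on the "transformed parameters" point, here is a minimal back-of-the-envelope sketch. It assumes that an fp16 copy of every fp32 parameter is kept alive next to the fp32 original at the same time, which is the worst case for the cache. The helper names are hypothetical, and the estimate ignores activations, gradients, and allocator behavior, so it is not expected to reproduce the measured peaks exactly.

```python
FP32_BYTES = 4  # bytes per float32 element
FP16_BYTES = 2  # bytes per float16 element


def linear_param_count(in_features, out_features):
    """Number of elements in one torch.nn.Linear layer (weight + bias)."""
    return in_features * out_features + out_features


def estimate_param_bytes(in_size, out_size, num_layers):
    """Return (fp32 parameter bytes, bytes of a full set of fp16 copies)
    for a stack shaped like make_model(in_size, out_size, num_layers)."""
    hidden = (num_layers - 1) * linear_param_count(in_size, in_size)
    last = linear_param_count(in_size, out_size)
    n_params = hidden + last
    return n_params * FP32_BYTES, n_params * FP16_BYTES


# Hyperparameters from utils.py in this thread
fp32_bytes, fp16_copy_bytes = estimate_param_bytes(4096, 4096, 3)
print("fp32 parameters:", fp32_bytes, "bytes")
print("extra fp16 copies (assumed all cached at once):", fp16_copy_bytes, "bytes")
```

For a model made almost entirely of large Linear layers with a modest batch size, the parameters dominate the memory, so the extra half-precision copies can outweigh what is saved on activations.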

Hi ptrblck. Could you explain, or give an example of, what you mean by “run OOM using amp”? It is not very clear to me.

Sorry, I meant you should not be running out of memory when using mixed-precision training via amp.

Ok, I get it now, thanks!

Hi. I ran the same script, and it does go out of memory with AMP, while it doesn’t with FP32. Comparing the allocated memory, AMP only reduces it by 4 MB (less than 1%).

I started a new topic at Increased memory usage with AMP but I can move it here if required. Thanks in advance.