Why does the PyTorch AMP example show unexpected results?

Hello everyone!

I have an RTX 3080 and I would like to train BERT.

But when I run the AMP example from Automatic Mixed Precision — PyTorch Tutorials 1.9.1+cu102 documentation, I find that the model trains faster without AMP than with it.

Can someone help me understand why?

For large batch sizes AMP is faster, but it uses more GPU memory.

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")

Default precision:
Total execution time = 1.803 sec
Max memory used by tensors = 826428416 bytes

use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")

Mixed precision:
Total execution time = 2.268 sec
Max memory used by tensors = 925018112 bytes

def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*layers).cuda()

batch_size = 128 # Try, for example, 128, 256, 513.
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 10

# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()
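
The start_timer() and end_timer_and_print() helpers come from the same tutorial; roughly, they look like the sketch below (I'm reproducing them from memory, so details may differ, but the torch.cuda.synchronize() calls are the important part — without them, GPU timing is misleading because kernel launches are asynchronous):

import gc
import time

import torch

start_time = None

def start_timer():
    # Reset allocator stats and synchronize so the timer measures only this trial.
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer_and_print(local_msg):
    # Synchronize again so all queued kernels finish before reading the clock.
    torch.cuda.synchronize()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(time.time() - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))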

I also wanted to ask: is there any way to reduce the memory consumption of the BERT model?
For the model I use checkpointing from torch.utils.checkpoint and gradient accumulation for the backward pass (see the sketch below).
I can only use batch size = 32 and get 4-5 iterations per second.
I think the RTX 3080 is a slow card for deep learning (((
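
Roughly, this is what I mean by checkpointing plus accumulation (a simplified sketch, not my exact code; bert_block, loader, opt, and loss_fn are placeholders for my actual model parts):

import torch
from torch.utils.checkpoint import checkpoint

accum_steps = 4  # placeholder accumulation factor

scaler = torch.cuda.amp.GradScaler()
opt.zero_grad()
for step, (input, target) in enumerate(loader):  # `loader` is a placeholder dataloader
    with torch.cuda.amp.autocast():
        # checkpoint() recomputes `bert_block` in backward instead of storing activations
        hidden = checkpoint(bert_block, input)
        loss = loss_fn(hidden, target) / accum_steps  # scale the loss for accumulation
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()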

I don’t know how you’ve checked the memory usage, but the peak memory could be higher if kernels are profiled internally, which might require more workspace.
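
To check what PyTorch itself allocates (as opposed to what nvidia-smi reports, which also includes the CUDA context and the caching allocator's reserved pool), you could use something like this sketch, where train_one_iteration() is a placeholder for one forward/backward step of your model:

torch.cuda.reset_peak_memory_stats()
train_one_iteration()  # placeholder: one forward/backward step of your model
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**20:.1f} MiB")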


Thanks for your reply. I am also looking at nvidia-smi.
Sorry, I have little experience. I have verified that AMP does work, although it does not provide any memory savings. Only checkpointing helps reduce memory consumption somewhat. I have already had to drop more than 40% of the dataset because the sequences are long (64 tokens) and the batch size is 16.

Unfortunately, checkpointing will not help train the transformer layers.

If you can tell me what else can be done besides checkpointing and AMP, I would be very grateful.
I have a feeling that I am doing something very wrong. :sweat_smile:

I have read the Performance Tuning Guide — PyTorch Tutorials 1.9.1+cu102 documentation and Training With Mixed Precision :: NVIDIA Deep Learning Performance Documentation.