Freeing gradient memory after the optimizer step

I am training multiple models sequentially on the same GPU, and I need them to share the parameters after a given number of iterations. To optimize GPU memory consumption, I need to free the gradients of each model at the end of each optimizer iteration. A simple solution is to set all gradients to None manually, i.e.,

for param in model.parameters():
    param.grad = None

Is this a good practice? If not, what are the alternatives?

Thanks in advance

Hi,

Depending on the particular model and training loop, it may or may not improve performance.
Note that a simpler way to do this is via the regular zero grad: model.zero_grad(set_to_none=True).
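
For illustration, a minimal sketch (the model, optimizer and loss below are just placeholders) showing that the manual loop and the zero_grad call end up doing the same thing:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, device=device)
y = torch.randn(4, 1, device=device)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()

# Both of these release the gradient buffers by setting .grad to None:
for param in model.parameters():
    param.grad = None

model.zero_grad(set_to_none=True)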

2 Likes

I think in my current scenario I don't have enough memory if I keep all the gradients. I think, however, that setting the gradients to None will decrease the training speed, because new memory will have to be allocated for the gradients each time after we set them to None. Am I correct? In that case, how much do you think this will cost in terms of time?

Also, here the training loop is very standard and looks like

        for x, y, indices in iterator:
            x = x.to(self.device).type(torch.float32)
            y = y.to(self.device)

            if self.is_binary_classification:
                y = y.type(torch.float32)

            self.optimizer.zero_grad()

            y_pred = self.model(x).squeeze()

            loss_vec = self.criterion(y_pred, y)
            if weights is not None:
                weights = weights.to(self.device)
                loss = (loss_vec.T @ weights[indices]) / loss_vec.size(0)
            else:
                loss = loss_vec.mean()
            loss.backward()

Note: the criterion is initialized with reduction="none".

Setting gradients to None will not necessarily slow down training, as we have some optimizations in place to avoid re-allocating the gradient buffer and just re-use the intermediate buffers from the backward pass.
But that does not work all the time, depending on many factors.
So you will have to try it out for your model to know the exact impact.
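
If you want to measure it, a rough timing sketch along these lines (with a placeholder model and random data, untested on your setup) should tell you whether set_to_none makes a difference in your case:

import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)
criterion = nn.CrossEntropyLoss()

def run(set_to_none, iters=100):
    # Time forward + backward with the chosen gradient-clearing strategy
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model.zero_grad(set_to_none=set_to_none)
        loss = criterion(model(x), y)
        loss.backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print("set_to_none=True :", run(True))
print("set_to_none=False:", run(False))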

1 Like

Hello @albanD,

I have another, closely related question. Normally, when freeing the gradients, I expect the used memory to be freed and to go back to its initial state before running backpropagation. However, this is not what I observe, so I imagine the memory is still allocated. What do you think is the reason behind this?

How do you check the memory?
PyTorch uses a custom allocator to speed up GPU allocations. So it is expected that the memory reported by nvidia-smi doesn't go down. You can use torch.cuda.memory_allocated() to see the memory that is actually used by Tensors.
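
For example, something like this (with a placeholder model) shows the difference between what Tensors actually use and what the allocator keeps reserved:

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(4096, 4096).to(device)
x = torch.randn(64, 4096, device=device)

model(x).sum().backward()
print(torch.cuda.memory_allocated())  # memory actually used by Tensors (params, grads, x)
print(torch.cuda.memory_reserved())   # memory held by the caching allocator, closer to what nvidia-smi reports

model.zero_grad(set_to_none=True)
print(torch.cuda.memory_allocated())  # goes down: the gradient Tensors are freed
print(torch.cuda.memory_reserved())   # typically unchanged: the blocks stay cached for re-use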

Indeed, I use nvidia-smi to get the memory consumption. My guess is that the memory consumption shown by nvidia-smi is the one that matters, in the sense that I don't want it to exceed the memory capacity. What happens now is that freeing the gradients by setting them to None doesn't solve the issue I had, since the memory stays allocated. Is there a way to force freeing the memory?

Just an update on this: I guess the proper way is to free the gradients through the optimizer instead of the model, so it should be
optimizer.zero_grad(set_to_none=True); otherwise the optimizer will keep a reference to the gradients, and thus they will keep taking up memory.

Well, that memory is available to allocate more Tensors (even though other processes can't use it). So if you're only using the GPU for PyTorch, then it's doing what you want.

Zeroing from the optimizer or from the model does the same thing.
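
A quick way to check this with any small model, assuming the optimizer was built from model.parameters():

import torch
import torch.nn as nn

model = nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(2, 8)).sum().backward()
optimizer.zero_grad(set_to_none=True)                    # via the optimizer
print(all(p.grad is None for p in model.parameters()))   # True

model(torch.randn(2, 8)).sum().backward()
model.zero_grad(set_to_none=True)                        # via the module
print(all(p.grad is None for p in model.parameters()))   # True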

1 Like

Hello @albanD,

Do you know how to free the gradients in the case of an LSTM? For some reason, when I use an LSTM and free the gradients by setting them to 0, a part of the memory is still used. What do you think is the reason behind this?

You mean setting them to None?

Not sure why LSTM would be different.

Yeah, I mean setting them to None. The following code outputs

905216
2793472
1888256

Do you think it has something to do with an LSTM memory leak?

Here is the code:

import torch
import torch.nn as nn
import torch.optim as optim 
import string


class MyLSTM(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, output_size, n_layers):
        super(MyLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.encoder = nn.Embedding(input_size, embed_size)
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden_size,
                           num_layers=n_layers,
                           batch_first=True,
                           dropout=0.5)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input_):
        encoded = self.encoder(input_)
        output = self.rnn(encoded)
        output = self.decoder(output)
        output = output.permute(0, 2, 1)  # change dimension to (B, C, T)
        return output


device = torch.device("cuda")

model =\
    NextCharacterLSTM(
        input_size=100,
        embed_size=8,
        hidden_size=256,
        output_size=100,
        n_layers=1
    ).to(device)

criterion = nn.CrossEntropyLoss().to(device)

def fit_epoch(model, weights=None):

    model.train()

    for _ in range(10):

        x = torch.zeros(32, 80).type(torch.long)
        y = torch.zeros(32, 100, 80)

        x = x.to(device)
        y = y.to(device)

        model.zero_grad()

        y_pred = model(x).squeeze()

        loss = criterion(y_pred, y)

        loss.backward()


print(torch.cuda.memory_allocated())
fit_epoch(model)
print(torch.cuda.memory_allocated())
model.zero_grad(set_to_none=True)
print(torch.cuda.memory_allocated())

I am not sure your code is correct; I had to modify a couple of things to make it work:

import torch
import torch.nn as nn
import torch.optim as optim 
import string


class MyLSTM(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, output_size, n_layers):
        super(MyLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.encoder = nn.Embedding(input_size, embed_size)
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden_size,
                           num_layers=n_layers,
                           batch_first=True,
                           dropout=0.5)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input_):
        encoded = self.encoder(input_)
        output, _ = self.rnn(encoded)
        output = self.decoder(output)
        output = output.permute(0, 2, 1)  # change dimension to (B, C, T)
        return output


device = torch.device("cuda")

model =\
    MyLSTM(
        input_size=100,
        embed_size=8,
        hidden_size=256,
        output_size=100,
        n_layers=1
    ).to(device)

criterion = nn.CrossEntropyLoss().to(device)

def fit_epoch(model, weights=None):

    model.train()

    for _ in range(10):

        x = torch.zeros(32, 80, dtype=torch.long)
        y = torch.zeros(32, 80, dtype=torch.long)

        x = x.to(device)
        y = y.to(device)

        model.zero_grad()

        y_pred = model(x).squeeze()

        loss = criterion(y_pred, y)

        loss.backward()


print(torch.cuda.memory_allocated())
fit_epoch(model)
print(torch.cuda.memory_allocated())
model.zero_grad(set_to_none=True)
print(torch.cuda.memory_allocated())

But then running this on Colab gives me
108730368
110368768
108730368

Which is what we expect, right?

You are right, I had some typos in the previous code (I edited it while writing the answer, I am sorry for that). But still, the problem is that if you run this code a first time it will give:

1196032
3858432
2179072

Then if you run it a second time it will give:

2179072
3817472
2179072

But I think this is not related to gradients at all: I observe the same thing when running with torch.no_grad(). Apparently, some part of the memory is allocated when doing the forward pass. I guess this is memory allocated by the hidden state in the case of an RNN. What do you think?

Here is a link if you want to try it directly.
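
For reference, the torch.no_grad() check I mention is just something like this, re-using the model and device from the code above:

print(torch.cuda.memory_allocated())
with torch.no_grad():
    model(torch.zeros(32, 80, dtype=torch.long, device=device))
print(torch.cuda.memory_allocated())  # in my runs this is still higher than before the forward pass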

I am not super familiar with how LSTMs work in detail. But it is indeed possible that the default hidden state is lazily initialized the first time it is used.
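
If you want to check that hypothesis, a rough sketch (untested, with sizes roughly matching your model) would be to pass an explicit initial state to the LSTM and compare the allocated memory with and without it:

import torch
import torch.nn as nn

device = torch.device("cuda")
rnn = nn.LSTM(input_size=8, hidden_size=256, num_layers=1, batch_first=True).to(device)

x = torch.randn(32, 80, 8, device=device)
# Explicit initial state, shape (num_layers, batch, hidden_size)
h0 = torch.zeros(1, 32, 256, device=device)
c0 = torch.zeros(1, 32, 256, device=device)

print(torch.cuda.memory_allocated())
with torch.no_grad():
    rnn(x, (h0, c0))  # explicit state instead of the default zeros created inside the module
print(torch.cuda.memory_allocated())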

1 Like