VAE-GAN: INPLACE operation error

pankratozzi · February 11, 2022, 9:33am

I’m new to PyTorch. I’m implementing a simple VAE-GAN model, based on this great notebook: https://www.kaggle.com/carloalbertobarbano/faceswap-trump-in-a-cage/notebook. While training the autoencoder (generator) model I always get the same error:

“RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048, 1024]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).”

Here is the part of my code that raises this error (the single-step training functions):

def train_discriminator(D, criterion, optimizer, real, fake):
    optimizer.zero_grad()
    
    with torch.set_grad_enabled(True):
        pred_real = D(real)
        pred_fake = D(fake)
        
        loss_real = criterion(pred_real, torch.ones(real.size(0), 1).to(device))
        loss_fake = criterion(pred_fake, torch.zeros(fake.size(0), 1).to(device))
        
        loss = loss_real + loss_fake
        loss.backward(retain_graph=True)

        optimizer.step()
    
    return loss.item()

def train_generator(D, criterion_G, criterion_D, optimizer, x, fake, mu, logvar):
    optimizer.zero_grad()

    with torch.set_grad_enabled(True):
        prediction = D(fake)
        #before = list(D.parameters())[0].clone()

        target = torch.ones(x.size(0), 1).to(device)
        d_loss = criterion_D(prediction, target)
        d_loss.backward(retain_graph=True)  # EXCEPTION raises here
        d_loss = d_loss.item()
        
        if criterion_G is not None:
            g_loss = criterion_G(fake, x, mu, logvar)
            grads = torch.ones_like(g_loss)
            g_loss.backward(grads, retain_graph=True)
            d_loss += g_loss.mean().item()
    
        optimizer.step()
    #after = list(D.parameters())[0].clone()
    #print(torch.equal(before.data, after.data))

    return d_loss

Training loop:

model.train()
D_A.train()
D_B.train()

loss_hist = {'D': list(), 'G': list()}

for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')

    g_loss = 0
    d_loss = 0

    for i, batch in enumerate(tqdm(dataloader, leave=False), 1):
        x_A, x_B = batch
        
        fake_A, mu_A, logvar_A = model(x_A)
        fake_B, mu_B, logvar_B = model(x_B, select='B')
        
        d_loss += train_discriminator(D_A, criterion_D, optimizerD_A, x_A, fake_A)
        d_loss += train_discriminator(D_B, criterion_D, optimizerD_B, x_B, fake_B)

        g_loss += train_generator(D_A, criterion, criterion_D, optimizerA, x_A, fake_A, mu_A, logvar_A)   
        g_loss += train_generator(D_B, criterion, criterion_D, optimizerB, x_B, fake_B, mu_B, logvar_B)  # EXCEPTION raises here, inplace op error

        fake_A2, mu_A2, logvar_A2 = model(x_A, select='B')
        fake_B2, mu_B2, logvar_B2 = model(x_B, select='A')

        g_loss += train_generator(D_B, None, criterion_D, optimizerB, x_A, fake_A2, mu_A2, logvar_A2)
        g_loss += train_generator(D_A, None, criterion_D, optimizerA, x_B, fake_B2, mu_B2, logvar_B2)

    d_loss /= i*2
    g_loss /= i*4

    loss_hist['D'].append(d_loss)
    loss_hist['G'].append(g_loss)

    print(f'Epoch g_loss: {g_loss:.4f}, d_loss: {d_loss:.4f}')
    print(50*'-')

    early(d_loss+g_loss, epoch=epoch, model=model, D_A=D_A, D_B=D_B, optimizerA=optimizerA, optimizerB=optimizerB, 
          optimizerD_A=optimizerD_A, optimizerD_B=optimizerD_B, )
    if early.early_stop:
        print(f'Train loss did not improve for {early.patience} epochs. Training stopped.')
        model, D_A, D_B, optimizerA, optimizerB, optimizerD_A, optimizerD_B, _, early = load_model(PATH)
        break

I suppose the problem might be that I use the computation graph multiple times. I’ve tried almost everything (setting retain_graph=False, calling .clone() on different tensors, detaching different tensors, etc.), but I still can’t figure out where this in-place operation takes place or how to avoid it.
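One plausible reading of this error, offered as a sketch rather than a diagnosis of the notebook: optimizer.step() updates parameters in place, so calling backward() afterwards on a graph that was built before the step can trip autograd's version check. The tiny network, optimizer, and variable names below are hypothetical stand-ins, not the model from this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny two-layer network and a single optimizer.
net = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(2, 4)
out = net(x)  # one forward pass builds one graph

# First backward + step works, but step() modifies the weights in place.
out.sum().backward(retain_graph=True)
opt.step()

# Second backward through the SAME graph: the weight saved for the second
# Linear layer now has a newer version than the one autograd recorded.
try:
    out.mean().backward()
    stale_graph_error = None
except RuntimeError as err:
    stale_graph_error = str(err)

print(stale_graph_error)
```

If optimizerA and optimizerB share some parameters such as the encoder (an assumption; the thread does not show how they were built), the same situation would arise when the graph of fake_B is used after optimizerA.step().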

For interested readers, here is the full code (if you have trouble opening it, I can upload it anywhere you prefer): Colab notebook

I’m stuck and confused, and I would appreciate any suggestions. This is my Everest for now; please help me conquer it.

AlphaBetaGamma96 · February 12, 2022, 12:32pm

I don’t have any experience with VAE-GANs, but after a brief look through your code I can see numerous uses of in-place operators that are probably causing the issue.

Within PyTorch, using in-place operators can break the computational graph, which results in autograd failing to compute your gradients. In-place operators in PyTorch are denoted with a trailing _; for example, mul does elementwise multiplication, whereas mul_ does elementwise multiplication in place. So avoid those commands.

Other in-place operators are += and /=; you’ve used them a few times throughout your training loop, discriminator, and generator. So you need to replace those with out-of-place operations instead, for example,

d_loss += train_discriminator(D_A, criterion_D, optimizerD_A, x_A, fake_A)

should be replaced with

d_loss_minibatch = train_discriminator(D_A, criterion_D, optimizerD_A, x_A, fake_A)
d_loss = d_loss + d_loss_minibatch

Try replacing all in-place operators with the out-of-place equivalent and see if your error goes away!
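As a hedged aside, += only acts in place when the left-hand side is a tensor. When it accumulates a Python float (for example the result of .item()), it simply rebinds a name and autograd never sees it. The sketch below uses the semi-private _version counter purely for illustration; the variable names are made up.

```python
import torch

w = torch.ones(3, requires_grad=True)
y = w * 2                    # non-leaf tensor tracked by autograd

version_before = y._version
y += 1                       # in-place on a tensor: bumps its version counter
version_after = y._version

running = 0.0
running += y.sum().item()    # plain Python float arithmetic, outside autograd

print(version_before, version_after, running)
```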

pankratozzi · February 12, 2022, 1:09pm

Thank you so much for your quick reply!
I had already thought of that and tried replacing all such += and /= operations, as well as add_, mul_ and so on. I still get the error.
In the training loop and everywhere else I accumulate Python floats taken from tensors (g_loss.item() …)
But I tried replacing them everywhere anyway and got the same error.
By now I realize that the problem is in the mu and logvar tensors, but I can’t figure out where exactly. I’ve changed the loss function (which uses mu and logvar)

class KLDLoss(nn.Module):
    def forward(self, mu, logvar):
        return -0.5 * torch.sum(1 + logvar.clone() - torch.pow(mu, 2) - torch.exp(logvar))

so that there would be no in-place operations, and finally a few training iterations finished without error! But after I restarted the GPU environment in Colab, for some reason I hit the same error again. So the error is unstable!
The error went away completely when I modified the training loop like this (recomputing the needed tensors with the updated model):

...

    for i, batch in enumerate(tqdm(dataloader, leave=False), 1):
        x_A, x_B = batch
        
        fake_A, mu_A, logvar_A = model(x_A)
        fake_B, mu_B, logvar_B = model(x_B, select='B')
        
        d_loss += train_discriminator(D_A, criterion_D, optimizerD_A, x_A, fake_A)
        d_loss += train_discriminator(D_B, criterion_D, optimizerD_B, x_B, fake_B)

        g_loss += train_generator(D_A, criterion, criterion_D, optimizerA, x_A, fake_A, mu_A, logvar_A)
        fake_B, mu_B, logvar_B = model(x_B, select='B') ## mu and logvar raises inplace error otherwise
        g_loss += train_generator(D_B, criterion, criterion_D, optimizerB, x_B, fake_B, mu_B, logvar_B)

        fake_A2, mu_A2, logvar_A2 = model(x_A, select='B')
        fake_B2, mu_B2, logvar_B2 = model(x_B, select='A')

        g_loss += train_generator(D_B, None, criterion_D, optimizerB, x_A, fake_A2, mu_A2, logvar_A2)
        fake_B2, mu_B2, logvar_B2 = model(x_B, select='A') ##
        g_loss += train_generator(D_A, None, criterion_D, optimizerA, x_B, fake_B2, mu_B2, logvar_B2)
 ...

But it feels like this is not actually the correct approach, and it would be better to find some other solution.
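For comparison, here is a minimal sketch of a common GAN update pattern: detach the fake batch for the discriminator step, and run a fresh forward pass right before each generator step so that every backward() sees up-to-date weights. All modules, sizes, and helper names (d_step, g_step) are hypothetical, and the shared-encoder optimizer setup is an assumption about the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny stand-ins: shared encoder, two decoders, two discriminators.
enc, dec_A, dec_B = nn.Linear(8, 4), nn.Linear(4, 8), nn.Linear(4, 8)
D_A = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
D_B = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
bce = nn.BCELoss()

# Assumption: both generator optimizers include the shared encoder.
opt_A = torch.optim.Adam(list(enc.parameters()) + list(dec_A.parameters()), lr=1e-3)
opt_B = torch.optim.Adam(list(enc.parameters()) + list(dec_B.parameters()), lr=1e-3)
opt_D_A = torch.optim.Adam(D_A.parameters(), lr=1e-3)
opt_D_B = torch.optim.Adam(D_B.parameters(), lr=1e-3)

def d_step(D, opt, real, fake):
    opt.zero_grad()
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(fake.size(0), 1)
    # detach(): the discriminator update must not reach into the generator graph
    loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    loss.backward()
    opt.step()
    return loss.item()

def g_step(D, dec, opt, x):
    opt.zero_grad()
    fake = dec(enc(x))  # fresh forward pass with the current weights
    loss = bce(D(fake), torch.ones(x.size(0), 1))
    loss.backward()     # no retain_graph needed
    opt.step()
    return loss.item()

x_A, x_B = torch.rand(2, 8), torch.rand(2, 8)
losses = [
    d_step(D_A, opt_D_A, x_A, dec_A(enc(x_A))),
    d_step(D_B, opt_D_B, x_B, dec_B(enc(x_B))),
    g_step(D_A, dec_A, opt_A, x_A),
    g_step(D_B, dec_B, opt_B, x_B),  # runs after opt_A.step() without error
]
print(losses)
```

Under these assumptions, recomputing the forward pass before each generator step is not a hack: it is what keeps each graph consistent with the weights it is differentiated against.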

AlphaBetaGamma96 · February 12, 2022, 2:06pm

Three questions about your KLDLoss class:

why are you calling .clone() within the forward? It seems unnecessary.
You need to define the __init__ method as well when constructing your class
When you’ve constructed the KLDLoss class you need to call an instance of it, i.e.

kld_loss = KLDLoss() #create instance 
result = kld_loss(mu, logvar) #use the instance as your function

Also, still go through your code and remove all += and /= operations (as well as all other in-place operations), as they will be a problem eventually.

pankratozzi · February 12, 2022, 4:19pm

Again, thanks for the reply!

why are you calling .clone() within the forward? It seems unnecessary.

You are right, it is absolutely unnecessary. I just went crazy trying to prevent every possible in-place operation, even the impossible ones.

You need to define the __init__ method as well when constructing your class

I thought that the KLDLoss class was using the default nn.Module __init__() method, but done!

When you’ve constructed the KLDLoss class you need to call an instance of it, …

I’m using it in this way.

reconstruction_loss = torch.nn.functional.binary_cross_entropy
kld_loss = KLDLoss()
criterion = lambda y, x, mu, logvar: reconstruction_loss(y, x, reduction='sum') + kld_loss(mu, logvar)
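As a sanity check on the KL term, the sketch below compares the closed-form expression from KLDLoss with torch.distributions.kl_divergence for a diagonal Gaussian against a standard normal. The random mu and logvar tensors are made-up inputs. The class here deliberately has no __init__ of its own; in Python a subclass that defines none inherits nn.Module's, so the instance still constructs and calls normally.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

class KLDLoss(nn.Module):
    # No __init__ defined: nn.Module.__init__ is inherited and runs as usual.
    def forward(self, mu, logvar):
        return -0.5 * torch.sum(1 + logvar - torch.pow(mu, 2) - torch.exp(logvar))

mu = torch.randn(5, 3)       # made-up latent means
logvar = torch.randn(5, 3)   # made-up latent log-variances

closed_form = KLDLoss()(mu, logvar)

# Reference value: KL( N(mu, sigma^2) || N(0, 1) ), summed over all elements.
posterior = Normal(mu, torch.exp(0.5 * logvar))
prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
reference = kl_divergence(posterior, prior).sum()

print(closed_form.item(), reference.item())
```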

All +=, /=, etc. ops have been removed.

Maybe there’s a problem in this part of the model itself.

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        sample = mu + (eps * std) 
        return sample

    def forward(self, x, select='A'):
        x = self.encoder(x)
  
        mu, logvar = self.mean(x), self.std(x)
        z = self.reparameterize(mu, logvar)

        if select == 'A':
            y = self.decoder_A(z)
        else:
            y = self.decoder_B(z)
        
        return y, mu, logvar

where self.mean and self.std are:

self.mean = nn.Linear(2048, 1024)
self.std = nn.Linear(2048, 1024)

respectively.

AlphaBetaGamma96 · February 12, 2022, 5:02pm

When creating an nn.Module you need to initialise the parent class in order for it to work! That’s what __init__ does.

Ok, so given the error above, it seems that your error is probably coming from self.mean and self.std, as they’re the same shape as [torch.cuda.FloatTensor [2048, 1024]].

Did you run your code with torch.autograd.set_detect_anomaly(True)? That will tell you where it’s crashing.
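A minimal sketch of what anomaly detection looks like in use, on a deliberately broken toy example rather than the thread's model. Inside the context manager, the failing backward() is accompanied by a traceback of the forward operation whose saved tensor was later changed. Note that it points at the operation that failed to compute its gradient, not necessarily at the line that performed the in-place write.

```python
import torch

a = torch.ones(3, requires_grad=True)

with torch.autograd.detect_anomaly():
    b = a.exp()   # exp() saves its output for the backward pass
    b.add_(1)     # in-place edit of that saved output
    try:
        b.sum().backward()
        anomaly_error = None
    except RuntimeError as err:
        anomaly_error = str(err)

print(anomaly_error)
```

Anomaly mode slows training down noticeably, so it is usually switched on only while debugging.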

pankratozzi · February 12, 2022, 5:51pm

Yes, I ran the cell with anomaly detection, but it just sends me back to the train_generator function:

---> 27 d_loss.backward(retain_graph=True)

The error then:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2048, 1024]], which is output 0 of AsStridedBackward0, is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The last sentence looks like they knew it would be a tough situation.

pankratozzi · February 12, 2022, 6:05pm

This issue looks similar to mine:
inplace error in GAN
But I don’t have enough understanding to extrapolate this approach over my situation…

pankratozzi · February 13, 2022, 8:25am

I’m really sorry for disturbing you all the time, but I thought it might be interesting for you:
I’m not completely sure this is the correct approach in general, but the following changes to the code eliminate the in-place op error:

def switch_params(model, on=True):
    for param in model.parameters():
        param.requires_grad = on

...
def train_generator(D, criterion_G, criterion_D, optimizer, x, fake, mu, logvar):
    optimizer.zero_grad()

    with torch.set_grad_enabled(True):
        prediction = D(fake.detach())

        target = torch.ones(x.size(0), 1).to(device)
        d_loss = criterion_D(prediction, target)
        if criterion_G is not None:
            g_loss = criterion_G(fake, x, mu, logvar)
            total_loss = d_loss + g_loss.mean()
        else:
            total_loss = d_loss
        total_loss.backward(retain_graph=True)
    
        optimizer.step()

    return total_loss.item()

# the training loop

model.train()
D_A.train()
D_B.train()

loss_hist = {'D': list(), 'G': list()}

for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')

    g_loss = 0
    d_loss = 0

    for i, batch in enumerate(tqdm(dataloader, leave=False), 1):
        x_A, x_B = batch
        switch_params(model, False)
        fake_A, mu_A, logvar_A = model(x_A)
        fake_B, mu_B, logvar_B = model(x_B, select='B')

        switch_params(model, True)
        d_loss += train_discriminator(D_A, criterion_D, optimizerD_A, x_A, fake_A)
        d_loss += train_discriminator(D_B, criterion_D, optimizerD_B, x_B, fake_B)
        
        g_loss += train_generator(D_A, criterion, criterion_D, optimizerA, x_A, fake_A, mu_A, logvar_A)
        g_loss += train_generator(D_B, criterion, criterion_D, optimizerB, x_B, fake_B, mu_B, logvar_B)

        switch_params(model, False)
        fake_A2, mu_A2, logvar_A2 = model(x_A, select='B')
        fake_B2, mu_B2, logvar_B2 = model(x_B, select='A')

        switch_params(model, True)
        g_loss += train_generator(D_B, None, criterion_D, optimizerB, x_A, fake_A2, mu_A2, logvar_A2)
        g_loss += train_generator(D_A, None, criterion_D, optimizerA, x_B, fake_B2, mu_B2, logvar_B2)

    d_loss /= i*2
    g_loss /= i*4

    loss_hist['D'].append(d_loss)
    loss_hist['G'].append(g_loss)

    print(f'Epoch g_loss: {g_loss:.4f}, d_loss: {d_loss:.4f}')
    print(50*'-')
...

There is a new problem with almost no loss reduction during training, but that is really a different issue.
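One possible explanation for the stalled loss, sketched on hypothetical toy modules: if the parameters have requires_grad=False while the forward pass runs, no graph is recorded for that pass, so turning requires_grad back on afterwards does not give the generator any gradients and its optimizer step has nothing to apply.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def switch_params(model, on=True):
    for param in model.parameters():
        param.requires_grad = on

gen = nn.Linear(4, 4)      # hypothetical stand-in for the generator
critic = nn.Linear(4, 1)   # hypothetical stand-in for a discriminator
x = torch.randn(2, 4)

switch_params(gen, False)
fake = gen(x)              # forward pass is not recorded for gen's parameters
switch_params(gen, True)   # too late: the graph for `fake` already lacks them

loss = critic(fake).mean()
loss.backward()

gen_grads = [p.grad for p in gen.parameters()]
print(fake.requires_grad, gen_grads)
```

The same would apply to a forward pass run under torch.no_grad(): its outputs are fine as inputs for a discriminator update, but they cannot carry gradients back to the generator.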

AlphaBetaGamma96 · February 13, 2022, 9:07pm

If I had to guess, it’s the switch_params command. If you want to run a section of code without calculating gradients, run it within the torch.no_grad context manager. So, for example, change

        switch_params(model, False)
        fake_A2, mu_A2, logvar_A2 = model(x_A, select='B')
        fake_B2, mu_B2, logvar_B2 = model(x_B, select='A')

to

        with torch.no_grad():
          fake_A2, mu_A2, logvar_A2 = model(x_A, select='B')
          fake_B2, mu_B2, logvar_B2 = model(x_B, select='A')

and also remove the += inplace operators!