DDP on 8 GPUs works much worse than on a single GPU

So I read this thread and some others, googled around, etc.

My model is a sequence-to-sequence model with variable-length input (Tacotron 2).

This is what the loss looks like:

Any suggestions? Any way to debug it?

Also, if I remove all batchnorm layers from the model, it raises an unused-variable error. How does that work?

Are you using any reference implementation as your code base or did you write the complete code yourself?
Could you compare your implementation to this one written by NVIDIA?

I'm using a fork of this repo

(it looks like the same one as the repo you linked),
but I have a lot of changes.

In that case I would focus on the error (about the unused variable) by reverting the changes and making sure your model still trains.

I assume the error you mentioned claims that no parameters requiring gradients were found during the backward pass.
If so, it's usually a sign that all parameters were frozen or that the computation graph was detached at some point.
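To make this concrete, here is a minimal, self-contained sketch (the `TwoBranch` module and its layer names are made up for illustration) showing what "unused" means from autograd's point of view: a parameter that never participates in `forward` receives no gradient, which is exactly what DDP's reducer complains about unless `find_unused_parameters=True` is passed to `DistributedDataParallel`:

```python
import torch
import torch.nn as nn

class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered, but never called in forward

    def forward(self, x):
        return self.used(x)

model = TwoBranch()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# `unused` never entered the graph, so its gradients stay None.
# DDP waits for a gradient from every registered parameter, which is why
# it raises an error for modules like this (or needs find_unused_parameters=True).
print(model.used.weight.grad is None)    # False
print(model.unused.weight.grad is None)  # True
```

If commenting out batchnorm leaves some registered submodule out of the forward pass, you would hit exactly this situation.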

So, this problem happens when I comment out the batchnorm layers.
When I put them back, it runs (but still worse than on a single GPU).

Another observation: I upgraded torch from 1.2 to 1.3.1 and now I get an OOM error on the same setup.

If the original code base works and you see this unwanted behavior, I would still recommend triaging the bug by removing your changed parts.
The error points to a (partly) frozen model. I’m not familiar with the use case, so it might be a red herring. Anyway, it might be a good starter for debugging.

We are aware of the higher memory usage due to some functionality added to a tensor method and are thinking about different approaches to fix it.

The biggest difference between my model and NVIDIA's is this modification.

Without gradient balancing the loss curve is much better (red).

Does it matter whether the loss is calculated inside `forward` or outside?
Also, is it a good idea in general to balance the loss value like this?
Tacotron on its own has 2 losses.

Do you think it should be balanced too?
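For reference, Tacotron 2's two losses are a mel-spectrogram regression loss and a stop-token (gate) classification loss, and the NVIDIA implementation combines them as a plain sum. A minimal sketch with made-up tensor shapes (the weight `gate_weight` is a hypothetical knob, not something from the repo):

```python
import torch
import torch.nn as nn

# Fake outputs/targets just to show the shape of the computation:
# (batch, n_mels, frames) for the spectrogram, (batch, frames) for the gate.
mel_out = torch.randn(2, 80, 100, requires_grad=True)
mel_target = torch.randn(2, 80, 100)
gate_out = torch.randn(2, 100, requires_grad=True)
gate_target = (torch.rand(2, 100) > 0.5).float()

mel_loss = nn.MSELoss()(mel_out, mel_target)
gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target)

# A fixed weight (1.0 here, i.e. a plain sum) keeps the gradient scale
# identical across replicas. Rescaling the loss by a data-dependent factor
# makes each DDP replica produce gradients on a different scale, which can
# hurt after the gradients are averaged across GPUs.
gate_weight = 1.0
loss = mel_loss + gate_weight * gate_loss
loss.backward()
```

Whether the loss is computed inside `forward` or outside doesn't change autograd, but under DDP the loss is usually computed outside the wrapped module, on each replica's own outputs.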

About the batchnorm thing: I don't understand how batchnorm affects the parameters.
I mean, it only changes existing variables; it doesn't freeze layers or anything.
It works with the batchnorm layers in place and raises the error without them.

Maybe it's because without batchnorm I have only one layer in the module?
Or does that not matter?

I made another thread about the memory usage (it's not only a distributed thing) with an example; I hope it helps.

Are there any PyTorch options to debug this situation?
I made another test, and the gap seems to be really bad.

Not sure this is the case here, but in my case I was using autocast and GradScaler. I had both set to enabled=False. According to the docs this should mean they have no effect, which was in fact the case with a single GPU and with DP.

However, with DDP I found that introducing these increased variance in the training and validation loss significantly, deteriorating model accuracy overall. According to the docs autocast and GradScaler shouldn’t adversely affect DDP, but it did just that in my case. Not sure why, but I assume it has to do with gradient synchronization in DDP.
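For anyone comparing against the same setup, this is the pattern being described: per the docs, `autocast(enabled=False)` and `GradScaler(enabled=False)` are supposed to be no-ops, so the code below should behave exactly like a plain FP32 training step (a minimal single-process sketch, not the DDP run from the report):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=False)  # documented to make scale/step/update no-ops

x, y = torch.randn(8, 4), torch.randn(8, 1)

with autocast(enabled=False):  # documented no-op: runs in default precision
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # with enabled=False, scale() returns loss unchanged
scaler.step(opt)               # just calls opt.step()
scaler.update()
```

If DDP behaves differently with this wrapping than without it, that on its own is worth isolating, since the disabled scaler should not touch the gradients that DDP synchronizes.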