DDP on 8 GPUs works much worse than on a single GPU

So I read this thread and some others, googled around, etc.

My model is a sequence-to-sequence model with variable-length input (Tacotron 2).

This is what the loss looks like:

Any suggestions? Any way to debug it?

Also, if I remove all batchnorm layers from the model, it raises an unused-variable error. How does that work?

Are you using any reference implementation as your code base or did you write the complete code yourself?
Could you compare your implementation to this one written by NVIDIA?

I'm using a fork of this repo

(it looks like the same one as the repo you linked),
but I have a lot of changes.

In that case I would focus on the error (about the unused variable) by reverting the changes and making sure your model still trains.

I assume the error you mentioned claims that no parameters requiring gradients were found during the backward pass.
If so, it's usually a sign that all parameters were frozen or that the computation graph was detached at some point.
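To make this concrete, here is a minimal, self-contained sketch (the `TwoBranch` module and its layer names are made up for illustration) showing what "unused" means from autograd's point of view: a parameter that never participates in `forward` receives no gradient, which is exactly what DDP's reducer complains about unless `find_unused_parameters=True` is passed to `DistributedDataParallel`:

```python
import torch
import torch.nn as nn

class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered, but never called in forward

    def forward(self, x):
        return self.used(x)

model = TwoBranch()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# `unused` never entered the graph, so its gradients stay None.
# DDP waits for a gradient from every registered parameter, which is why
# it raises an error for modules like this (or needs find_unused_parameters=True).
print(model.used.weight.grad is None)    # False
print(model.unused.weight.grad is None)  # True
```

If commenting out batchnorm leaves some registered submodule out of the forward pass, you would hit exactly this situation.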

So, this problem happens when I comment out the batchnorm layers.
When I put them back, it runs (but still worse than on a single GPU).

Another observation: I upgraded torch from 1.2 to 1.3.1 and now I get an OOM error on the same setup.

If the original code base works and you see this unwanted behavior, I would still recommend triaging the bug by removing your changed parts.
The error points to a (partly) frozen model. I’m not familiar with the use case, so it might be a red herring. Anyway, it might be a good starter for debugging.

We are aware of the higher memory usage due to some functionality added to a tensor method and are thinking about different approaches to fix it.

The biggest difference between my model and NVIDIA's is this modification.

Without gradient balancing the loss curve is much better (red).

Does it matter whether the loss is calculated inside `forward` or outside?
Also, is it a good idea in general to balance the loss value like this?
Tacotron on its own has 2 losses.

Do you think it should be balanced too?
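For reference, Tacotron 2's two losses are a mel-spectrogram regression loss and a stop-token (gate) classification loss, and the NVIDIA implementation combines them as a plain sum. A minimal sketch with made-up tensor shapes (the weight `gate_weight` is a hypothetical knob, not something from the repo):

```python
import torch
import torch.nn as nn

# Fake outputs/targets just to show the shape of the computation:
# (batch, n_mels, frames) for the spectrogram, (batch, frames) for the gate.
mel_out = torch.randn(2, 80, 100, requires_grad=True)
mel_target = torch.randn(2, 80, 100)
gate_out = torch.randn(2, 100, requires_grad=True)
gate_target = (torch.rand(2, 100) > 0.5).float()

mel_loss = nn.MSELoss()(mel_out, mel_target)
gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target)

# A fixed weight (1.0 here, i.e. a plain sum) keeps the gradient scale
# identical across replicas. Rescaling the loss by a data-dependent factor
# makes each DDP replica produce gradients on a different scale, which can
# hurt after the gradients are averaged across GPUs.
gate_weight = 1.0
loss = mel_loss + gate_weight * gate_loss
loss.backward()
```

Whether the loss is computed inside `forward` or outside doesn't change autograd, but under DDP the loss is usually computed outside the wrapped module, on each replica's own outputs.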

About the batchnorm thing: I don't understand how batchnorm affects the parameters.
I mean, it only changes existing variables; it doesn't freeze layers or anything.
It works with the batchnorm layers in place and raises the error without them.

Maybe it's because without batchnorm I have only one layer in the module?
Or does that not matter?

I made another thread about the memory usage (it's not only a distributed thing) with an example; I hope it helps.

Are there any PyTorch options to debug this situation?
I made another test, and the gap seems to be really bad.

Not sure this is the case here, but in my case I was using autocast and GradScaler. I had both set to enabled=False. According to the docs this should mean they have no effect, which was in fact the case with a single GPU and with DP.

However, with DDP I found that introducing these increased variance in the training and validation loss significantly, deteriorating model accuracy overall. According to the docs autocast and GradScaler shouldn’t adversely affect DDP, but it did just that in my case. Not sure why, but I assume it has to do with gradient synchronization in DDP.
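For anyone comparing against the same setup, this is the pattern being described: per the docs, `autocast(enabled=False)` and `GradScaler(enabled=False)` are supposed to be no-ops, so the code below should behave exactly like a plain FP32 training step (a minimal single-process sketch, not the DDP run from the report):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=False)  # documented to make scale/step/update no-ops

x, y = torch.randn(8, 4), torch.randn(8, 1)

with autocast(enabled=False):  # documented no-op: runs in default precision
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # with enabled=False, scale() returns loss unchanged
scaler.step(opt)               # just calls opt.step()
scaler.update()
```

If DDP behaves differently with this wrapping than without it, that on its own is worth isolating, since the disabled scaler should not touch the gradients that DDP synchronizes.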