DDP on 8 GPUs works much worse than on a single GPU

So I read this thread and some others, googled around, etc.

My model is a sequence-to-sequence model with variable-length input (Tacotron 2).

This is what the loss looks like:

Any suggestions? Any way to debug it?

Also, if I remove all batchnorm layers from the model, it raises an unused-variable error. How does that work?

Are you using any reference implementation as your code base or did you write the complete code yourself?
Could you compare your implementation to this one written by NVIDIA?

I'm using a fork of this repo

(it looks like the same one as the repo you linked),
but I have a lot of changes.

In that case I would focus on the error (about the unused variable) by reverting the changes and making sure your model still trains.

I assume the error you mentioned claims that no parameters requiring gradients were found during the backward pass.
If so, it's usually a sign that all parameters were frozen or that the computation graph was detached at some point.
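To make this concrete, here is a minimal, self-contained sketch (the `TwoBranch` module and its layer names are made up for illustration) showing what "unused" means from autograd's point of view: a parameter that never participates in `forward` receives no gradient, which is exactly what DDP's reducer complains about unless `find_unused_parameters=True` is passed to `DistributedDataParallel`:

```python
import torch
import torch.nn as nn

class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered, but never called in forward

    def forward(self, x):
        return self.used(x)

model = TwoBranch()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# `unused` never entered the graph, so its gradients stay None.
# DDP waits for a gradient from every registered parameter, which is why
# it raises an error for modules like this (or needs find_unused_parameters=True).
print(model.used.weight.grad is None)    # False
print(model.unused.weight.grad is None)  # True
```

If commenting out batchnorm leaves some registered submodule out of the forward pass, you would hit exactly this situation.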

So, this problem happens when I comment out the batchnorm layers.
When I put them back, it runs (but still worse than on a single GPU).

Another observation: I upgraded torch from 1.2 to 1.3.1 and now I get an OOM error on the same setup.

If the original code base works and you see this unwanted behavior, I would still recommend triaging the bug by removing your changed parts.
The error points to a (partly) frozen model. I’m not familiar with the use case, so it might be a red herring. Anyway, it might be a good starter for debugging.

We are aware of the higher memory usage due to some functionality added to a tensor method and are thinking about different approaches to fix it.

The biggest difference between my model and NVIDIA's is this modification.

Without gradient balancing the loss curve is much better (red).

Does it matter whether the loss is calculated inside `forward` or outside?
Also, is it a good idea in general to balance the loss value like this?
Tacotron on its own has 2 losses.

Do you think it should be balanced too?
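For reference, Tacotron 2's two losses are a mel-spectrogram regression loss and a stop-token (gate) classification loss, and the NVIDIA implementation combines them as a plain sum. A minimal sketch with made-up tensor shapes (the weight `gate_weight` is a hypothetical knob, not something from the repo):

```python
import torch
import torch.nn as nn

# Fake outputs/targets just to show the shape of the computation:
# (batch, n_mels, frames) for the spectrogram, (batch, frames) for the gate.
mel_out = torch.randn(2, 80, 100, requires_grad=True)
mel_target = torch.randn(2, 80, 100)
gate_out = torch.randn(2, 100, requires_grad=True)
gate_target = (torch.rand(2, 100) > 0.5).float()

mel_loss = nn.MSELoss()(mel_out, mel_target)
gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target)

# A fixed weight (1.0 here, i.e. a plain sum) keeps the gradient scale
# identical across replicas. Rescaling the loss by a data-dependent factor
# makes each DDP replica produce gradients on a different scale, which can
# hurt after the gradients are averaged across GPUs.
gate_weight = 1.0
loss = mel_loss + gate_weight * gate_loss
loss.backward()
```

Whether the loss is computed inside `forward` or outside doesn't change autograd, but under DDP the loss is usually computed outside the wrapped module, on each replica's own outputs.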

About the batchnorm thing: I don't understand how batchnorm affects the parameters.
I mean, it only changes existing variables; it doesn't freeze layers or anything.
It works with the batchnorm layers in place and raises the error without them.

Maybe it's because without batchnorm I have only one layer in the module?
Or does that not matter?

I made another thread about the memory usage (it's not only a distributed thing) with an example; I hope it helps.

Are there any PyTorch options to debug this situation?
I made another test, and the gap seems to be really bad.

Not sure this is the case here, but in my case I was using autocast and GradScaler. I had both set to enabled=False. According to the docs this should mean they have no effect, which was in fact the case with a single GPU and with DP.

However, with DDP I found that introducing these increased variance in the training and validation loss significantly, deteriorating model accuracy overall. According to the docs autocast and GradScaler shouldn’t adversely affect DDP, but it did just that in my case. Not sure why, but I assume it has to do with gradient synchronization in DDP.
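For anyone comparing against the same setup, this is the pattern being described: per the docs, `autocast(enabled=False)` and `GradScaler(enabled=False)` are supposed to be no-ops, so the code below should behave exactly like a plain FP32 training step (a minimal single-process sketch, not the DDP run from the report):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=False)  # documented to make scale/step/update no-ops

x, y = torch.randn(8, 4), torch.randn(8, 1)

with autocast(enabled=False):  # documented no-op: runs in default precision
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # with enabled=False, scale() returns loss unchanged
scaler.step(opt)               # just calls opt.step()
scaler.update()
```

If DDP behaves differently with this wrapping than without it, that on its own is worth isolating, since the disabled scaler should not touch the gradients that DDP synchronizes.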