I want to solicit some advice on how to debug strange model behavior in my training. I recently took up an ML project, and in an effort to speed up training, I replaced the original training script's DataParallel with torch's DistributedDataParallel plus NVIDIA Apex. The new model runs, but I've noticed that no matter how long I train it, the results never get any better than when it started, unlike the original implementation, which was able to get good results.
The only peculiar thing about this model is that it uses 2 separate Adam optimizers: one optimizes the layers' parameters, and the other optimizes a user-defined parameter with requires_grad enabled.
I am not quite sure why this is happening. Any advice is appreciated.
I would recommend scaling down the problem first, e.g. by removing Apex (and potentially other utilities), and testing whether the DDP model is working as expected.
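As a sketch of such a scaled-down check, you could run DDP in a single process on the "gloo" backend, without Apex, and verify the loss actually decreases on a toy regression task (the model and data here are stand-ins, not your actual training code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group; gloo also works on CPU
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Toy regression target: learn to sum the inputs
x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)

first_loss = None
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    if first_loss is None:
        first_loss = loss.item()
    loss.backward()
    opt.step()

dist.destroy_process_group()
```

If the loss does not drop here either, the problem is in the DDP/optimizer setup itself rather than in Apex.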
Note that we also recommend using the native automatic mixed-precision training from the current master branch or the nightly binaries. You can find the documentation here.
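The native mixed-precision utilities live in `torch.cuda.amp`; a minimal training-step sketch looks like this (the model and data are placeholders, and autocast/GradScaler are simply disabled when no GPU is present):

```python
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = nn.Linear(10, 1).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# GradScaler rescales the loss to avoid fp16 gradient underflow
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 10, device=device)
y = torch.randn(8, 1, device=device)

opt.zero_grad()
# Ops inside autocast run in fp16/fp32 as appropriate
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(opt)               # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next step
```

With `enabled=False` the scaler and autocast are transparent no-ops, so the same code runs unchanged on CPU.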
Thanks! I did not know that PyTorch now offers native mixed precision.