The bottleneck of my training routine is the data augmentation, which is already “sufficiently” optimized. To speed up hyperparameter search, I thought it'd be a good idea to train two models simultaneously, each on a different GPU, fed by a single dataloader.
As far as I understand, this could be seen as a form of model parallelism. However, my implementation failed.
Below is an example. Since both models start from the same weights and see the same batches, I'd expect the network weights to be identical after the first epoch. However, loss1 is equal to loss2 only in the first iteration. Detaching and cloning the batch before moving it to the graphics cards didn't change anything.
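Roughly, the setup looks like this (a minimal sketch with placeholder names; make_model() and loader stand in for my actual model and dataloader):

```python
import copy
import torch

# Two identical copies of the same model, one per GPU (make_model() is a placeholder).
model1 = make_model().to("cuda:0")
model2 = copy.deepcopy(model1).to("cuda:1")

opt1 = torch.optim.SGD(model1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(model2.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for data, target in loader:  # a single shared DataLoader (placeholder)
    x1, y1 = data.to("cuda:0"), target.to("cuda:0")
    x2, y2 = data.to("cuda:1"), target.to("cuda:1")

    opt1.zero_grad()
    loss1 = criterion(model1(x1), y1)
    loss1.backward()
    opt1.step()

    opt2.zero_grad()
    loss2 = criterion(model2(x2), y2)
    loss2.backward()
    opt2.step()

    print(loss1.item(), loss2.item())  # identical only in the first iteration
```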
You don’t have dropout in your models, right?
This could also be due to numerical precision and non-determinism; it's hard to tell with the information at hand. One indication would be if you cannot pinpoint a single place where they differ. Otherwise, you could compare the two models after the first iteration and find which forward activations or gradients differ.
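Something along these lines for the gradients (a rough sketch; it assumes model1 and model2 are your two copies and that their parameters line up):

```python
# After the first iteration's backward pass, compare gradients layer by layer.
for (name1, p1), (name2, p2) in zip(model1.named_parameters(),
                                    model2.named_parameters()):
    g1 = p1.grad.detach().cpu()
    g2 = p2.grad.detach().cpu()
    max_diff = (g1 - g2).abs().max().item()
    print(f"{name1}: max grad diff {max_diff:.3e}")
```

The first layer (from the output side) whose gradients diverge points you to the operation that introduces the difference.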
There’s no dropout in my models. I’ve also re-run the code with a model consisting of a single linear layer, and surprisingly, it works.
My actual model consists of conv, batch/instance norm, ReLU, adaptive average pooling, max pooling and linear layers, with skip connections. It’s essentially a ResNet.
Again, I really appreciate your feedback!
Edit:
Just noticed that the gradients of the input layer already differ in the first iteration. The maximum difference between the gradients of that layer is 8.5e-5.
Yeah, but when running both nets on the same device I get an error on the order of 1e-8, which seems to be within numerical precision.
When disabling cuDNN, the error goes to 0, but I don’t know exactly what causes it.
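For reference, this is roughly what I mean by disabling it (the deterministic flags below are the related settings, listed for completeness, not something I’ve verified fixes this):

```python
import torch

# Disable cuDNN entirely, falling back to the native kernels.
torch.backends.cudnn.enabled = False

# Alternatively, keep cuDNN but force deterministic algorithm selection.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```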
Hi, I’m using 8x RTX 3090s to train 8 models with one dataloader.
However, the forward and backward passes of the 8 models don’t seem to run in parallel.
Do you guys know how to accelerate the training?
You could implement a custom sampler or just use the default one (note that you might want to seed the code if you are shuffling the dataset).
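Something like this rough sketch, assuming you are shuffling (dataset and batch_size are placeholders):

```python
import torch
from torch.utils.data import DataLoader

# Seed the shuffling explicitly so every run (and every model fed from this
# loader) sees the same sample order.
g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(dataset, batch_size=64, shuffle=True, generator=g)
```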
However, I don’t fully understand your use case, since you would just repeat the same operation n_gpus times. This wouldn’t be considered distributed data parallel training, since each forward/backward pass would create identical results, wouldn’t it?