I am running different trainings on different GPUs, but all of them access the same data, i.e. PNG images in data directories.
The data is only read, not written to.
Is that a problem, and can it lead to inconsistent results?
I would be grateful for any information about this.
thanks in advance
I am running different trainings on 3 different GPUs, all accessing the same data. When running only 1 training, it is much faster than when running all 3 (on different GPUs). Is it because all programs are accessing the same data?
Can anyone please answer this?
If you are using nn.DataParallel and the model workload is small, you might see the overhead of the communication (the scatter/gather of model parameters, data, etc.).
We usually recommend using DistributedDataParallel with a single process for each device.
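For reference, a minimal DistributedDataParallel sketch with one process per device, typically launched with `torchrun --nproc_per_node=<num_gpus> train.py`. The environment-variable defaults below are only there so the snippet can also run standalone as a single CPU process; `torchrun` sets them for you, and the model/data here are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets these env vars; the defaults let the sketch run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
rank = dist.get_rank()

device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
model = torch.nn.Linear(10, 2).to(device)
# DDP all-reduces gradients across processes during backward().
ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
inputs = torch.randn(8, 10, device=device)
targets = torch.randn(8, 2, device=device)

loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
optimizer.zero_grad()
loss.backward()  # gradient synchronization across processes happens here
optimizer.step()

dist.destroy_process_group()
```

Each process trains on its own shard of the data (usually via a DistributedSampler), so the communication overhead is limited to the gradient all-reduce.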
I am not using nn.DataParallel. I am just running 3 different independent programs, each on a different GPU, that all access the same data. In this case, should I use DistributedDataParallel with a single process for each device? If so, can you please tell me how to use it? Also, would it be better to create 3 copies of the data so that each process reads its own copy? Could that reduce the overhead?
P.S. - I am using AWS’s p3.8xlarge machine
I think your IO operations might create a bottleneck if all processes read from the same disk.
Copying the data onto the same drive won't avoid the bottleneck, since all three processes would still share the same disk bandwidth; the copies would need to live on separate drives (or in memory) to help.
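If the dataset fits in RAM, one way to sidestep the shared-disk bottleneck is for each training process to read every file once and serve all later accesses from memory. A minimal sketch (the directory and file names are hypothetical, and the cache here is per process, so three programs cost roughly three times the RAM):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def load_image_bytes(path):
    # The first call hits the disk; repeated calls (e.g. across epochs)
    # are served from memory, so the trainings stop competing for IO.
    with open(path, "rb") as f:
        return f.read()

def load_dataset(data_dir):
    # Warm the cache with one sequential pass over every PNG in the directory.
    return {
        name: load_image_bytes(os.path.join(data_dir, name))
        for name in sorted(os.listdir(data_dir))
        if name.endswith(".png")
    }
```

After the first epoch (or the warm-up pass), each process only touches the disk for files not yet cached, so the three trainings no longer slow each other down on reads.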