I’m new to DDP and getting some behavior I don’t understand. I have a line in my code that currently looks like this:
print(viewTuple)
newTensor = tensor.view(viewTuple)
viewTuple is defined earlier and is always either [1, -1] or [1, -1, 1, 1]. This line works properly on a single GPU, but on multiple GPUs it gives the error
RuntimeError: shape '[1, 1]' is invalid for input of size 1000
This happens even though the print never shows a [1, 1] shape. Any idea what could be going on? Could the values be getting overwritten by the second distributed process?
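For reference, the error message is exactly what a literal [1, 1] shape would produce outside DDP, which is why I'm confused that the print never shows it:

```python
import torch

t = torch.arange(1000)
t.view([1, -1])        # OK: shape (1, 1000)
t.view([1, -1, 1, 1])  # OK: shape (1, 1000, 1, 1)
t.view([1, 1])         # RuntimeError: shape '[1, 1]' is invalid for input of size 1000
```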
Is the `print` statement directly before applying the `view` operation showing `[1, -1]`, but then failing in the `view` operation? If so, I haven't seen such behavior before of Python `list`s being silently manipulated by multiprocessing.
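One thing that might help narrow it down: tag each debug print with the process rank, so you can tell which process produced which line of the interleaved output. A sketch, assuming the process group is already initialized (as it would be under DDP) and using your `viewTuple` and `tensor` names:

```python
import torch.distributed as dist

# Prefix every debug print with the rank so output from different
# processes can be told apart; flush so nothing is buffered.
rank = dist.get_rank()
print(f"[rank {rank}] viewTuple = {viewTuple}", flush=True)
newTensor = tensor.view(viewTuple)
```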
Yes, the `print` is directly before the `view`. It prints a bunch of shapes with a `-1` in them to allow for reshaping, but then I get the error with a `[1, 1]` shape. I also tried adding a `print("flush", flush=True)` after it to make sure all the prints are being flushed. Any other thoughts, or is this something where I need to create a minimal example to reproduce it for debugging?
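Something like this is the skeleton I'd start a minimal repro from (untested sketch; it uses the `gloo` backend on CPU for simplicity, and `view_tuple` here is a hypothetical stand-in for however the real value is computed):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.randn(1000)
    view_tuple = [1, -1]  # stand-in: replace with the real viewTuple logic
    print(f"[rank {rank}] viewTuple = {view_tuple}", flush=True)
    new_tensor = tensor.view(view_tuple)
    print(f"[rank {rank}] new shape = {tuple(new_tensor.shape)}", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```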