I’m working with a DDP model on multiple GPUs on a single node.
After seeding all the reproducibility-related modules (random, numpy, torch), two training processes that ran to completion produced exactly the same output.
However, if one training process failed and exited by accident, restarting from a checkpoint produced different output.
I eventually traced the difference to a functional transformation that uses randomness:
```python
class Augment(object):
    def __call__(self, img):
        rand_int = random.randint(1, 10)  # just an example
        # ... other operations
        return img
```
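To show what I mean by the seeded runs matching, here is a minimal, self-contained sketch (the `Augment` class here is a toy stand-in for my real transform): with the same seed and the same number of calls, the draws from the global RNG are identical.

```python
import random

class Augment:
    """Toy transform: draws from the global random module on every call."""
    def __call__(self, img):
        rand_int = random.randint(1, 10)  # consumes one draw from the global RNG
        return (img, rand_int)

aug = Augment()

random.seed(42)
run_a = [aug(f"img_{i}")[1] for i in range(3)]

random.seed(42)
run_b = [aug(f"img_{i}")[1] for i in range(3)]

# Same seed and same number of calls -> identical draws
assert run_a == run_b
```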
For example, consider a training run that fails at epoch 3 and is restarted from the epoch-2 checkpoint:
EPOCH — DATA — TRANSFORMATION
1st — data_1 — Aug(data_1) # Aug called once
2nd — data_2 — Aug(data_2) # Aug called a second time
3rd — data_3 — Aug(data_3) # Aug should have been called a third time, but the code failed here
— Restarting —
3rd — data_3_restart — Aug(data_3_restart) # Aug called once (RNG has been re-seeded, so only one draw has been consumed)
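One direction I’m considering, sketched below under the assumption that the transform only uses Python’s `random` module: save the RNG state (`random.getstate()`) in the checkpoint and restore it with `random.setstate()` on restart, instead of just re-seeding. (The `torch` and `numpy` generators would presumably need the same treatment via their own get/set-state APIs.)

```python
import random

random.seed(0)

# Epochs 1 and 2 run normally, each consuming one draw
_ = random.randint(1, 10)   # epoch 1
_ = random.randint(1, 10)   # epoch 2

# Checkpoint after epoch 2: save the RNG state alongside the model weights
rng_state = random.getstate()

# Epoch 3 of the uninterrupted run would draw this value
expected_epoch3 = random.randint(1, 10)

# --- restart from the epoch-2 checkpoint ---
# Restore the saved state instead of calling random.seed() again
random.setstate(rng_state)
resumed_epoch3 = random.randint(1, 10)

# The restarted epoch 3 now draws the same value as the original run
assert resumed_epoch3 == expected_epoch3
```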
I’ve verified that data_3_restart is exactly the same as data_3, but Aug(data_3_restart) differs from Aug(data_3).
Sorry for the fuzzy explanation; any suggestions?
Thanks in advance.