I’m working with a DDP model on multiple GPUs on a single node.
After seeding all the reproducibility-related modules (random, numpy, torch), two training processes that ran to completion produced exactly the same output.
However, if one training process failed and exited by accident, restarting from a checkpoint produced different output.
I eventually traced the difference to a functional transformation that uses randomness:
```python
class Augment(object):
    def __call__(self, img):
        rand_int = random.randint(1, 10)  # just an example
        # ... other operations
        return img
```
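To show what I mean by the seeded runs matching, here is a minimal, self-contained sketch (the `Augment` class here is a toy stand-in for my real transform): with the same seed and the same number of calls, the draws from the global RNG are identical.

```python
import random

class Augment:
    """Toy transform: draws from the global random module on every call."""
    def __call__(self, img):
        rand_int = random.randint(1, 10)  # consumes one draw from the global RNG
        return (img, rand_int)

aug = Augment()

random.seed(42)
run_a = [aug(f"img_{i}")[1] for i in range(3)]

random.seed(42)
run_b = [aug(f"img_{i}")[1] for i in range(3)]

# Same seed and same number of calls -> identical draws
assert run_a == run_b
```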
For example, consider a training run that fails at epoch 3 and is restarted from the epoch-2 checkpoint:
EPOCH — DATA — TRANSFORMATION
1st — data_1 — Aug(data_1) # Aug called once
2nd — data_2 — Aug(data_2) # Aug called a second time
3rd — data_3 — Aug(data_3) # Aug should have been called a third time, but the code failed here
— Restarting —
3rd — data_3_restart — Aug(data_3_restart) # Aug called once (RNG has been re-seeded, so only one draw has been consumed)
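One direction I’m considering, sketched below under the assumption that the transform only uses Python’s `random` module: save the RNG state (`random.getstate()`) in the checkpoint and restore it with `random.setstate()` on restart, instead of just re-seeding. (The `torch` and `numpy` generators would presumably need the same treatment via their own get/set-state APIs.)

```python
import random

random.seed(0)

# Epochs 1 and 2 run normally, each consuming one draw
_ = random.randint(1, 10)   # epoch 1
_ = random.randint(1, 10)   # epoch 2

# Checkpoint after epoch 2: save the RNG state alongside the model weights
rng_state = random.getstate()

# Epoch 3 of the uninterrupted run would draw this value
expected_epoch3 = random.randint(1, 10)

# --- restart from the epoch-2 checkpoint ---
# Restore the saved state instead of calling random.seed() again
random.setstate(rng_state)
resumed_epoch3 = random.randint(1, 10)

# The restarted epoch 3 now draws the same value as the original run
assert resumed_epoch3 == expected_epoch3
```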
I’ve verified that data_3_restart is exactly the same as data_3, but Aug(data_3_restart) differs from Aug(data_3).
Sorry for the fuzzy explanation; any suggestions?
Thanks in advance.