Can't achieve reproducibility / determinism in PyTorch training

Ecem_sogancioglu · November 30, 2021, 2:37pm

Hi,

I am having the same issue: I cannot get reproducible results when training the Faster R-CNN model in PyTorch, even though I have followed everything in the Reproducibility doc.

I set the seed at the beginning of my code as follows:

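import os
import random

import numpy as np
import torch

# Seeded generator, passed to the DataLoader so that shuffling is repeatable
g = torch.Generator()
g.manual_seed(10)

def seed_worker(worker_id):
    # Derive the NumPy and `random` seeds of each worker from the torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def set_seed(seed):
    # Note: set at runtime, this only affects subprocesses started afterwards
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False
    #torch.use_deterministic_algorithms(True)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)

set_seed(10)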

I have also disabled augmentation and set the number of workers to 0 in the data loader:

data_loader = torch.utils.data.DataLoader(
            dataset, batch_size=2, shuffle=True, num_workers=0,
            collate_fn=utils.collate_fn, worker_init_fn=seed_worker, generator=g)

Model is created as follows:

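import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# The lines below run inside my model class, hence the `self.` prefix
self.model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 2  # 1 class (nodule) + background
in_features = self.model.roi_heads.box_predictor.cls_score.in_features
self.model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
self.model.to(self.device)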

I use CUDA 10.2 with torch version 1.10.0, so I have also set the environment variable CUBLAS_WORKSPACE_CONFIG as mentioned in the Reproducibility doc. Do you have any idea why the results are not reproducible, and any suggestions for what I could try?
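For reference, here is a minimal sketch of one way to set this variable from Python. It assumes the variable has to be in place before any CUDA work starts, and that the value is one of the two options I remember from the Reproducibility doc (`:4096:8` or `:16:8`). The helper name `configure_cublas_workspace` is made up for illustration and is not part of PyTorch.

```python
import os

# Hypothetical helper (not a PyTorch API): sets the cuBLAS workspace
# configuration that deterministic mode expects on CUDA 10.2 and later.
def configure_cublas_workspace(value=":4096:8"):
    # Assumption: these are the two values listed in the Reproducibility doc.
    allowed = {":4096:8", ":16:8"}
    if value not in allowed:
        raise ValueError(f"unexpected CUBLAS_WORKSPACE_CONFIG value: {value!r}")
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = value
    return os.environ["CUBLAS_WORKSPACE_CONFIG"]

# Call this at the very top of the script, before torch touches the GPU.
configure_cublas_workspace()
```

Exporting the variable in the shell before launching the training script should have the same effect and avoids any question about import order.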

Also, if I set torch.use_deterministic_algorithms(True) after setting CUBLAS_WORKSPACE_CONFIG, I get the following error:

RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel() INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":250, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor10231

Many thanks, Ecem