Deterministic Training

Hi, I have followed the steps to train my model deterministically here

https://pytorch.org/docs/1.8.1/notes/randomness.html

However, I’m unable to reproduce my results. After a single call to backward(), I can see my models begin to diverge. I am attempting to train Faster RCNN on a custom dataset.

I’d be glad to include additional details, but maybe the following is enough to figure out what’s going on. My code includes

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
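
For reference, the randomness notes linked above wire a helper like this into the DataLoader via worker_init_fn and a seeded generator. A minimal sketch (train_dataset, batch size, and worker count here are placeholders, not from my code):

g = torch.Generator()
g.manual_seed(seed)

train_loader = torch.utils.data.DataLoader(
    train_dataset,              # placeholder dataset
    batch_size=2,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
)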

There is also a seeding block at the beginning of my main():

seed = 1
torch.use_deterministic_algorithms(True)  # still in beta
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

Am I missing something here, or is torch.use_deterministic_algorithms still not working? I’m running Python 3.7.4 with CUDA 10.2 and the following package versions:

torch==1.8.1
torchfile==0.1.0
torchvision==0.9.1

Could you post the model definition as well as the shapes of all input tensors?

Sure. The model is the version of Faster RCNN found here: vision/torchvision/models/detection at master · pytorch/vision · GitHub, and the code I’m working with is largely based on that and on the training references here: vision/references/detection at master · pytorch/vision · GitHub. However, I have made some minor modifications, and the version I’m using likely no longer matches the one on GitHub.

I might also note that I began this project with torch 1.5.0 and torchvision 0.6.0. Since recognizing this reproducibility problem, I am trying an upgrade to torch 1.8.1 and torchvision 0.9.1 so that I can use torch.use_deterministic_algorithms(). That said, if I could solve the problem with my older versions, that might suit me better, since I’ve copied files from torchvision 0.6.0 into my repo.

The model definition is as follows

    '''
    model = torchvision.models.detection.__dict__[args.model](num_classes=num_classes, pretrained=args.pretrained,
                                                              **kwargs)
    '''
    model = mymodels.__dict__[args.model](num_classes=num_classes,
            pretrained=args.pretrained, roi_drop_pct=args.roi_drop_pct, **kwargs)

The commented-out line is essentially the same, but I’ve created my own mymodels directory with my updated versions of faster_rcnn.py and related files. At this point, the differences between mymodels and torchvision.models.detection are minimal, though I plan to make larger modifications soon. Some relevant arguments are below and show that my model is [fasterrcnn_resnet50_fpn](github_link_redacted_new_user_limit/pytorch/vision/blob/59c6731897c2b8c48431136515ee80d235c9c2d1/torchvision/models/detection/faster_rcnn.py#L298), though, again, my version is older than the one on GitHub.

Namespace(aspect_ratio_group_factor=3, batch_size=12, device='cuda', dist_url='env://', distributed=False, model='fasterrcnn_resnet50_fpn')

The input tensors are images from 1080p video frames cut in half. During training, the inputs (images) look like this:

-> loss_dict = model(images, targets)
(Pdb++) len(images)
12
(Pdb++) images[0].shape
torch.Size([3, 540, 1920])

I apologize if there’s too much info: I just don’t want to leave anything out. I’d be glad to provide any additional information as well if it might help. Thanks for your quick reply, and hopefully we can get to the bottom of this.

Could you check the model in isolation by passing in the same input after calling model.eval() with deterministic algorithms enabled, and comparing the outputs?

I tried the following

      model.eval()
      outs = model(images)
      outs2 = model(images)

All parts of outs and outs2 are exactly the same. I can also say that if I run the same training experiment twice, the loss will be nearly the same for the first step or two then slowly diverge.
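
One way to localize where the first divergence happens might be to compare the gradients from two identical forward/backward passes on the same batch. A rough sketch (assuming model is in train mode and images/targets are one training batch; the RNG is re-seeded before each pass so proposal sampling matches):

def grads_snapshot(model, images, targets, seed=1):
    # reset RNG so both passes sample proposals identically
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    model.zero_grad()
    loss_dict = model(images, targets)
    sum(loss for loss in loss_dict.values()).backward()
    return {n: p.grad.detach().clone()
            for n, p in model.named_parameters() if p.grad is not None}

g1 = grads_snapshot(model, images, targets)
g2 = grads_snapshot(model, images, targets)
mismatched = [n for n in g1 if not torch.equal(g1[n], g2[n])]
print('parameters with differing grads:', mismatched[:5])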

I’ll post output from run1 and run2. Exact same command. I ran run1 for a few steps, stopped, and then ran run2 for a few steps. I ran this with a batch size of 1, and you can see the outputs slowly diverge if you look closely.

run1

Epoch: [0]  [   0/3684]  eta: 0:17:11  lr: 0.000020  loss: 2.0382 (2.0382)  loss_classifier: 1.3432 (1.3432)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6915 (0.6915)  loss_rpn_box_reg: 0.0035 (0.0035)  time: 0.2800  data: 0.1153  max mem: 715
Epoch: [0]  [   1/3684]  eta: 0:17:51  lr: 0.000030  loss: 2.0382 (2.0423)  loss_classifier: 1.3432 (1.3477)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6900 (0.6908)  loss_rpn_box_reg: 0.0035 (0.0038)  time: 0.2911  data: 0.1344  max mem: 1001
Epoch: [0]  [   2/3684]  eta: 0:14:53  lr: 0.000040  loss: 2.0464 (2.0477)  loss_classifier: 1.3432 (1.3440)  loss_box_reg: 0.0000 (0.0095)  loss_objectness: 0.6900 (0.6903)  loss_rpn_box_reg: 0.0040 (0.0039)  time: 0.2426  data: 0.0992  max mem: 1002
Epoch: [0]  [   3/3684]  eta: 0:14:37  lr: 0.000050  loss: 2.0464 (2.0489)  loss_classifier: 1.3369 (1.3422)  loss_box_reg: 0.0000 (0.0093)  loss_objectness: 0.6900 (0.6905)  loss_rpn_box_reg: 0.0040 (0.0069)  time: 0.2384  data: 0.1034  max mem: 1002
Epoch: [0]  [   4/3684]  eta: 0:13:21  lr: 0.000060  loss: 2.0464 (2.0431)  loss_classifier: 1.3369 (1.3391)  loss_box_reg: 0.0000 (0.0075)  loss_objectness: 0.6900 (0.6900)  loss_rpn_box_reg: 0.0041 (0.0065)  time: 0.2179  data: 0.0882  max mem: 1002
Epoch: [0]  [   5/3684]  eta: 0:13:08  lr: 0.000070  loss: 2.0382 (2.0409)  loss_classifier: 1.3365 (1.3372)  loss_box_reg: 0.0000 (0.0062)  loss_objectness: 0.6895 (0.6900)  loss_rpn_box_reg: 0.0041 (0.0076)  time: 0.2144  data: 0.0876  max mem: 1002
Epoch: [0]  [   6/3684]  eta: 0:13:28  lr: 0.000080  loss: 2.0382 (2.0362)  loss_classifier: 1.3365 (1.3325)  loss_box_reg: 0.0000 (0.0053)  loss_objectness: 0.6895 (0.6896)  loss_rpn_box_reg: 0.0052 (0.0087)  time: 0.2197  data: 0.0952  max mem: 1002
Epoch: [0]  [   7/3684]  eta: 0:12:49  lr: 0.000090  loss: 2.0301 (2.0332)  loss_classifier: 1.3276 (1.3301)  loss_box_reg: 0.0000 (0.0047)  loss_objectness: 0.6893 (0.6895)  loss_rpn_box_reg: 0.0052 (0.0089)  time: 0.2092  data: 0.0865  max mem: 1002
Epoch: [0]  [   8/3684]  eta: 0:13:00  lr: 0.000100  loss: 2.0301 (2.0278)  loss_classifier: 1.3276 (1.3247)  loss_box_reg: 0.0000 (0.0053)  loss_objectness: 0.6893 (0.6892)  loss_rpn_box_reg: 0.0061 (0.0086)  time: 0.2124  data: 0.0910  max mem: 1002
Epoch: [0]  [   9/3684]  eta: 0:12:32  lr: 0.000110  loss: 2.0200 (2.0244)  loss_classifier: 1.3265 (1.3203)  loss_box_reg: 0.0000 (0.0067)  loss_objectness: 0.6893 (0.6893)  loss_rpn_box_reg: 0.0052 (0.0081)  time: 0.2047  data: 0.0844  max mem: 1002
Epoch: [0]  [  10/3684]  eta: 0:12:28  lr: 0.000120  loss: 2.0200 (2.0166)  loss_classifier: 1.3265 (1.3136)  loss_box_reg: 0.0000 (0.0061)  loss_objectness: 0.6893 (0.6892)  loss_rpn_box_reg: 0.0052 (0.0078)  time: 0.2038  data: 0.0841  max mem: 1002
Epoch: [0]  [  11/3684]  eta: 0:12:14  lr: 0.000130  loss: 2.0122 (2.0145)  loss_classifier: 1.3136 (1.3043)  loss_box_reg: 0.0000 (0.0056)  loss_objectness: 0.6893 (0.6894)  loss_rpn_box_reg: 0.0052 (0.0153)  time: 0.2000  data: 0.0810  max mem: 1002
Epoch: [0]  [  12/3684]  eta: 0:12:31  lr: 0.000140  loss: 2.0122 (2.0048)  loss_classifier: 1.3136 (1.2952)  loss_box_reg: 0.0000 (0.0051)  loss_objectness: 0.6893 (0.6894)  loss_rpn_box_reg: 0.0061 (0.0151)  time: 0.2046  data: 0.0862  max mem: 1002
Epoch: [0]  [  13/3684]  eta: 0:12:18  lr: 0.000150  loss: 2.0075 (1.9951)  loss_classifier: 1.3043 (1.2863)  loss_box_reg: 0.0000 (0.0048)  loss_objectness: 0.6887 (0.6891)  loss_rpn_box_reg: 0.0061 (0.0149)  time: 0.2011  data: 0.0834  max mem: 1002
Epoch: [0]  [  14/3684]  eta: 0:12:07  lr: 0.000160  loss: 2.0075 (1.9882)  loss_classifier: 1.3043 (1.2740)  loss_box_reg: 0.0000 (0.0074)  loss_objectness: 0.6893 (0.6891)  loss_rpn_box_reg: 0.0099 (0.0176)  time: 0.1983  data: 0.0809  max mem: 1002
Epoch: [0]  [  15/3684]  eta: 0:12:28  lr: 0.000170  loss: 1.9938 (1.9751)  loss_classifier: 1.2814 (1.2614)  loss_box_reg: 0.0000 (0.0081)  loss_objectness: 0.6887 (0.6887)  loss_rpn_box_reg: 0.0066 (0.0169)  time: 0.2041  data: 0.0871  max mem: 1002
Epoch: [0]  [  16/3684]  eta: 0:12:12  lr: 0.000180  loss: 1.9938 (1.9628)  loss_classifier: 1.2814 (1.2481)  loss_box_reg: 0.0000 (0.0102)  loss_objectness: 0.6887 (0.6884)  loss_rpn_box_reg: 0.0066 (0.0162)  time: 0.1996  data: 0.0830  max mem: 1002
Epoch: [0]  [  17/3684]  eta: 0:12:19  lr: 0.000190  loss: 1.9915 (1.9494)  loss_classifier: 1.2813 (1.2354)  loss_box_reg: 0.0000 (0.0098)  loss_objectness: 0.6884 (0.6882)  loss_rpn_box_reg: 0.0066 (0.0160)  time: 0.2018  data: 0.0855  max mem: 1002
Epoch: [0]  [  18/3684]  eta: 0:12:27  lr: 0.000200  loss: 1.9915 (1.9312)  loss_classifier: 1.2813 (1.2185)  loss_box_reg: 0.0000 (0.0092)  loss_objectness: 0.6884 (0.6880)  loss_rpn_box_reg: 0.0066 (0.0154)  time: 0.2040  data: 0.0875  max mem: 1002
Epoch: [0]  [  19/3684]  eta: 0:12:34  lr: 0.000210  loss: 1.9850 (1.9251)  loss_classifier: 1.2456 (1.2027)  loss_box_reg: 0.0000 (0.0101)  loss_objectness: 0.6882 (0.6879)  loss_rpn_box_reg: 0.0066 (0.0244)  time: 0.2060  data: 0.0897  max mem: 1002
Epoch: [0]  [  20/3684]  eta: 0:12:39  lr: 0.000220  loss: 1.9387 (1.9042)  loss_classifier: 1.2020 (1.1826)  loss_box_reg: 0.0000 (0.0104)  loss_objectness: 0.6882 (0.6874)  loss_rpn_box_reg: 0.0099 (0.0237)  time: 0.2038  data: 0.0902  max mem: 1002
Epoch: [0]  [  21/3684]  eta: 0:12:27  lr: 0.000230  loss: 1.8914 (1.8828)  loss_classifier: 1.1865 (1.1625)  loss_box_reg: 0.0000 (0.0100)  loss_objectness: 0.6877 (0.6869)  loss_rpn_box_reg: 0.0105 (0.0234)  time: 0.1955  data: 0.0838  max mem: 1002
Epoch: [0]  [  22/3684]  eta: 0:12:16  lr: 0.000240  loss: 1.8886 (1.8611)  loss_classifier: 1.1704 (1.1413)  loss_box_reg: 0.0000 (0.0108)  loss_objectness: 0.6872 (0.6864)  loss_rpn_box_reg: 0.0105 (0.0226)  time: 0.1949  data: 0.0836  max mem: 1002
Epoch: [0]  [  23/3684]  eta: 0:12:24  lr: 0.000250  loss: 1.8679 (1.8338)  loss_classifier: 1.1021 (1.1156)  loss_box_reg: 0.0000 (0.0105)  loss_objectness: 0.6871 (0.6857)  loss_rpn_box_reg: 0.0099 (0.0220)  time: 0.1965  data: 0.0852  max mem: 1002
Epoch: [0]  [  24/3684]  eta: 0:12:29  lr: 0.000260  loss: 1.8105 (1.8056)  loss_classifier: 1.0725 (1.0888)  loss_box_reg: 0.0000 (0.0101)  loss_objectness: 0.6851 (0.6849)  loss_rpn_box_reg: 0.0105 (0.0218)  time: 0.2014  data: 0.0899  max mem: 1002
Epoch: [0]  [  25/3684]  eta: 0:12:30  lr: 0.000270  loss: 1.7799 (1.7783)  loss_classifier: 1.0348 (1.0624)  loss_box_reg: 0.0026 (0.0099)  loss_objectness: 0.6850 (0.6841)  loss_rpn_box_reg: 0.0105 (0.0219)  time: 0.2022  data: 0.0908  max mem: 1002
Epoch: [0]  [  26/3684]  eta: 0:12:24  lr: 0.000280  loss: 1.7658 (1.7449)  loss_classifier: 1.0194 (1.0310)  loss_box_reg: 0.0026 (0.0095)  loss_objectness: 0.6836 (0.6831)  loss_rpn_box_reg: 0.0100 (0.0214)  time: 0.1978  data: 0.0864  max mem: 1002
Epoch: [0]  [  27/3684]  eta: 0:12:14  lr: 0.000290  loss: 1.7212 (1.7176)  loss_classifier: 0.9156 (1.0056)  loss_box_reg: 0.0026 (0.0092)  loss_objectness: 0.6832 (0.6819)  loss_rpn_box_reg: 0.0100 (0.0209)  time: 0.1974  data: 0.0860  max mem: 1002
Epoch: [0]  [  28/3684]  eta: 0:12:18  lr: 0.000300  loss: 1.6031 (1.6868)  loss_classifier: 0.9014 (0.9773)  loss_box_reg: 0.0000 (0.0089)  loss_objectness: 0.6827 (0.6803)  loss_rpn_box_reg: 0.0100 (0.0203)  time: 0.1973  data: 0.0859  max mem: 1002
Epoch: [0]  [  29/3684]  eta: 0:12:21  lr: 0.000310  loss: 1.4847 (1.6584)  loss_classifier: 0.7819 (0.9510)  loss_box_reg: 0.0000 (0.0092)  loss_objectness: 0.6769 (0.6782)  loss_rpn_box_reg: 0.0105 (0.0200)  time: 0.2018  data: 0.0903  max mem: 1002
Epoch: [0]  [  30/3684]  eta: 0:12:20  lr: 0.000320  loss: 1.4331 (1.6275)  loss_classifier: 0.7388 (0.9236)  loss_box_reg: 0.0002 (0.0089)  loss_objectness: 0.6758 (0.6754)  loss_rpn_box_reg: 0.0105 (0.0196)  time: 0.2020  data: 0.0907  max mem: 1002
Epoch: [0]  [  31/3684]  eta: 0:12:33  lr: 0.000330  loss: 1.3837 (1.6015)  loss_classifier: 0.6750 (0.9002)  loss_box_reg: 0.0012 (0.0086)  loss_objectness: 0.6746 (0.6728)  loss_rpn_box_reg: 0.0105 (0.0198)  time: 0.2099  data: 0.0986  max mem: 1002
Epoch: [0]  [  32/3684]  eta: 0:12:38  lr: 0.000340  loss: 1.2076 (1.5749)  loss_classifier: 0.5261 (0.8773)  loss_box_reg: 0.0012 (0.0084)  loss_objectness: 0.6699 (0.6693)  loss_rpn_box_reg: 0.0105 (0.0198)  time: 0.2099  data: 0.0984  max mem: 1002
Epoch: [0]  [  33/3684]  eta: 0:12:43  lr: 0.000350  loss: 1.1288 (1.5473)  loss_classifier: 0.4452 (0.8535)  loss_box_reg: 0.0026 (0.0083)  loss_objectness: 0.6666 (0.6655)  loss_rpn_box_reg: 0.0105 (0.0199)  time: 0.2149  data: 0.1034  max mem: 1002
Killing subprocess 86494
Main process received SIGINT, exiting

run2

Epoch: [0]  [   0/3684]  eta: 0:18:39  lr: 0.000020  loss: 2.0382 (2.0382)  loss_classifier: 1.3432 (1.3432)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6915 (0.6915)  loss_rpn_box_reg: 0.0035 (0.0035)  time: 0.3039  data: 0.1116  max mem: 715
Epoch: [0]  [   1/3684]  eta: 0:18:53  lr: 0.000030  loss: 2.0382 (2.0423)  loss_classifier: 1.3432 (1.3477)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6900 (0.6908)  loss_rpn_box_reg: 0.0035 (0.0038)  time: 0.3078  data: 0.1319  max mem: 1001
Epoch: [0]  [   2/3684]  eta: 0:15:22  lr: 0.000040  loss: 2.0464 (2.0476)  loss_classifier: 1.3432 (1.3439)  loss_box_reg: 0.0000 (0.0095)  loss_objectness: 0.6900 (0.6903)  loss_rpn_box_reg: 0.0040 (0.0039)  time: 0.2506  data: 0.0971  max mem: 1002
Epoch: [0]  [   3/3684]  eta: 0:14:51  lr: 0.000050  loss: 2.0464 (2.0489)  loss_classifier: 1.3370 (1.3422)  loss_box_reg: 0.0000 (0.0093)  loss_objectness: 0.6900 (0.6905)  loss_rpn_box_reg: 0.0040 (0.0069)  time: 0.2422  data: 0.1001  max mem: 1002
Epoch: [0]  [   4/3684]  eta: 0:13:31  lr: 0.000060  loss: 2.0464 (2.0431)  loss_classifier: 1.3370 (1.3390)  loss_box_reg: 0.0000 (0.0075)  loss_objectness: 0.6900 (0.6900)  loss_rpn_box_reg: 0.0041 (0.0065)  time: 0.2204  data: 0.0854  max mem: 1002
Epoch: [0]  [   5/3684]  eta: 0:13:10  lr: 0.000070  loss: 2.0382 (2.0409)  loss_classifier: 1.3363 (1.3371)  loss_box_reg: 0.0000 (0.0062)  loss_objectness: 0.6895 (0.6900)  loss_rpn_box_reg: 0.0041 (0.0076)  time: 0.2150  data: 0.0846  max mem: 1002
Epoch: [0]  [   6/3684]  eta: 0:13:18  lr: 0.000080  loss: 2.0382 (2.0362)  loss_classifier: 1.3363 (1.3325)  loss_box_reg: 0.0000 (0.0053)  loss_objectness: 0.6895 (0.6896)  loss_rpn_box_reg: 0.0052 (0.0087)  time: 0.2171  data: 0.0899  max mem: 1002
Epoch: [0]  [   7/3684]  eta: 0:12:38  lr: 0.000090  loss: 2.0301 (2.0332)  loss_classifier: 1.3276 (1.3301)  loss_box_reg: 0.0000 (0.0047)  loss_objectness: 0.6893 (0.6895)  loss_rpn_box_reg: 0.0052 (0.0089)  time: 0.2063  data: 0.0818  max mem: 1002
Epoch: [0]  [   8/3684]  eta: 0:12:59  lr: 0.000100  loss: 2.0301 (2.0278)  loss_classifier: 1.3276 (1.3247)  loss_box_reg: 0.0000 (0.0053)  loss_objectness: 0.6893 (0.6892)  loss_rpn_box_reg: 0.0061 (0.0086)  time: 0.2119  data: 0.0891  max mem: 1002
Epoch: [0]  [   9/3684]  eta: 0:12:29  lr: 0.000110  loss: 2.0198 (2.0245)  loss_classifier: 1.3263 (1.3204)  loss_box_reg: 0.0000 (0.0067)  loss_objectness: 0.6893 (0.6893)  loss_rpn_box_reg: 0.0052 (0.0081)  time: 0.2041  data: 0.0827  max mem: 1002
Epoch: [0]  [  10/3684]  eta: 0:12:26  lr: 0.000120  loss: 2.0198 (2.0166)  loss_classifier: 1.3263 (1.3136)  loss_box_reg: 0.0000 (0.0061)  loss_objectness: 0.6893 (0.6892)  loss_rpn_box_reg: 0.0052 (0.0078)  time: 0.2032  data: 0.0828  max mem: 1002
Epoch: [0]  [  11/3684]  eta: 0:12:11  lr: 0.000130  loss: 2.0119 (2.0145)  loss_classifier: 1.3133 (1.3042)  loss_box_reg: 0.0000 (0.0056)  loss_objectness: 0.6893 (0.6894)  loss_rpn_box_reg: 0.0052 (0.0153)  time: 0.1993  data: 0.0800  max mem: 1002
Epoch: [0]  [  12/3684]  eta: 0:12:26  lr: 0.000140  loss: 2.0119 (2.0050)  loss_classifier: 1.3133 (1.2954)  loss_box_reg: 0.0000 (0.0051)  loss_objectness: 0.6893 (0.6894)  loss_rpn_box_reg: 0.0061 (0.0151)  time: 0.2033  data: 0.0847  max mem: 1002
Epoch: [0]  [  13/3684]  eta: 0:12:12  lr: 0.000150  loss: 2.0080 (1.9952)  loss_classifier: 1.3048 (1.2864)  loss_box_reg: 0.0000 (0.0048)  loss_objectness: 0.6887 (0.6891)  loss_rpn_box_reg: 0.0061 (0.0149)  time: 0.1996  data: 0.0816  max mem: 1002
Epoch: [0]  [  14/3684]  eta: 0:12:01  lr: 0.000160  loss: 2.0080 (1.9879)  loss_classifier: 1.3048 (1.2737)  loss_box_reg: 0.0000 (0.0074)  loss_objectness: 0.6893 (0.6891)  loss_rpn_box_reg: 0.0099 (0.0176)  time: 0.1967  data: 0.0793  max mem: 1002
Epoch: [0]  [  15/3684]  eta: 0:12:21  lr: 0.000170  loss: 1.9946 (1.9751)  loss_classifier: 1.2820 (1.2614)  loss_box_reg: 0.0000 (0.0081)  loss_objectness: 0.6887 (0.6887)  loss_rpn_box_reg: 0.0066 (0.0169)  time: 0.2021  data: 0.0852  max mem: 1002
Epoch: [0]  [  16/3684]  eta: 0:12:04  lr: 0.000180  loss: 1.9946 (1.9630)  loss_classifier: 1.2820 (1.2483)  loss_box_reg: 0.0000 (0.0102)  loss_objectness: 0.6887 (0.6884)  loss_rpn_box_reg: 0.0066 (0.0162)  time: 0.1976  data: 0.0813  max mem: 1002
Epoch: [0]  [  17/3684]  eta: 0:12:11  lr: 0.000190  loss: 1.9904 (1.9489)  loss_classifier: 1.2814 (1.2349)  loss_box_reg: 0.0000 (0.0098)  loss_objectness: 0.6884 (0.6882)  loss_rpn_box_reg: 0.0066 (0.0160)  time: 0.1995  data: 0.0836  max mem: 1002
Epoch: [0]  [  18/3684]  eta: 0:12:17  lr: 0.000200  loss: 1.9904 (1.9305)  loss_classifier: 1.2814 (1.2178)  loss_box_reg: 0.0000 (0.0092)  loss_objectness: 0.6884 (0.6880)  loss_rpn_box_reg: 0.0066 (0.0154)  time: 0.2012  data: 0.0857  max mem: 1002
Epoch: [0]  [  19/3684]  eta: 0:12:21  lr: 0.000210  loss: 1.9850 (1.9243)  loss_classifier: 1.2452 (1.2018)  loss_box_reg: 0.0000 (0.0101)  loss_objectness: 0.6882 (0.6879)  loss_rpn_box_reg: 0.0066 (0.0244)  time: 0.2023  data: 0.0871  max mem: 1002
Epoch: [0]  [  20/3684]  eta: 0:12:24  lr: 0.000220  loss: 1.9382 (1.9027)  loss_classifier: 1.2009 (1.1810)  loss_box_reg: 0.0000 (0.0107)  loss_objectness: 0.6882 (0.6874)  loss_rpn_box_reg: 0.0099 (0.0237)  time: 0.1983  data: 0.0874  max mem: 1002
Epoch: [0]  [  21/3684]  eta: 0:12:14  lr: 0.000230  loss: 1.8916 (1.8804)  loss_classifier: 1.1895 (1.1598)  loss_box_reg: 0.0000 (0.0102)  loss_objectness: 0.6877 (0.6869)  loss_rpn_box_reg: 0.0105 (0.0234)  time: 0.1897  data: 0.0811  max mem: 1002
Epoch: [0]  [  22/3684]  eta: 0:12:03  lr: 0.000240  loss: 1.8851 (1.8585)  loss_classifier: 1.1701 (1.1385)  loss_box_reg: 0.0000 (0.0111)  loss_objectness: 0.6872 (0.6864)  loss_rpn_box_reg: 0.0105 (0.0226)  time: 0.1895  data: 0.0809  max mem: 1002
Epoch: [0]  [  23/3684]  eta: 0:12:12  lr: 0.000250  loss: 1.8677 (1.8309)  loss_classifier: 1.0958 (1.1125)  loss_box_reg: 0.0000 (0.0107)  loss_objectness: 0.6871 (0.6857)  loss_rpn_box_reg: 0.0099 (0.0220)  time: 0.1917  data: 0.0829  max mem: 1002
Epoch: [0]  [  24/3684]  eta: 0:12:19  lr: 0.000260  loss: 1.8066 (1.8026)  loss_classifier: 1.0762 (1.0856)  loss_box_reg: 0.0000 (0.0103)  loss_objectness: 0.6851 (0.6849)  loss_rpn_box_reg: 0.0105 (0.0218)  time: 0.1976  data: 0.0887  max mem: 1002
Epoch: [0]  [  25/3684]  eta: 0:12:21  lr: 0.000270  loss: 1.7836 (1.7753)  loss_classifier: 1.0390 (1.0593)  loss_box_reg: 0.0026 (0.0101)  loss_objectness: 0.6850 (0.6841)  loss_rpn_box_reg: 0.0105 (0.0219)  time: 0.1989  data: 0.0899  max mem: 1002
Epoch: [0]  [  26/3684]  eta: 0:12:14  lr: 0.000280  loss: 1.7700 (1.7420)  loss_classifier: 1.0066 (1.0278)  loss_box_reg: 0.0026 (0.0097)  loss_objectness: 0.6836 (0.6830)  loss_rpn_box_reg: 0.0100 (0.0214)  time: 0.1951  data: 0.0861  max mem: 1002
Epoch: [0]  [  27/3684]  eta: 0:12:05  lr: 0.000290  loss: 1.7084 (1.7136)  loss_classifier: 0.9115 (1.0015)  loss_box_reg: 0.0026 (0.0094)  loss_objectness: 0.6832 (0.6819)  loss_rpn_box_reg: 0.0100 (0.0209)  time: 0.1951  data: 0.0859  max mem: 1002
Killing subprocess 87072
Main process received SIGINT, exiting

Thanks for the update! Could you check whether updating to the nightly release changes the behavior or raises an error after setting use_deterministic_algorithms(True)?
If that’s not the case, could you run a debugging step using predefined tensors (store the tensors locally and just load them, or use e.g. torch.full(size, loop_index)) and compare the outputs again?
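
Something along these lines could serve as a starting point for that check (just a sketch; the shape comes from your pdb output above, and model is assumed to be on the GPU and in eval mode):

device = torch.device('cuda')
model.eval()
with torch.no_grad():
    for i in range(10):
        # constant, perfectly reproducible inputs instead of real data
        images = [torch.full((3, 540, 1920), i / 100, device=device)]
        outputs = model(images)
        print(i, outputs[0]['scores'][:5])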

I just tried installing and running with the nightly build. First, I got an error related to my distributed launch. The command I typically use to run my code is

time CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1

However, I’m currently not running distributed while debugging and have args.distributed defaulted to False. This command gives a lot of output and seems to try to run my code multiple times, and I wonder whether that is somehow causing my errors. There is a lot of output, including INFO and WARNING messages, but you can see the ERRORs as well.

/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  "The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_h69nlzad/none_lpixji6r
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_h69nlzad/none_lpixji6r/attempt_0/0/error.json
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f76dbd97302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f76dbd93c9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f76dbd9418e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f7524c9a268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f7566bac423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 81719) of binary: /home/mcever/.virtualenvs/tchnite/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_h69nlzad/none_lpixji6r/attempt_1/0/error.json
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4be06e0302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f4be06dcc9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f4be06dd18e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f4a27e66268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f4a69d78423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 82044) of binary: /home/mcever/.virtualenvs/tchnite/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_h69nlzad/none_lpixji6r/attempt_2/0/error.json
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6bede91302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f6bede8dc9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f6bede8e18e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f6a36d94268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f6a78ca6423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 82370) of binary: /home/mcever/.virtualenvs/tchnite/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=3
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_h69nlzad/none_lpixji6r/attempt_3/0/error.json
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fb431662302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fb43165ec9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7fb43165f18e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7fb27a565268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7fb2bc477423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 82746) of binary: /home/mcever/.virtualenvs/tchnite/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006990432739257812 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "82746", "role": "default", "hostname": "mind", "state": "FAILED", "total_run_time": 40, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "mind", "state": "SUCCEEDED", "total_run_time": 40, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 82746 (local_rank 0) FAILED (exitcode -6)
Error msg: Signal 6 (SIGABRT) received by PID 82746
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*************************************************
                 train.py FAILED                 
=================================================
Root Cause:
[0]:
  time: 2021-05-21_09:52:12
  rank: 0 (local_rank: 0)
  exitcode: -6 (pid: 82746)
  error_file: <N/A>
  msg: "Signal 6 (SIGABRT) received by PID 82746"
=================================================
Other Failures:
  <NO_OTHER_FAILURES>
*************************************************


real	0m40.962s
user	1m6.204s
sys	0m51.743s

Seeing this, I decided to try running without distributed.launch, which is something I haven’t done with this code because I wanted to keep the ability to run in a distributed manner. But if that is the cost of reproducibility, I think I can afford to run on a single GPU. So I tried the following command

time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1

With the nightly build, I still get an error:

not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f070fd5c302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f070fd58c9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f070fd5918e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f0558c5f268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f059ab71423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

Aborted (core dumped)

The error above seems to be the most important one, though also the most cryptic.

Going back to torch-1.8.1, I decided to try the non-distributed command:

time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1 --print-freq 1

I ran this command twice, and the losses still begin to diverge slightly after a few steps.

run 1 output:

Epoch: [0]  [   0/3684]  eta: 0:16:34  lr: 0.000200  loss: 2.0382 (2.0382)  loss_classifier: 1.3432 (1.3432)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6915 (0.6915)  loss_rpn_box_reg: 0.0035 (0.0035)  time: 0.2699  data: 0.1061  max mem: 715
Epoch: [0]  [   1/3684]  eta: 0:17:34  lr: 0.000300  loss: 2.0382 (2.0409)  loss_classifier: 1.3432 (1.3464)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6900 (0.6907)  loss_rpn_box_reg: 0.0035 (0.0038)  time: 0.2864  data: 0.1331  max mem: 1001
Epoch: [0]  [   2/3684]  eta: 0:14:35  lr: 0.000400  loss: 2.0406 (2.0408)  loss_classifier: 1.3432 (1.3372)  loss_box_reg: 0.0000 (0.0095)  loss_objectness: 0.6900 (0.6902)  loss_rpn_box_reg: 0.0040 (0.0039)  time: 0.2378  data: 0.0984  max mem: 1002
Epoch: [0]  [   3/3684]  eta: 0:14:23  lr: 0.000500  loss: 2.0382 (2.0326)  loss_classifier: 1.3189 (1.3283)  loss_box_reg: 0.0000 (0.0072)  loss_objectness: 0.6900 (0.6903)  loss_rpn_box_reg: 0.0040 (0.0069)  time: 0.2345  data: 0.1017  max mem: 1002
Epoch: [0]  [   4/3684]  eta: 0:13:13  lr: 0.000600  loss: 2.0382 (2.0142)  loss_classifier: 1.3189 (1.3124)  loss_box_reg: 0.0000 (0.0057)  loss_objectness: 0.6900 (0.6896)  loss_rpn_box_reg: 0.0041 (0.0065)  time: 0.2157  data: 0.0870  max mem: 1002
Epoch: [0]  [   5/3684]  eta: 0:13:01  lr: 0.000699  loss: 2.0080 (1.9913)  loss_classifier: 1.3016 (1.2898)  loss_box_reg: 0.0000 (0.0048)  loss_objectness: 0.6891 (0.6891)  loss_rpn_box_reg: 0.0041 (0.0076)  time: 0.2123  data: 0.0864  max mem: 1002
Epoch: [0]  [   6/3684]  eta: 0:13:13  lr: 0.000799  loss: 2.0080 (1.9580)  loss_classifier: 1.3016 (1.2569)  loss_box_reg: 0.0000 (0.0041)  loss_objectness: 0.6891 (0.6883)  loss_rpn_box_reg: 0.0052 (0.0087)  time: 0.2158  data: 0.0918  max mem: 1002

run 2 output:

Epoch: [0]  [   0/3684]  eta: 0:16:12  lr: 0.000200  loss: 2.0382 (2.0382)  loss_classifier: 1.3432 (1.3432)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6915 (0.6915)  loss_rpn_box_reg: 0.0035 (0.0035)  time: 0.2640  data: 0.0829  max mem: 715
Epoch: [0]  [   1/3684]  eta: 0:16:46  lr: 0.000300  loss: 2.0382 (2.0410)  loss_classifier: 1.3432 (1.3464)  loss_box_reg: 0.0000 (0.0000)  loss_objectness: 0.6900 (0.6907)  loss_rpn_box_reg: 0.0035 (0.0038)  time: 0.2732  data: 0.1137  max mem: 1001
Epoch: [0]  [   2/3684]  eta: 0:13:54  lr: 0.000400  loss: 2.0405 (2.0408)  loss_classifier: 1.3432 (1.3372)  loss_box_reg: 0.0000 (0.0095)  loss_objectness: 0.6900 (0.6902)  loss_rpn_box_reg: 0.0040 (0.0039)  time: 0.2267  data: 0.0848  max mem: 1002
Epoch: [0]  [   3/3684]  eta: 0:13:45  lr: 0.000500  loss: 2.0382 (2.0330)  loss_classifier: 1.3188 (1.3287)  loss_box_reg: 0.0000 (0.0072)  loss_objectness: 0.6900 (0.6903)  loss_rpn_box_reg: 0.0040 (0.0069)  time: 0.2243  data: 0.0908  max mem: 1002
Epoch: [0]  [   4/3684]  eta: 0:12:38  lr: 0.000600  loss: 2.0382 (2.0154)  loss_classifier: 1.3188 (1.3135)  loss_box_reg: 0.0000 (0.0057)  loss_objectness: 0.6900 (0.6896)  loss_rpn_box_reg: 0.0041 (0.0065)  time: 0.2061  data: 0.0781  max mem: 1002
Epoch: [0]  [   5/3684]  eta: 0:12:27  lr: 0.000699  loss: 2.0094 (1.9924)  loss_classifier: 1.3030 (1.2909)  loss_box_reg: 0.0000 (0.0048)  loss_objectness: 0.6891 (0.6891)  loss_rpn_box_reg: 0.0041 (0.0076)  time: 0.2032  data: 0.0785  max mem: 1002
Epoch: [0]  [   6/3684]  eta: 0:12:43  lr: 0.000799  loss: 2.0094 (1.9581)  loss_classifier: 1.3030 (1.2569)  loss_box_reg: 0.0000 (0.0041)  loss_objectness: 0.6891 (0.6883)  loss_rpn_box_reg: 0.0052 (0.0087)  time: 0.2077  data: 0.0855  max mem: 1002

If I run my script with the nightly version and comment out torch.use_deterministic_algorithms(True):

(tchnite) $ time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1 --print-freq 1

It seemingly runs without error, but again the losses aren’t the same across runs, so that second error seems important; I’m just not sure how to fix it.

I think it’s a good idea to skip the distributed run for now to reduce the scope of the issue to reproducibility. Based on the nightly error, it seems as if a deterministic version of a pooling layer is causing the failure. Could you post the model definition or the pooling layers with their input shapes so that we can debug it further?
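
To save you from pasting the whole definition, something like the following could collect the pooling layers and the input shapes they see during one forward pass (a sketch; which layer types to include is an assumption):

import torch
import torch.nn as nn

def log_pool_input_shapes(model):
    # print the input shapes seen by pooling-style layers during a forward pass
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            shapes = [tuple(t.shape) for t in inputs if isinstance(t, torch.Tensor)]
            print(f'{name} ({module.__class__.__name__}): {shapes}')
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.MaxPool2d, nn.AvgPool2d, nn.AdaptiveAvgPool2d)):
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each one afterwards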

I’m trying to figure out how best to answer this. What exactly do you mean by model definition? Altogether, faster_rcnn is spread across multiple files, and most of them are in torchvision. At this point, my code makes only a few modifications to it.

To help with sharing the model, I tried the following, which you should be able to reproduce easily. It should also isolate the issue to something in torchvision rather than my code. Here’s my script, simple_train.py:

import torch
import torchvision
import numpy as np
import random
import os
import math
import sys

NCLASS = 3


def reduce_dict(input_dict, average=True):
    """
    Args:
        input_dict (dict): all the values will be reduced
        average (bool): whether to do average or sum
    Reduce the values in the dictionary from all processes so that all processes
    have the averaged results. Returns a dict with the same fields as
    input_dict, after reduction.
    """
    # world_size = get_world_size()
    world_size = 1
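    # NOTE: torch.distributed would need to be imported as dist for the world_size >= 2 branch below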
    if world_size < 2:
        return input_dict
    with torch.no_grad():
        names = []
        values = []
        # sort the keys so that they are consistent across processes
        for k in sorted(input_dict.keys()):
            names.append(k)
            values.append(input_dict[k])
        values = torch.stack(values, dim=0)
        dist.all_reduce(values)
        if average:
            values /= world_size
        reduced_dict = {k: v for k, v in zip(names, values)}
    return reduced_dict

def generate_target(ind):

    # nboxes = random.randint(1, 3)
    boxes = []
    labels = []
    nboxes = 2
    for i in range(nboxes):
        xtl = 1
        xbr = 2
        ytl = 3
        ybr = 4
        box = [xtl, ytl, xbr, ybr]
        label = 2
        boxes.append(box)
        labels.append(label)

    one_hot = torch.zeros(NCLASS)
    imid = torch.tensor([ind])
    iscrowd = torch.zeros(len(boxes))
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.int64)
    area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])

    target = {'one_hot': one_hot, 'id': imid, 'boxes': boxes, 'labels': labels, 'image_id': imid, 'area': area, 'iscrowd': iscrowd}
    return target

seed = 1
'''
torch.use_deterministic_algorithms(True) # , in beta....
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ['PYTHONHASHSEED'] = str(seed)
'''
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

model = torchvision.models.detection.__dict__['fasterrcnn_resnet50_fpn'](num_classes=NCLASS, pretrained=0)
device = torch.device('cuda')
model.to(device)

'''
# can see all_the_outs[0] and [1] are the same
all_the_outs = []
model.eval()
for h in range(2):
    all_outs = []
    for i in range(10):
        images = torch.full([1, 3, 540, 1920], i/100)
        outputs = model(images)
        all_outs.append(outputs)
    all_the_outs.append(all_outs)
'''
model.train()

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params, lr=0.00001, momentum=0.9, weight_decay=1e-4)

nsteps = 100
for i in range(nsteps):
    images = torch.full([1, 3, 540, 1920], i/100)
    images = list(image.to(device) for image in images)
    targets = []
    for j in range(len(images)):
        targets.append(generate_target(j))
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    loss_dict = model(images, targets)

    losses = sum(loss for loss in loss_dict.values())

    # reduce losses over all GPUs for logging purposes
    # loss_dict_reduced = utils.reduce_dict(loss_dict)
    loss_dict_reduced = reduce_dict(loss_dict)
    losses_reduced = sum(loss for loss in loss_dict_reduced.values())

    loss_value = losses_reduced.item()

    if not math.isfinite(loss_value):
        print("Loss is {}, stopping training".format(loss_value))
        print(loss_dict_reduced)
        sys.exit(1)

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    
    '''
    if lr_scheduler is not None:
        lr_scheduler.step()
    '''
print('done')

This works fine and prints done. However, as soon as I uncomment the lines about deterministic behavior I get errors.

Torch nightly error:

$ CUDA_VISIBLE_DEVICES=0 python simple_train.py
/home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and
subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
  what():  linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor8761
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f0dfecd1302 in /home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f0dfeccdc9b in /home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f0dfecce18e in /home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f0e00c5c268 in /home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f0e42b6e423 in /home/austin/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

Aborted (core dumped)

torch 1.8.1 complains about CUBLAS_WORKSPACE_CONFIG even though it is set:

(tch18) $ CUBLAS_WORKSPACE_CONFIG=:16:8
(tch18) $ CUDA_VISIBLE_DEVICES=1 python simple_train.py 
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
Traceback (most recent call last):
  File "simple_train.py", line 107, in <module>
    loss_dict = model(images, targets)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 98, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 753, in forward
    box_features = self.box_head(box_features)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torchvision/models/detection/faster_rcnn.py", line 258, in forward
    x = F.relu(self.fc6(x))
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/austin/.virtualenvs/tch18/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
(tch18) austin@skywalker:/media/ssd1/austin/invert_counter/detection$ echo CUBLAS_WORKSPACE_CONFIG
CUBLAS_WORKSPACE_CONFIG
(tch18) $ echo $CUBLAS_WORKSPACE_CONFIG
:16:8
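
One thing I’m not sure about: CUBLAS_WORKSPACE_CONFIG=:16:8 on its own line only creates a shell variable and does not export it to child processes, so the Python process may never see it (echo still prints it because it reads the shell variable). Exporting it, prefixing the python command with it, or setting it from Python before anything touches CUDA should avoid that. A sketch of the in-Python variant (an assumption; exporting the variable in the shell is the documented approach):

import os
# assumption: must be set before the first CUDA/cuBLAS call in the process
os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')

import torch
torch.use_deterministic_algorithms(True)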

Let me know if you are able to reproduce these results or if I’m still missing something.

Hi, I just wanted to check in and see if anyone has been able to reproduce this error.

Your current code raises a different error, so I wasn’t able to reproduce the cuBLAS issue with it:

RuntimeError: index_put_ with accumulate=False does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

However, this simple code raises the same error message:

import torch

torch.use_deterministic_algorithms(True)
x = torch.randn(1024, 1024, device='cuda')
y = torch.matmul(x, x)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

and works fine with the env variable:

CUBLAS_WORKSPACE_CONFIG=:16:8 python script.py 
/opt/pytorch/pytorch/torch/__init__.py:470: UserWarning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (Triggered internally at  ../aten/src/ATen/Context.cpp:69.)
  _C._set_deterministic_algorithms(mode)

Hi, I have a similar issue which I posted here: Can't achive reproducability / determinism in pytorch training - #8 by Ecem_sogancioglu

I use the PyTorch FasterRCNN implementation. I updated PyTorch to the nightly release and set torch.use_deterministic_algorithms(True). I also set the environment variable CUBLAS_WORKSPACE_CONFIG=:16:8; however, the results are still not reproducible. When training on the CPU, the results are reproducible.