Non-deterministic training (gradient update) in Resnet

Hello, I am implementing a 3D ResNet model in PyTorch but constantly get non-deterministic results.
I do have all seeds set up:
def seed_torch(seed=123):
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

def _init_fn(worker_id):

I also switched the model to AlexNet, and then the result is deterministic, so there must be some function inside ResNet that causes this issue.

I am using the 3D ResNet from:

I tried both Wide ResNet and ResNet, at depths 18 and 50; all have the same issue.

Probably the backward of the downsampling:

Seems to be one of the things everyone wants to have but nobody takes the time to do.

Best regards


Do you mean in shortcut A, the down_sample_basic_block?
I used shortcut B, in which the downsample is basically a conv + batchnorm.

Did you click on the link?

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.

From a cursory reading of the code, you use pooling.

Best regards


Yes, I did read that, although it is very vague…
But I also use 3D max pooling / average pooling in AlexNet, which is fine, so I figure pooling is not the issue.

To add more info:
for the 2D models I worked with before, ResNet had no such issue.

The implementations can differ, both for the networks on top of the ops and for the ops themselves. For example, 3D average pooling seems to use atomicAdd for cases where the windows can overlap, so something like

self.avgpool = nn.AvgPool3d((last_duration, last_size, last_size), stride=1)

is suspect.

Best regards


Thank you Tom, I tried substituting this with MaxPool3d but it is still non-deterministic…
However, MaxPool3d is deterministic in my AlexNet.
I will continue to explore other layers' behavior.
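To explore layer by layer systematically rather than swapping modules one at a time, you can fingerprint every parameter's gradient across repeated identical runs and list the ones that differ. This is a sketch with a toy model standing in for the 3D ResNet (both helper names are illustrative); on CUDA, layers whose backward uses atomicAdd would show up in the mismatch list:

```python
import torch
import torch.nn as nn

def grad_fingerprints(model, x, seed=0):
    # One forward/backward pass; record a checksum per parameter gradient.
    torch.manual_seed(seed)
    model.zero_grad()
    model(x.clone().requires_grad_(True)).sum().backward()
    return {name: p.grad.double().sum().item()
            for name, p in model.named_parameters() if p.grad is not None}

def nondeterministic_layers(model, x, trials=3):
    # Compare checksums across repeated identical runs; any mismatch
    # points at a layer with a non-deterministic backward.
    runs = [grad_fingerprints(model, x) for _ in range(trials)]
    return [name for name in runs[0]
            if any(runs[0][name] != r[name] for r in runs[1:])]

# Toy stand-in; substitute the actual 3D ResNet and a real input batch.
model = nn.Sequential(nn.Conv3d(1, 2, 3), nn.ReLU(), nn.AvgPool3d(2))
x = torch.randn(1, 1, 8, 8, 8)
print(nondeterministic_layers(model, x))  # empty when all backwards match
```

The exact-equality comparison is intentional: with identical seeds and inputs, a deterministic backward reproduces gradients bit-for-bit, so any difference at all is a real signal.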