BatchNorm and ReLU

Model(
  (multi_tdnn): MultiTDNN(
    (multi_tdnn): Sequential(
      (0): TDNN(
        (conv): Conv1d(60, 512, kernel_size=(5,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(2,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(3,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (3): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (4): TDNN(
        (conv): Conv1d(512, 1500, kernel_size=(1,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
  (stats_pool): StatsPool()
  (linear1): Linear(in_features=3000, out_features=512, bias=True)
  (linear2): Linear(in_features=512, out_features=1739, bias=True)
  (nonlinearity): ReLU()
)

I am using the above architecture for my experiments, but it gives me NaN values. From my analysis, I think the ReLU and BatchNorm combination in the TDNN layer is causing the problem.

Conv1d --> ReLU --> BatchNorm1d <== Gives NaN
Conv1d --> Tanh --> BatchNorm1d <== Working Perfectly
Conv1d --> ReLU --> Dropout <== Gives NaN
Conv1d --> Tanh --> Dropout <== Working Perfectly
Conv1d --> ReLU --> BatchNorm1d --> Dropout <== Working Perfectly

Can someone please guide me on this issue?

ReLU might be zeroing many of your activations, leading to very low variance in the activations, which in turn leads to NaNs (due to division by a near-zero value). Check the way you are scaling the input data and/or try another activation function (e.g. leaky ReLU).
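
For example, a minimal sketch of one block with leaky ReLU swapped in (layer sizes taken from your first TDNN block):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv1d(60, 512, kernel_size=5),
    nn.LeakyReLU(negative_slope=0.01),  # keeps a small gradient for negative inputs
    nn.BatchNorm1d(512),
)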

@dpernes, Thank you for your reply.
I also thought that the zeros produced by ReLU could lead to division by zero. But then I checked the PyTorch implementation of BatchNorm1d, and I can see that they add eps to the variance to overcome this.
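
That is, the per-channel normalization is roughly the following (ignoring the affine parameters; x is a (batch, channels, frames) tensor):

eps = 1e-5
mean = x.mean(dim=(0, 2), keepdim=True)                # per-channel mean over batch and time
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)  # per-channel variance
y = (x - mean) / (var + eps).sqrt()                    # eps keeps the denominator away from zero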

Could you give us some information about your setup?
I.e. which PyTorch, CUDA, cudnn versions are you using?
Are you seeing these NaNs on the CPU, GPU or both?
Are you able to reproduce the NaNs quickly (in a couple of iterations)?

@ptrblck Thank you for your reply.

Below are the details of my configuration.
PyTorch --> 1.4.0
CUDA --> 10.2

I am not able to locate cuDNN. If cuDNN comes bundled with CUDA, then it might be installed, but I haven't installed it myself.
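
For reference, the versions can be checked from within PyTorch itself, since the binary packages bundle their own cuDNN:

import torch
print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version the binaries were built with
print(torch.backends.cudnn.version())  # cuDNN version shipped with the binaries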

Yes, I have checked on both GPU and CPU; I am getting NaNs after just 5 to 10 iterations (i.e. 5 to 10 batches).

Could you check the inputs for NaNs and Infs, please?
I assume the NaNs are returned during training?
Could you store the inputs with the model’s state_dict, so that we could reproduce it in a single iteration?
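
Something along these lines would work (x, target, and model stand in for your actual batch, labels, and model):

# Check a batch for non-finite values
print(torch.isnan(x).any(), torch.isinf(x).any())

# Store the failing batch together with the model state for reproduction
torch.save({'input': x, 'target': target, 'state_dict': model.state_dict()},
           'nan_repro.pt')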

Might sound silly, but try reducing the learning rate. Also try gradient clipping:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

@ptrblck, Thank you for reply.

Could you check the inputs for NaNs and Infs, please?
I assume the NaNs are returned during training?

Yes, the NaNs appear during training. I have already trained my model with the Conv1d → ReLU → BatchNorm → Dropout setup for the TDNN block for 6 epochs without any problem. I have also successfully trained another LSTM-based architecture on the same data, so I believe there are no NaNs or Infs in the inputs.

Could you store the inputs with the model’s state_dict , so that we could reproduce it in a single iteration?

You want a script and sample data from my side so you can reproduce the issue, am I correct?

Yes, that is correct.
We have been seeing more NaN issues recently; they could all be solved so far, but it's still a red flag when these issues are posted frequently, so I would like to make sure we are not missing an internal bug.

Sure. I will get back to you with scripts and data asap. Thank You.

@braindotai, thank you for the suggestion.
I tried clip_grad_norm_, but it doesn't solve the problem.

My pleasure. Another suggestion, though: make sure you have normalized the data.
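
For instance (feat_mean and feat_std here are placeholders for statistics computed on your training set):

# Standardize each feature dimension using training-set statistics
x = (x - feat_mean) / (feat_std + 1e-8)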

@ptrblck I have created a repository on GitHub which contains the script and sample data: code

Below are some basic details about my architecture.
The architecture is for multiclass classification; I have already posted the full architecture above.
It consists of TDNN layers which work at the frame level (the input data is of shape (200, 20)). The StatsPool layer computes the mean and std, and the concatenation of both gives a single 3000-dimensional vector. Finally, there are three Linear layers, where the last linear layer has 1739 outputs, as I have that many classes. I use CrossEntropyLoss as the criterion.
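
For clarity, the pooling step does roughly the following (a rough sketch; the actual implementation is in the repository):

import torch
import torch.nn as nn

class StatsPool(nn.Module):
    # Pools frame-level features of shape (batch, 1500, frames) into one
    # utterance-level vector by concatenating mean and std over time,
    # giving the 3000-dimensional output.
    def forward(self, x):
        return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)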

Please let me know if you have any query regarding architecture and data I have used.

Thanks for the code!
I can reproduce the NaNs and have isolated the issue to an Inf gradient after the std operation.
I'm not sure what's causing it at the moment, but I will debug further.
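
In case it helps others debug similar issues, anomaly detection points at the forward op whose backward produced the non-finite gradient:

# Raises an error naming the offending op during backward
torch.autograd.set_detect_anomaly(True)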

@ptrblck, Thank you for your time.
If you add a Dropout layer after BatchNorm1d, i.e. Conv1d --> ReLU --> BatchNorm1d --> Dropout, it works like a charm.
I have also checked PReLU as the activation function (Conv1d --> PReLU --> BatchNorm1d), which again works perfectly; only ReLU is causing the problem. I have successfully trained the model with this setup.

I hope this might help you to debug further.


@ptrblck, any updates on this issue? Thank you.

Thanks for pinging. Not yet, I’ll try to take a look at it later this evening.

Sure. Please take your time.

Because of the square root, the derivative of std is infinite at std = 0 (the derivative of sqrt(x) is 1/(2*sqrt(x))). I would recommend working around it by using a backward hook and torch.where, or you could spell out the std and regularize before the square root.
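
A minimal sketch of the second option (stable_std is a hypothetical drop-in for torch.std inside the pooling layer; eps is a small regularizer):

def stable_std(x, dim=-1, eps=1e-5):
    # Spell out the std and add eps before the square root so the gradient
    # stays finite even when all frames in a channel are identical.
    # (Biased variance; torch.std uses the unbiased estimate by default.)
    var = (x - x.mean(dim=dim, keepdim=True)).pow(2).mean(dim=dim)
    return (var + eps).sqrt()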

Best regards

Thomas


@tom is correct. The std op indeed yields a zero output in the forward pass, which then produces the Inf gradients.

EDIT: This line in particular raises the issue.