BatchNorm and ReLU

Model(
  (multi_tdnn): MultiTDNN(
    (multi_tdnn): Sequential(
      (0): TDNN(
        (conv): Conv1d(60, 512, kernel_size=(5,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(2,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(3,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (3): TDNN(
        (conv): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (4): TDNN(
        (conv): Conv1d(512, 1500, kernel_size=(1,), stride=(1,))
        (nonlinearity): ReLU()
        (bn): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
  (stats_pool): StatsPool()
  (linear1): Linear(in_features=3000, out_features=512, bias=True)
  (linear2): Linear(in_features=512, out_features=1739, bias=True)
  (nonlinearity): ReLU()
)

I am using the above architecture for my experiments, but it gives me NaN values. From my analysis, I think the ReLU and BatchNorm combination in the TDNN layer is causing the problem.

Conv1d --> ReLU --> BatchNorm1d <== Gives NaN
Conv1d --> Tanh --> BatchNorm1d <== Working Perfectly
Conv1d --> ReLU --> Dropout <== Gives NaN
Conv1d --> Tanh --> Dropout <== Working Perfectly
Conv1d --> ReLU --> BatchNorm1d --> Dropout <== Working Perfectly

Can someone please guide me on this issue?

ReLU might be zeroing many of your activations, leading to very low variance in the activations, which in turn leads to NaNs (due to division by a near-zero value). Check the way you are scaling the input data and/or try another activation function (e.g. leaky ReLU).
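
For example, a minimal sketch of one block with leaky ReLU swapped in (layer sizes taken from your first TDNN block):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv1d(60, 512, kernel_size=5),
    nn.LeakyReLU(negative_slope=0.01),  # keeps a small gradient for negative inputs
    nn.BatchNorm1d(512),
)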

@dpernes, Thank you for your reply.
I also thought that the zeros produced by ReLU could lead to division by zero. But then I checked the PyTorch implementation of BatchNorm1d, and I can see that they add eps to the variance to overcome this.
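
That is, the per-channel normalization is roughly the following (ignoring the affine parameters; x is a (batch, channels, frames) tensor):

eps = 1e-5
mean = x.mean(dim=(0, 2), keepdim=True)                # per-channel mean over batch and time
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)  # per-channel variance
y = (x - mean) / (var + eps).sqrt()                    # eps keeps the denominator away from zero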

Could you give us some information about your setup?
I.e. which PyTorch, CUDA, cudnn versions are you using?
Are you seeing these NaNs on the CPU, GPU or both?
Are you able to reproduce the NaNs quickly (in a couple of iterations)?

@ptrblck Thank you for your reply.

Below are the details of my configuration.
PyTorch --> 1.4.0
CUDA --> 10.2

I am not able to locate cuDNN. If cuDNN comes bundled with CUDA, then it might be installed, but I haven't installed it myself.
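
For reference, the versions can be checked from within PyTorch itself, since the binary packages bundle their own cuDNN:

import torch
print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version the binaries were built with
print(torch.backends.cudnn.version())  # cuDNN version shipped with the binaries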

Yes, I have checked on both GPU and CPU; I am getting NaNs after just 5 to 10 iterations (i.e. 5 to 10 batches).

Could you check the inputs for NaNs and Infs, please?
I assume the NaNs are returned during training?
Could you store the inputs with the model’s state_dict, so that we could reproduce it in a single iteration?
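
Something along these lines would work (x, target, and model stand in for your actual batch, labels, and model):

# Check a batch for non-finite values
print(torch.isnan(x).any(), torch.isinf(x).any())

# Store the failing batch together with the model state for reproduction
torch.save({'input': x, 'target': target, 'state_dict': model.state_dict()},
           'nan_repro.pt')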

Might sound silly, but try reducing the learning rate. Also try gradient clipping:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

@ptrblck, Thank you for reply.

Could you check the inputs for NaNs and Infs, please?
I assume the NaNs are returned during training?

Yes, the NaNs appear during training. I have already trained my model with the Conv1d → ReLU → BatchNorm → Dropout setup for the TDNN block for 6 epochs without any problem. I have also successfully trained another LSTM-based architecture on the same data, so I believe there are no NaNs or Infs in the inputs.

Could you store the inputs with the model’s state_dict , so that we could reproduce it in a single iteration?

You want a script and sample data from my side so you can reproduce the issue, am I correct?

Yes, that is correct.
We have been seeing more NaN issues recently; they could all be solved so far, but it's still a red flag when these issues are posted frequently, so I would like to make sure we are not missing an internal bug.

Sure. I will get back to you with scripts and data asap. Thank You.

@braindotai, thank you for the suggestion.
I tried clip_grad_norm_, but it doesn't solve the problem.

My pleasure. Another suggestion, though: make sure you have normalized the data.
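
For instance (feat_mean and feat_std here are placeholders for statistics computed on your training set):

# Standardize each feature dimension using training-set statistics
x = (x - feat_mean) / (feat_std + 1e-8)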

@ptrblck I have created a repository on GitHub which contains the script and sample data: code

Below are some basic details about my architecture.
The architecture is for multiclass classification; I have already posted the full architecture above.
It consists of TDNN layers which work at the frame level (the input data is of shape (200, 20)). The StatsPool layer computes the mean and std, and the concatenation of both gives a single 3000-dimensional vector. Finally, there are three Linear layers, where the last linear layer has 1739 outputs, as I have that many classes. I use CrossEntropyLoss as the criterion.
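
For clarity, the pooling step does roughly the following (a rough sketch; the actual implementation is in the repository):

import torch
import torch.nn as nn

class StatsPool(nn.Module):
    # Pools frame-level features of shape (batch, 1500, frames) into one
    # utterance-level vector by concatenating mean and std over time,
    # giving the 3000-dimensional output.
    def forward(self, x):
        return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)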

Please let me know if you have any query regarding architecture and data I have used.

Thanks for the code!
I can reproduce the NaNs and have isolated the issue to an Inf gradient after the std operation.
I'm not sure what's causing it at the moment, but I will debug further.
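
In case it helps others debug similar issues, anomaly detection points at the forward op whose backward produced the non-finite gradient:

# Raises an error naming the offending op during backward
torch.autograd.set_detect_anomaly(True)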

@ptrblck, Thank you for your time.
If you add a Dropout layer after BatchNorm1d, i.e. Conv1d --> ReLU --> BatchNorm1d --> Dropout, it works like a charm.
I have also checked PReLU as the activation function (Conv1d --> PReLU --> BatchNorm1d), which again works perfectly; only ReLU is causing the problem. I have successfully trained the model with this setup.

I hope this might help you to debug further.


@ptrblck, any updates on this issue? Thank you.

Thanks for pinging. Not yet, I’ll try to take a look at it later this evening.

Sure. Please take your time.

Because of the square root, the derivative of std is infinite at std = 0 (the derivative of sqrt(x) is 1/(2*sqrt(x))). I would recommend working around it by using a backward hook and torch.where, or you could spell out the std and regularize before the square root.
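
A minimal sketch of the second option (stable_std is a hypothetical drop-in for torch.std inside the pooling layer; eps is a small regularizer):

def stable_std(x, dim=-1, eps=1e-5):
    # Spell out the std and add eps before the square root so the gradient
    # stays finite even when all frames in a channel are identical.
    # (Biased variance; torch.std uses the unbiased estimate by default.)
    var = (x - x.mean(dim=dim, keepdim=True)).pow(2).mean(dim=dim)
    return (var + eps).sqrt()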

Best regards

Thomas


@tom is correct. The std op indeed yields a zero output in the forward pass, which then produces the Inf gradients.

EDIT: This line in particular raises the issue.