Applying mask caused NaN grad

I am working on a text matching task that needs an interaction grid S, where each element is S_{ij} = cossim(x_i, y_j).
x and y are extracted embeddings with:
x.size() = (BatchSize, xLen, emb_dim),
y.size() = (BatchSize, yLen, emb_dim).
To compute the batched cosine similarity and obtain S of size (BS, xLen, yLen), I wrote:

import torch

def cossim(X, Y):
    """ calculate the cos similarity between X and Y: cos(X, Y)
        X: tensor (BS, x_len, hidden_size)
        Y: tensor (BS, y_len, hidden_size)
        returns: (BS, x_len, y_len)
    """
    X_norm = torch.sqrt(torch.sum(X ** 2, dim=2)).unsqueeze(2)  # (BS, x_len, 1)
    Y_norm = torch.sqrt(torch.sum(Y ** 2, dim=2)).unsqueeze(1)  # (BS, 1, y_len)
    S = torch.bmm(X, Y.transpose(1, 2)) / (X_norm * Y_norm + 1e-5)
    return S

I know that cossim is not differentiable at (0, 0), so I initialized the padding embedding to a random non-zero vector.
In the inputs x and y (both wrapped in Variable), the last several words are padding, so I apply a mask to zero out the padding vectors:

x_mask = torch.ne(x, 0).unsqueeze(2).float()  # since x is a Variable, x_mask is also a Variable
y_mask = torch.ne(y, 0).unsqueeze(2).float()
x = x * x_mask
y = y * y_mask

After one backward pass, parts of x and y become NaN.
If I don’t apply the mask, there is no NaN problem:

x = ...
y = ...

I tried two ways of detaching x_mask and y_mask so that they are treated as constants and cannot interfere with the gradients:
(1) x and y are Variables: extract the data, generate the mask, wrap the mask in a Variable, and apply it

x_mask = torch.ne(x, 0).unsqueeze(2).float()  # since x is a Variable, x_mask is also a Variable
y_mask = torch.ne(y, 0).unsqueeze(2).float()
x_mask = Variable(x_mask, requires_grad=False)
y_mask = Variable(y_mask, requires_grad=False)
x = x * x_mask
y = y * y_mask

(2) x and y are Variables: use x_mask = x_mask.detach()

x_mask = torch.ne(x, 0).unsqueeze(2).float()  # since x is a Variable, x_mask is also a Variable
y_mask = torch.ne(y, 0).unsqueeze(2).float()
x_mask = x_mask.detach()
y_mask = y_mask.detach()
x = x * x_mask
y = y * y_mask

Neither helped: both still produced NaN in x and y after one backward pass.
When I remove the multiplication by the mask, everything is fine.
Is the problem in cossim or in the mask? How can I achieve this?

torch.ne has gradient zero almost everywhere, and its gradient is undefined when x == 0. In practice, if x == 0, PyTorch returns 0 as the gradient of torch.ne. Therefore detaching x_mask is not useful.

x * x_mask is basically an identity mapping for some elements of x in which case the gradients flow through unmodified, or a zero mapping in which case the gradients are blocked.
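
To make the point about identity and zero mappings concrete, here is a minimal sketch. The toy tensor and its values are made up for illustration and are not taken from the model in this thread:

```python
import torch

# Hypothetical toy input: 3 "tokens" with 2 features each.
# The last row plays the role of a padding position.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], requires_grad=True)
x_mask = torch.tensor([[1.0], [1.0], [0.0]])  # 1 = keep, 0 = padding

masked = x * x_mask
masked.sum().backward()

# Rows with mask 1 receive the upstream gradient unchanged,
# rows with mask 0 receive a gradient of exactly zero.
print(x.grad)
```

So the multiplication on its own only passes or blocks gradients; it cannot create a NaN unless the upstream gradient already contains one.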

I can’t see why you might get NaN with the mask and not without. Could you post some more code?

Dear jpeg729,
Thanks for the help. Here is the code:

def forward(self, q, d_pos, d_neg):
        """ apply rel
            q: LongTensor (BS, qlen) Variable input
            d: LongTensor (BS, dlen) Variable input
            returns R1, R2, R3: relevance of 3 level (BS,)
        """
        q_mask = torch.ne(q, 0).unsqueeze(2).float()  # (BS, qlen, 1)
        q_mask = q_mask.detach()
        d_pos_mask = torch.ne(d_pos, 0).unsqueeze(2).float()  # (BS, dlen, 1)
        d_pos_mask = d_pos_mask.detach()
        d_neg_mask = torch.ne(d_neg, 0).unsqueeze(2).float()  # (BS, dlen, 1)
        d_neg_mask = d_neg_mask.detach()
        q_emb = self.emb_mod(q) * q_mask  # (BS, qlen, emb_size)
        d_pos_emb = self.emb_mod(d_pos) * d_pos_mask  # (BS, dlen, emb_size)
        d_neg_emb = self.emb_mod(d_neg) * d_neg_mask  # (BS, dlen, emb_size)
        # do convs
        q_conved1 = self.q_conv1(q_emb.transpose(1, 2))  # (BS, hs, qLen1)
        q_conved2 = self.q_conv2(q_conved1)  # (BS, hs, 1)
        d_pos_conved1 = self.d_conv1(d_pos_emb.transpose(1, 2))  # (BS, hs, dLen1)
        d_pos_conved2 = self.d_conv2(d_pos_conved1)  # (BS, hs, 1)
        d_neg_conved1 = self.d_conv1(d_neg_emb.transpose(1, 2))  # (BS, hs, dLen1)
        d_neg_conved2 = self.d_conv2(d_neg_conved1)  # (BS, hs, 1)
        # interaction matrices
        if self.sim_type == "Dot" or self.sim_type == "Cos":
            interact_pos_mat1 = self.sim(q_emb, d_pos_emb)  # (BS, qlen, dlen)
            interact_pos_mat2 = self.sim(q_conved1.transpose(1, 2),
                                         d_pos_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
            interact_neg_mat1 = self.sim(q_emb, d_neg_emb)  # (BS, qlen, dlen)
            interact_neg_mat2 = self.sim(q_conved1.transpose(1, 2),
                                         d_neg_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
        # ..... some ops on the interaction matrices
        # to get scores S_pos, S_neg
        return S_pos, S_neg

I really do need these masks. I randomly initialized the PAD embedding vector, and since it is also learnable, I must mask it to the zero vector before computing the cosine interaction. Otherwise there will be a non-zero interaction signal cos(PAD, PAD), which adds a lot of noise to my model.

Dear Jpeg729,
I enclosed some screenshots from terminal:
with mask d_pos_emb before first update
d_pos_emb Variable containing:
( 0 ,.,.) =
-0.1051 0.1341 0.1384 … -0.4195 0.0894 0.1757
0.2043 0.6410 0.2800 … 0.3296 0.9028 0.0257
-0.3194 0.1362 -0.0486 … -0.0811 0.3453 0.1107
… ⋱ …
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
with mask d_pos_emb after first update

without mask d_pos_emb before first update
d_pos_emb Variable containing:
( 0 ,.,.) =
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
… ⋱ …
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
without mask d_pos_emb after first update
( 0 ,.,.) =
-2.3951e-01 -3.3803e-01 -2.6433e-01 … 3.2791e-01 3.6259e-02 2.3877e-01
-2.0393e-01 1.3825e-01 5.7552e-02 … 2.7366e-01 -5.1705e-01 3.0394e-01
-5.5689e-01 1.9376e-01 -3.0927e-01 … 3.2666e-01 -1.9124e-01 -2.3107e-01
… ⋱ …
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01

It looks like you calculate the cosine similarity after applying the mask, could that be the source of the NaNs?
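
A small reproduction of this suspicion, as a sketch only: it reuses the cossim function posted at the top of the thread, while the tensor shapes, seed, and mask below are invented for illustration. For a row that has been masked to all zeros, torch.sqrt receives 0, and its backward divides a zero gradient by zero.

```python
import torch

def cossim(X, Y):
    # Same computation as the custom function from the first post.
    X_norm = torch.sqrt(torch.sum(X ** 2, dim=2)).unsqueeze(2)  # (BS, x_len, 1)
    Y_norm = torch.sqrt(torch.sum(Y ** 2, dim=2)).unsqueeze(1)  # (BS, 1, y_len)
    return torch.bmm(X, Y.transpose(1, 2)) / (X_norm * Y_norm + 1e-5)

torch.manual_seed(0)
x = torch.randn(1, 3, 4, requires_grad=True)    # (BS, x_len, emb_dim), toy sizes
y = torch.randn(1, 2, 4, requires_grad=True)    # (BS, y_len, emb_dim)
x_mask = torch.tensor([[[1.0], [1.0], [0.0]]])  # last x position is padding

S = cossim(x * x_mask, y)
S.sum().backward()

# The forward values are all finite, but the gradient of the
# masked (all-zero) row comes back as NaN from the sqrt at 0.
print(torch.isnan(S).any())
print(torch.isnan(x.grad[0]).any(dim=1))
```

The 1e-5 in the denominator protects the division in the forward pass, but it does nothing for the square root itself, which sits before it in the graph.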

Dear Jpeg729,
Yes, my goal is to use the masked embedding sequences to compute the cosine similarity grid,
so that the interaction grid S(x_i, y_j) has 0s at the (PAD, PAD) positions.
Maybe I should apply a 2D mask to the interaction grid S instead, for example by multiplying it with a 2D mask?
But that is more complicated, and S * S_mask would still have the same problem of a non-differentiable mask…
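
For reference, a 2D mask of this kind could be sketched as below. The token ids, the padding index 0, and the random stand-in for S are assumptions made for the example. Because the mask is built from integer ids, it is a plain constant in the graph and needs no detaching.

```python
import torch

# Hypothetical token ids, assuming 0 is the padding index.
x_ids = torch.tensor([[5, 7, 0]])  # (BS, x_len)
y_ids = torch.tensor([[3, 0]])     # (BS, y_len)

x_mask = torch.ne(x_ids, 0).float()  # (BS, x_len)
y_mask = torch.ne(y_ids, 0).float()  # (BS, y_len)

# Outer product per batch element: 1 only where both tokens are real.
S_mask = x_mask.unsqueeze(2) * y_mask.unsqueeze(1)  # (BS, x_len, y_len)

# Stand-in for an interaction grid computed on the unmasked embeddings.
S = torch.rand(1, 3, 2, requires_grad=True)
S_masked = S * S_mask
S_masked.sum().backward()

print(S_mask)
print(S.grad)  # equals S_mask: gradients are blocked only at padded cells
```

Since the similarity is computed on the unmasked (non-zero) PAD embeddings here, no zero-norm vector ever reaches the square root.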

The problem is resolved: I replaced my custom cosine similarity function with PyTorch's F.normalize() and then simply took the dot product of the normalized x and y. There may have been a numerical stability issue in my function, even though I was careful to add 1e-12 to the denominators…
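
A minimal sketch of what that replacement might look like. The function name cossim_normalized, the toy shapes, and the mask are assumptions, not the poster's actual code. F.normalize divides each vector by max(norm, eps), so an all-zero row stays zero instead of hitting a division by zero.

```python
import torch
import torch.nn.functional as F

def cossim_normalized(X, Y, eps=1e-12):
    """Batched cosine similarity via L2 normalization.
    X: (BS, x_len, hidden), Y: (BS, y_len, hidden) -> (BS, x_len, y_len)
    """
    X_unit = F.normalize(X, p=2, dim=2, eps=eps)
    Y_unit = F.normalize(Y, p=2, dim=2, eps=eps)
    return torch.bmm(X_unit, Y_unit.transpose(1, 2))

torch.manual_seed(0)
x = torch.randn(2, 3, 4, requires_grad=True)
y = torch.randn(2, 5, 4, requires_grad=True)
x_mask = torch.tensor([1.0, 1.0, 0.0]).view(1, 3, 1)  # last x position is padding

S = cossim_normalized(x * x_mask, y)
S.sum().backward()

print(S.shape)                    # (BS, x_len, y_len)
print(torch.isnan(x.grad).any())  # no NaN expected on recent PyTorch versions
```

In recent PyTorch versions, as far as I can tell, the backward of the norm treats the zero-norm case explicitly, which is why the masked rows get a zero gradient rather than NaN.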


I have the following function:

func = torch.sqrt((dudx - dudy)**2 + (dudy - dudx)**2)
where dudx and dudy are the derivatives of U with respect to the inputs x and y.
Since the derivative of the square root at zero is infinite, I want to replace the values of func that are less than or equal to 0 with 0.1, so I applied:
func[(func<=0.0).detach()] = 0.1
These are the func values I got:

> tensor([1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
>         1.2064e-05, 4.1480e-06, 5.6752e-06, 1.0000e-01, 1.0000e-01, 9.2189e-06,
>         3.1551e-06, 4.3206e-06, 1.0000e-01, 1.0000e-01, 7.0325e-06, 2.4059e-06,
>         3.3041e-06, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
>         1.0000e-01], grad_fn=<IndexPutBackward>)

Now if I take the derivative of func with respect to the inputs, I get NaN at the imputed positions:

compute_grad(func, inputs)[:,0].reshape(nx,ny)

tensor([[        nan,         nan,         nan,         nan,         nan],
        [        nan, -1.7759e-05, -6.2834e-06, -8.6144e-06,         nan],
        [        nan, -1.3896e-05, -4.7949e-06, -6.5147e-06,         nan],
        [        nan, -1.0686e-05, -3.6399e-06, -4.9337e-06,         nan],
        [        nan,         nan,         nan,         nan,         nan]],

Can someone help me overcome this? I want the derivative of func without getting NaN. Any help is appreciated. @smth, @ptrblck could you please have a look?
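
One common workaround, sketched here under assumptions: safe_sqrt is a hypothetical helper, and the toy expression below only stands in for the dudx/dudy terms. Overwriting func after the fact does not help, because torch.sqrt has already been evaluated at 0 and its backward still computes 0 / 0. Replacing the zeros before the square root avoids that.

```python
import torch

def safe_sqrt(sq, fill=0.1):
    """sqrt(sq) where sq > 0, and the constant `fill` elsewhere.
    The sqrt only ever sees strictly positive inputs, so its backward stays finite.
    """
    positive = sq > 0
    safe_sq = torch.where(positive, sq, torch.ones_like(sq))  # dummy 1.0 at zeros
    return torch.where(positive, torch.sqrt(safe_sq), torch.full_like(sq, fill))

# Toy stand-in for (dudx - dudy)**2 + (dudy - dudx)**2; it is zero at a == 1.
a = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)
sq = (a - 1.0) ** 2 + (1.0 - a) ** 2
func = safe_sqrt(sq)
func.sum().backward()

print(func)    # the middle entry is the imputed 0.1
print(a.grad)  # finite everywhere, 0 at the imputed position
```

The imputed entries are constants, so their gradient is 0 rather than NaN; whether a zero gradient is acceptable there depends on what func is used for.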

I’m having the same problem while trying to implement variational dropout, where the same mask is reused over and over.
I tried mask.detach() and also Variable(mask, requires_grad=False). I even tried cloning the mask on every forward pass, but I still get NaNs after a few iterations…
See my post for more information: implementing-variational-dropout-cause-nan-values.