Applying mask caused NaN grad

nyfbb · March 26, 2018, 3:47am

I am working on a text matching task that needs an interaction grid S, where each element is S_{ij} = cossim(x_i, y_j).
x and y are the extracted embeddings, with sizes:
x.size() = (BatchSize, xLen, emb_dim),
y.size() = (BatchSize, yLen, emb_dim).
To compute the batched cosine similarity and obtain S of size (BS, xLen, yLen), I wrote:

def cossim(X, Y):
    """ calculate the cos similarity between X and Y: cos(X, Y)
        X: tensor (BS, x_len, hidden_size)
        Y: tensor (BS, y_len, hidden_size)
        returns: (BS, x_len, y_len)
    """
    X_norm = torch.sqrt(torch.sum(X ** 2, dim=2)).unsqueeze(2)  # (BS, x_len, 1)
    Y_norm = torch.sqrt(torch.sum(Y ** 2, dim=2)).unsqueeze(1)  # (BS, 1, y_len)
    S = torch.bmm(X, Y.transpose(1,2)) / (X_norm * Y_norm + 1e-5)
    return S

I know that cossim is not differentiable at (0, 0), so I initialized the padding embedding to a random (non-zero) vector.
In the inputs x and y (both wrapped in Variable), the last several words are padding, so I must apply a mask to filter out the padding vectors:

x_mask = torch.ne(x, 0).unsqueeze(2).float() # since x is packed as Variable, x_mask is also a Variable
y_mask = torch.ne(y, 0).unsqueeze(2).float()
x = x * x_mask
y = y * y_mask

After one backprop step, parts of x and y become NaN.
If I don’t apply the mask, there is no NaN problem:

x = ...
y = ...

I tried two ways of detaching x_mask and y_mask, so that they are treated as non-learnable constants and cannot mess up the gradients:
(1) x and y are Variables; extract x.data and y.data, build the masks from them, wrap the masks in Variable, and apply them:

x_mask = torch.ne(x.data, 0).unsqueeze(2).float() # x.data is a plain tensor, so x_mask is a tensor here
y_mask = torch.ne(y.data, 0).unsqueeze(2).float()
x_mask = Variable(x_mask, requires_grad=False)
y_mask = Variable(y_mask, requires_grad=False)
x = x * x_mask
y = y * y_mask

(2) x and y are Variables; build the masks and call detach() on them:

x_mask = torch.ne(x, 0).unsqueeze(2).float() # since x is packed as Variable, x_mask is also a Variable
y_mask = torch.ne(y, 0).unsqueeze(2).float()
x_mask = x_mask.detach()
y_mask = y_mask.detach()
x = x * x_mask
y = y * y_mask

Neither helped: both still produce NaN in x and y after one backprop step.
When I remove the multiplication by the mask, everything is fine.
Is the problem in cossim or in the mask? How can I achieve this?
Thanks

jpeg729 · March 26, 2018, 8:37am

torch.ne has zero gradient almost everywhere and an undefined gradient when x == 0. In practice, if x == 0, PyTorch returns 0 as the gradient of torch.ne. Therefore detaching x_mask is not useful.

x * x_mask is basically an identity mapping for some elements of x in which case the gradients flow through unmodified, or a zero mapping in which case the gradients are blocked.
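A minimal sketch of this point, using made-up toy values: multiplying by a constant 0/1 mask passes the gradient through unchanged where the mask is 1 and blocks it where the mask is 0, so the mask on its own does not create NaNs.

```python
import torch

# Toy input; the values and shapes are made up for illustration.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
mask = torch.tensor([[1.0], [0.0]])  # keep row 0, zero out row 1

y = (x * mask).sum()
y.backward()

# Row 0 receives the unmodified gradient (ones), row 1 receives zeros.
print(x.grad)
```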

I can’t see why you might get NaN with the mask and not without. Could you post some more code?

nyfbb · March 26, 2018, 2:11pm

Dear jpeg729,
Thanks for help:

def forward(self, q, d_pos, d_neg):
        """ apply rel
            q: LongTensor (BS, qlen) Variable input
            d: LongTensor (BS, dlen) Variable input
            returns R1, R2, R3: relevance of 3 level (BS,)
        """
        q_mask = torch.ne(q, 0).unsqueeze(2).float()  # (BS, qlen, 1)
        q_mask = q_mask.detach()
        d_pos_mask = torch.ne(d_pos, 0).unsqueeze(2).float()  # (BS, dlen, 1)
        d_pos_mask = d_pos_mask.detach()
        d_neg_mask = torch.ne(d_neg, 0).unsqueeze(2).float()  # (BS, dlen, 1)
        d_neg_mask = d_neg_mask.detach()
        q_emb = self.emb_mod(q) * q_mask  # (BS, qlen, emb_size)
        d_pos_emb = self.emb_mod(d_pos) * d_pos_mask  # (BS, dlen, emb_size)
        d_neg_emb = self.emb_mod(d_neg) * d_neg_mask  # (BS, dlen, emb_size)
        # do convs
        q_conved1 = self.q_conv1(q_emb.transpose(1, 2))  # (BS, hs, qLen1)
        q_conved2 = self.q_conv2(q_conved1)  # (BS, hs, 1)
        d_pos_conved1 = self.d_conv1(d_pos_emb.transpose(1, 2))  # (BS, hs, dLen1)
        d_pos_conved2 = self.d_conv2(d_pos_conved1)  # (BS, hs, 1)
        d_neg_conved1 = self.d_conv1(d_neg_emb.transpose(1, 2))  # (BS, hs, dLen1)
        d_neg_conved2 = self.d_conv2(d_neg_conved1)  # (BS, hs, 1)
        # interaction matrices
        if self.sim_type == "Dot" or self.sim_type == "Cos":
            interact_pos_mat1 = self.sim(q_emb, d_pos_emb)  # (BS, qlen, dlen)
            interact_pos_mat2 = self.sim(q_conved1.transpose(1, 2),
                                         d_pos_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
            interact_neg_mat1 = self.sim(q_emb, d_neg_emb)  # (BS, qlen, dlen)
            interact_neg_mat2 = self.sim(q_conved1.transpose(1, 2),
                                         d_neg_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
        # ..... some ops on the interaction matrices
        # to get scores S_pos, S_neg
        return S_pos, S_neg

I really do need these masks. The PAD embedding vector is randomly initialized and also learnable, so I must mask it to the zero vector before computing the cosine interaction. Otherwise there is a non-zero interaction signal cos(PAD, PAD), which adds a lot of noise to my model.

nyfbb · March 26, 2018, 2:49pm

Dear Jpeg729,
I enclosed some screenshots from terminal:
with mask d_pos_emb before first update
d_pos_emb Variable containing:
( 0 ,.,.) =
-0.1051 0.1341 0.1384 … -0.4195 0.0894 0.1757
0.2043 0.6410 0.2800 … 0.3296 0.9028 0.0257
-0.3194 0.1362 -0.0486 … -0.0811 0.3453 0.1107
… ⋱ …
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
-0.0000 -0.0000 -0.0000 … 0.0000 0.0000 -0.0000
with mask d_pos_emb after first update
[screenshot not preserved]

without mask d_pos_emb before first update
d_pos_emb Variable containing:
( 0 ,.,.) =
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
-1.2080e-01 5.8023e-01 -3.1423e-02 … 4.0058e-01 2.9359e-01 1.4976e-01
… ⋱ …
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01 … 1.8724e-01 5.1078e-01 -5.1949e-01
without mask d_pos_emb after first update
( 0 ,.,.) =
-2.3951e-01 -3.3803e-01 -2.6433e-01 … 3.2791e-01 3.6259e-02 2.3877e-01
-2.0393e-01 1.3825e-01 5.7552e-02 … 2.7366e-01 -5.1705e-01 3.0394e-01
-5.5689e-01 1.9376e-01 -3.0927e-01 … 3.2666e-01 -1.9124e-01 -2.3107e-01
… ⋱ …
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01 … 1.8624e-01 5.1178e-01 -5.2049e-01

jpeg729 · March 26, 2018, 8:45pm

It looks like you calculate the cosine similarity after applying the mask, could that be the source of the NaNs?
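A small sketch of one likely explanation, with made-up values: once a row is masked to exactly zero, the `torch.sqrt(torch.sum(x ** 2, ...))` norm used in cossim is evaluated at 0, where the derivative of the square root is infinite. Backprop then multiplies that infinity by the zero input and gets NaN. Adding a small constant to the denominator afterwards does not help, because the square root has already been taken.

```python
import torch

# Two toy "embedding" rows; the second plays the role of a PAD position.
emb = torch.tensor([[0.3, -0.4], [0.5, 1.2]], requires_grad=True)
mask = torch.tensor([[1.0], [0.0]])

x = emb * mask                               # PAD row becomes exactly zero
norm = torch.sqrt(torch.sum(x ** 2, dim=1))  # same norm formula as in cossim
norm.sum().backward()

# Row 0 gets x / ||x||; row 1 is NaN (inf from sqrt'(0) times 0 from 2 * x).
print(emb.grad)
```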

nyfbb · March 26, 2018, 9:05pm

Dear Jpeg729,
Yes, my goal is to use the masked embedding sequences to calculate the cosine similarity grid,
so that the interaction grid S(x_i, y_j) has 0s at the (PAD, PAD) positions.
Maybe I should apply a 2D mask to the interaction grid S instead, by multiplying by a 2D mask such as:
1,1,1,0,0
1,1,1,0,0
0,0,0,0,0
But that is more complicated, and S * S_mask still has the same issue of a non-differentiable mask…
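For reference, a hedged sketch of the 2D-mask idea, assuming the PAD embedding is left non-zero and the mask is applied only to S. The token ids, vocabulary size, and embedding size are hypothetical. The mask is a constant, so it needs no gradient; it only zeroes the entries of S (and their gradients) that involve PAD.

```python
import torch

torch.manual_seed(0)

def cossim(X, Y, eps=1e-5):
    # X: (BS, x_len, h), Y: (BS, y_len, h) -> (BS, x_len, y_len)
    X_norm = X.norm(dim=2).unsqueeze(2)  # (BS, x_len, 1)
    Y_norm = Y.norm(dim=2).unsqueeze(1)  # (BS, 1, y_len)
    return torch.bmm(X, Y.transpose(1, 2)) / (X_norm * Y_norm + eps)

# Hypothetical token ids, with 0 as PAD.
ids_x = torch.tensor([[5, 7, 2, 0, 0]])
ids_y = torch.tensor([[3, 9, 0]])
emb = torch.nn.Embedding(10, 4)  # PAD row is random, i.e. non-zero

x_mask = ids_x.ne(0).float().unsqueeze(2)  # (BS, x_len, 1)
y_mask = ids_y.ne(0).float().unsqueeze(1)  # (BS, 1, y_len)
S_mask = x_mask * y_mask                   # (BS, x_len, y_len) via broadcasting

# Embeddings are NOT masked, so no norm is ever taken of a zero vector.
S = cossim(emb(ids_x), emb(ids_y)) * S_mask
S.sum().backward()
```

With this arrangement every entry of S that involves PAD is zero, and the PAD embedding row receives a zero gradient rather than NaN.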

nyfbb · March 26, 2018, 10:35pm

The problem is resolved: I replaced my custom cosine similarity function with PyTorch’s F.normalize() and then simply took the dot product of the two normalized tensors x and y. There might be some stability issue in my function, even though I did take care to add 1e-12 to the denominators…
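A sketch of what that replacement might look like; the function name `cossim_normalized` and the toy shapes are assumptions, not the poster's actual code. `F.normalize` divides each vector by `max(norm, eps)`, so a masked all-zero row stays zero in the forward pass and, in current PyTorch versions, produces finite gradients.

```python
import torch
import torch.nn.functional as F

def cossim_normalized(X, Y, eps=1e-12):
    # X: (BS, x_len, h), Y: (BS, y_len, h) -> (BS, x_len, y_len)
    Xn = F.normalize(X, p=2, dim=2, eps=eps)
    Yn = F.normalize(Y, p=2, dim=2, eps=eps)
    return torch.bmm(Xn, Yn.transpose(1, 2))

torch.manual_seed(0)
X = torch.randn(2, 4, 8, requires_grad=True)
Y = torch.randn(2, 5, 8, requires_grad=True)

# Toy masks: the last position of X and the last two of Y count as PAD.
x_mask = torch.ones(2, 4, 1)
x_mask[:, -1] = 0
y_mask = torch.ones(2, 5, 1)
y_mask[:, -2:] = 0

S = cossim_normalized(X * x_mask, Y * y_mask)
S.sum().backward()
```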

Abhishek_Kumar2 · December 5, 2020, 11:03am

I have this function:

func = torch.sqrt((dudx - dudy)**2 + (dudy - dudx)**2)
where dudx and dudy are the derivatives of U with respect to the inputs x and y.
Since the derivative of the square root is infinite at zero, I want to replace the values of func that are less than or equal to 0 with 0.1, and did it like this:
func[(func<=0.0).detach()] = 0.1
This is the func value I got:

> tensor([1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
>         1.2064e-05, 4.1480e-06, 5.6752e-06, 1.0000e-01, 1.0000e-01, 9.2189e-06,
>         3.1551e-06, 4.3206e-06, 1.0000e-01, 1.0000e-01, 7.0325e-06, 2.4059e-06,
>         3.3041e-06, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
>         1.0000e-01], grad_fn=<IndexPutBackward>)

Now, if I try to take the derivative of func with respect to the inputs, I get NaN at the positions of the imputed values:

compute_grad(func, inputs)[:,0].reshape(nx,ny)

tensor([[        nan,         nan,         nan,         nan,         nan],
        [        nan, -1.7759e-05, -6.2834e-06, -8.6144e-06,         nan],
        [        nan, -1.3896e-05, -4.7949e-06, -6.5147e-06,         nan],
        [        nan, -1.0686e-05, -3.6399e-06, -4.9337e-06,         nan],
        [        nan,         nan,         nan,         nan,         nan]],
       grad_fn=<ViewBackward>)

Can someone help me overcome this? I want the derivative of func without getting NaN. Any help is appreciated. @smth, @ptrblck could you please have a look?
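One possible workaround, sketched on a toy tensor: overwriting values after the square root leaves the sqrt node in the graph, so backward still evaluates 0 / (2 * sqrt(0)) at those positions. Clamping the argument before the square root avoids this. The helper `safe_sqrt` and its `eps` are assumptions for illustration, and `torch.autograd.grad` stands in for the `compute_grad` helper, which is not shown in the post.

```python
import torch

def safe_sqrt(v, eps=1e-12):
    # Clamp BEFORE the square root so backward never divides by zero.
    return torch.sqrt(torch.clamp(v, min=eps))

a = torch.tensor([0.0, 0.0, 2.0], requires_grad=True)

# Replacing values after the sqrt: sqrt's backward still runs at 0.
f = torch.sqrt(a ** 2)
f = torch.where(f <= 0.0, torch.full_like(f, 0.1), f)
g_bad, = torch.autograd.grad(f.sum(), a)

# Clamping the argument first: clamp passes zero gradient below eps.
f = safe_sqrt(a ** 2)
g_ok, = torch.autograd.grad(f.sum(), a)

print(g_bad)  # NaN at the two zero entries
print(g_ok)   # finite everywhere
```

Note that the clamped version returns sqrt(eps) rather than 0.1 at those positions; if a specific fill value matters, it can still be applied afterwards with `torch.where`, since the gradient flowing back into the clamped sqrt is then finite.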

omerlux · April 1, 2021, 7:22am

I’m having the same problem while implementing variational dropout, where the same mask is reused over and over.
I tried mask.detach() and also Variable(mask, requires_grad=False). I even tried cloning the mask on every forward pass, but I still get NaNs after a few iterations…
Check my post for more information… implementing-variational-dropout-cause-nan-values.
Thanks