I was working on a text matching task that requires constructing an interaction grid S, where each element is a cosine similarity: S_{ij} = cossim(x_i, y_j).
The x, y are extracted embeddings with:
x.size() = (BatchSize, xLen, emb_dim),
y.size() = (BatchSize, yLen, emb_dim).
To compute the batched cosine similarity and obtain an S of size (BS, xLen, yLen) with S_{ij} = cossim(x_i, y_j), I wrote

```python
import torch

def cossim(X, Y):
    """Calculate the cosine similarity between X and Y: cos(X, Y).
    X: tensor (BS, x_len, hidden_size)
    Y: tensor (BS, y_len, hidden_size)
    returns: (BS, x_len, y_len)
    """
    X_norm = torch.sqrt(torch.sum(X ** 2, dim=2)).unsqueeze(2)  # (BS, x_len, 1)
    Y_norm = torch.sqrt(torch.sum(Y ** 2, dim=2)).unsqueeze(1)  # (BS, 1, y_len)
    S = torch.bmm(X, Y.transpose(1, 2)) / (X_norm * Y_norm + 1e-5)
    return S
```

I know that cossim is not differentiable at (0, 0), so I initialized the padding embedding to a random (non-zero) vector.
In the inputs x and y (both packed into Variables), the last several words are paddings, so I must apply a mask to filter the padding vectors out:

```python
x_mask = torch.ne(x, 0).unsqueeze(2).float()  # since x is packed as a Variable, x_mask is also a Variable
```

After one backprop, parts of x and y become NaN.
If I don't apply the mask, there is no NaN problem:

```python
x = ...
y = ...
```

I detached x_mask and y_mask by two methods, to make them non-learnable so that they won't mess up the gradients:

(1) x, y are Variables; extract x.data and y.data, generate the masks from the data, then pack the masks into Variables and apply them:

```python
x_mask = torch.ne(x.data, 0).unsqueeze(2).float()  # built from x.data, so it carries no graph history
```

(2) generate the masks directly from the Variables:

```python
x_mask = torch.ne(x, 0).unsqueeze(2).float()  # since x is packed as a Variable, x_mask is also a Variable
```

But no use; both still gave NaN in x and y after one backprop.
When I remove the multiplication by the mask, everything is OK.
Is the problem in the cossim or in the mask? How can I achieve this?
Thanks


`torch.ne` has zero gradient almost everywhere and an undefined gradient where `x == 0`. In practice, if `x == 0`, PyTorch returns 0 as the gradient of `torch.ne`. Therefore detaching `x_mask` is not useful.

`x * x_mask` is basically an identity mapping for some elements of `x`, in which case the gradients flow through unmodified, or a zero mapping, in which case the gradients are blocked.
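A minimal sketch of that behaviour (the tensor values are made up for illustration):

```python
import torch

# The mask comes from a comparison, so it carries no gradient of its own;
# multiplying by it either passes the gradient through (mask == 1)
# or blocks it (mask == 0).
x = torch.tensor([1.0, 2.0, 0.0], requires_grad=True)
x_mask = torch.ne(x, 0).float()  # comparison output is non-differentiable
(x * x_mask).sum().backward()
print(x.grad)  # tensor([1., 1., 0.])
```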

I can’t see why you might get NaN with the mask and not without. Could you post some more code?

Dear jpeg729,
Thanks for the help:

```python
def forward(self, q, d_pos, d_neg):
    """Apply rel.
    q: LongTensor (BS, qlen) Variable input
    d: LongTensor (BS, dlen) Variable input
    returns R1, R2, R3: relevance of 3 levels (BS,)
    """
    q_mask = torch.ne(q, 0).unsqueeze(2).float()          # (BS, qlen, 1)
    d_pos_mask = torch.ne(d_pos, 0).unsqueeze(2).float()  # (BS, dlen, 1)
    d_neg_mask = torch.ne(d_neg, 0).unsqueeze(2).float()  # (BS, dlen, 1)
    q_emb = self.emb_mod(q) * q_mask              # (BS, qlen, emb_size)
    d_pos_emb = self.emb_mod(d_pos) * d_pos_mask  # (BS, dlen, emb_size)
    d_neg_emb = self.emb_mod(d_neg) * d_neg_mask  # (BS, dlen, emb_size)
    # do convs
    q_conved1 = self.q_conv1(q_emb.transpose(1, 2))  # (BS, hs, qLen1)
    q_conved2 = self.q_conv2(q_conved1)              # (BS, hs, 1)
    d_pos_conved1 = self.d_conv1(d_pos_emb.transpose(1, 2))  # (BS, hs, dLen1)
    d_pos_conved2 = self.d_conv2(d_pos_conved1)              # (BS, hs, 1)
    d_neg_conved1 = self.d_conv1(d_neg_emb.transpose(1, 2))  # (BS, hs, dLen1)
    d_neg_conved2 = self.d_conv2(d_neg_conved1)              # (BS, hs, 1)
    # interaction matrices
    if self.sim_type == "Dot" or self.sim_type == "Cos":
        interact_pos_mat1 = self.sim(q_emb, d_pos_emb)  # (BS, qlen, dlen)
        interact_pos_mat2 = self.sim(q_conved1.transpose(1, 2),
                                     d_pos_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
        interact_neg_mat1 = self.sim(q_emb, d_neg_emb)  # (BS, qlen, dlen)
        interact_neg_mat2 = self.sim(q_conved1.transpose(1, 2),
                                     d_neg_conved1.transpose(1, 2))  # (BS, qLen1, dLen1)
    # ... some ops on the interaction matrices
    # to get scores S_pos, S_neg
    return S_pos, S_neg
```

Actually I do need these masks: the PAD embedding vector is randomly initialized and learnable, so I must mask it to the zero vector before computing the cosine interaction. Otherwise there is a non-zero interaction signal cos(PAD, PAD), which adds a lot of noise to my model.
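To illustrate why (the embedding size of 8 is made up): any non-zero PAD vector has cosine similarity 1 with itself, so unmasked padding positions inject a maximal interaction signal, not a zero one.

```python
import torch

# cos(PAD, PAD) == 1 for any non-zero vector, whatever its values are.
pad = torch.randn(8)  # illustrative randomly initialized PAD embedding
cos = torch.dot(pad, pad) / (pad.norm() * pad.norm())
print(cos.item())  # ~1.0
```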

Dear jpeg729,
I enclosed some outputs from the terminal.

With mask, d_pos_emb before the first update:

```
d_pos_emb Variable containing:
( 0 ,.,.) =
-0.1051  0.1341  0.1384  ...  -0.4195  0.0894  0.1757
 0.2043  0.6410  0.2800  ...   0.3296  0.9028  0.0257
-0.3194  0.1362 -0.0486  ...  -0.0811  0.3453  0.1107
            ...       ⋱       ...
-0.0000 -0.0000 -0.0000  ...   0.0000  0.0000 -0.0000
-0.0000 -0.0000 -0.0000  ...   0.0000  0.0000 -0.0000
-0.0000 -0.0000 -0.0000  ...   0.0000  0.0000 -0.0000
```

With mask, d_pos_emb after the first update: (output not included in the paste)

Without mask, d_pos_emb before the first update:

```
d_pos_emb Variable containing:
( 0 ,.,.) =
-1.2080e-01  5.8023e-01 -3.1423e-02  ...  4.0058e-01  2.9359e-01  1.4976e-01
-1.2080e-01  5.8023e-01 -3.1423e-02  ...  4.0058e-01  2.9359e-01  1.4976e-01
-1.2080e-01  5.8023e-01 -3.1423e-02  ...  4.0058e-01  2.9359e-01  1.4976e-01
                 ...          ⋱          ...
-3.0333e-01 -2.1788e-01 -7.2018e-01  ...  1.8724e-01  5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01  ...  1.8724e-01  5.1078e-01 -5.1949e-01
-3.0333e-01 -2.1788e-01 -7.2018e-01  ...  1.8724e-01  5.1078e-01 -5.1949e-01
```

Without mask, d_pos_emb after the first update:

```
( 0 ,.,.) =
-2.3951e-01 -3.3803e-01 -2.6433e-01  ...  3.2791e-01  3.6259e-02  2.3877e-01
-2.0393e-01  1.3825e-01  5.7552e-02  ...  2.7366e-01 -5.1705e-01  3.0394e-01
-5.5689e-01  1.9376e-01 -3.0927e-01  ...  3.2666e-01 -1.9124e-01 -2.3107e-01
                 ...          ⋱          ...
-3.0234e-01 -2.1688e-01 -7.1918e-01  ...  1.8624e-01  5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01  ...  1.8624e-01  5.1178e-01 -5.2049e-01
-3.0234e-01 -2.1688e-01 -7.1918e-01  ...  1.8624e-01  5.1178e-01 -5.2049e-01
```

It looks like you calculate the cosine similarity after applying the mask; could that be the source of the NaNs?

Dear jpeg729,
Yes, my goal is to use the masked embedding sequences to calculate the cosine similarity grid, so that the interaction grid S(x_i, y_j) will have 0s in the (PAD, PAD) positions.
Maybe I should apply a 2D mask on the interaction grid S instead, by multiplying by, for example, a 2D mask like:

```
1, 1, 1, 0, 0
1, 1, 1, 0, 0
0, 0, 0, 0, 0
```

But it's more complicated, and it still has the same problem of S * S_mask with a non-differentiable mask…
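For what it's worth, that 2D mask is just the batched outer product of the two 1D masks, so it can be built without loops (the shapes below are illustrative):

```python
import torch

# q_mask: (BS, qlen, 1), d_mask: (BS, dlen, 1), as in the forward() above
q_mask = torch.tensor([[[1.], [1.], [0.]]])              # query: 2 words + 1 PAD
d_mask = torch.tensor([[[1.], [1.], [1.], [0.], [0.]]])  # doc: 3 words + 2 PADs
S_mask = torch.bmm(q_mask, d_mask.transpose(1, 2))       # (BS, qlen, dlen)
print(S_mask[0])
# 1 where both positions are real words, 0 in any row/column touching a PAD
```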

The problem is resolved: I replaced my custom cosine similarity function with PyTorch's F.normalize() and then simply take the dot product of the two normalized x and y. There might be some stability issue in my function, even though I did pay attention to add 1e-12 to the denominators…
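For reference, a sketch of that fix, assuming the same (BS, len, emb_dim) layout as above (the name `cossim_stable` is illustrative). F.normalize divides by max(norm, eps), so all-zero rows stay all-zero instead of producing inf/NaN:

```python
import torch
import torch.nn.functional as F

def cossim_stable(X, Y):
    # Normalize along the embedding dimension; zero rows stay zero because
    # F.normalize clamps the norm with its eps (default 1e-12).
    Xn = F.normalize(X, p=2, dim=2)
    Yn = F.normalize(Y, p=2, dim=2)
    return torch.bmm(Xn, Yn.transpose(1, 2))  # (BS, x_len, y_len)

X = torch.randn(2, 4, 8)
X[:, -1] = 0                 # simulate masked PAD rows
S = cossim_stable(X, X)
print(torch.isnan(S).any())  # tensor(False)
```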


This is a function:

```python
func = torch.sqrt((dudx - dudy)**2 + (dudy - dudx)**2)
```

where dudx, dudy are the derivatives of U w.r.t. the inputs x and y.
Since the derivative of the square root at zero is infinite, I want to impute the values of func that are less than or equal to 0 with 0.1, and applied the following for it:

```python
func[(func <= 0.0).detach()] = 0.1
```
This is the func value I got:

```
tensor([1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
        1.2064e-05, 4.1480e-06, 5.6752e-06, 1.0000e-01, 1.0000e-01, 9.2189e-06,
        3.1551e-06, 4.3206e-06, 1.0000e-01, 1.0000e-01, 7.0325e-06, 2.4059e-06,
        3.3041e-06, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01, 1.0000e-01,
```

Now if I try to take the derivative of func w.r.t. the inputs, I get NaN in place of the imputed values:

```python
compute_grad(func, inputs)[:, 0].reshape(nx, ny)
```

```
tensor([[        nan,         nan,         nan,         nan,         nan],
        [        nan, -1.7759e-05, -6.2834e-06, -8.6144e-06,         nan],
        [        nan, -1.3896e-05, -4.7949e-06, -6.5147e-06,         nan],
        [        nan, -1.0686e-05, -3.6399e-06, -4.9337e-06,         nan],
        [        nan,         nan,         nan,         nan,         nan]],
```
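What is happening, sketched minimally: the in-place assignment overwrites the forward values but not the backward graph, so sqrt's infinite derivative at 0 is still multiplied by the (zeroed) upstream gradient, and 0 * inf = NaN. (The `.clone()` below is only there so the in-place edit doesn't modify the output that sqrt saves for its backward pass.)

```python
import torch

x = torch.zeros(3, requires_grad=True)
f = torch.sqrt(x).clone()     # d sqrt/dx = 1/(2*sqrt(x)) -> inf at x == 0
f[(f <= 0.0).detach()] = 0.1  # overwrites the values, not the graph
f.sum().backward()
print(x.grad)  # tensor([nan, nan, nan]): 0 (blocked grad) * inf = nan
```

The usual way out is to keep the graph away from sqrt(0) in the first place, e.g. `torch.sqrt(expr + 1e-8)`, or by masking the argument *before* the sqrt rather than its output.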