Suppose h is the output embedding vector of a network. Out of the D dimensions of h, I multiply the last dr dimensions by 0 before passing it to the loss function, while keeping the remaining (D - dr) dimensions unchanged. That is, h = [h1, h2, h3, ..., h_(D-dr), 0, 0, ..., 0].
This is done for all the output embeddings in a batch.
The question is: when the gradients are back-propagated, do the gradients along those dr dimensions become zero as well?
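For context, here is a minimal sketch of what I mean (assuming PyTorch; the values of D, dr, the batch size, and the squared-sum loss are all illustrative placeholders, not my actual setup):

```python
import torch

D, dr = 8, 3                                 # total dims, number of dims to zero out
h = torch.randn(4, D, requires_grad=True)    # batch of output embeddings

mask = torch.ones(D)
mask[-dr:] = 0                               # zero the last dr dimensions
h_masked = h * mask                          # h = [h1, ..., h_(D-dr), 0, ..., 0]

loss = h_masked.pow(2).sum()                 # stand-in for the real loss
loss.backward()

print(h.grad[:, -dr:])                       # gradients along the masked dimensions
```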