Grad_input_mask layer norm

Trinayan_Baruah · August 3, 2021, 8:22pm

Hi,

What is the grad input mask that is passed in the layer norm kernel call (pytorch/layer_norm_kernel.cu at master · pytorch/pytorch · GitHub)? I notice it invokes a bunch of kernels like LayerNormBackward whenever it is set. Can someone tell me what it actually means and when it is used or not used?

Thanks.

ptrblck · August 9, 2021, 5:32am

Based on the linked code, the mask defines whether specific gradients (dgamma and dbeta) should be initialized and passed to the gradient calculation. It is used to skip the initialization of these tensors if the gradients are not needed.
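
To see the effect from the Python side, here is a minimal sketch. It assumes that autograd fills the mask from which inputs require gradients, which I have not traced through the dispatcher, so treat that mapping as an assumption. The helper name `layer_norm_grads` and the tensor shapes are made up for illustration.

```python
import torch
import torch.nn.functional as F


def layer_norm_grads(need_input, need_weight, need_bias):
    """Hypothetical helper: run layer norm forward and backward, return the grads.

    Each flag toggles requires_grad on one input. The assumption is that
    these flags end up as the entries of the backward's grad input mask.
    """
    torch.manual_seed(0)
    x = torch.randn(4, 8, requires_grad=need_input)
    weight = torch.ones(8, requires_grad=need_weight)   # gamma
    bias = torch.zeros(8, requires_grad=need_bias)      # beta

    out = F.layer_norm(x, (8,), weight, bias)
    out.sum().backward()

    # A gradient that was not requested stays None on the Python side.
    return x.grad, weight.grad, bias.grad


# Only the input needs a gradient: dgamma and dbeta are never produced.
dx, dgamma, dbeta = layer_norm_grads(True, False, False)
print(dx is not None, dgamma is None, dbeta is None)

# Everything requires grad: all three gradients are produced.
dx, dgamma, dbeta = layer_norm_grads(True, True, True)
print(dx.shape, dgamma.shape, dbeta.shape)
```

At least one of the three flags has to be True, otherwise `backward()` raises because the output does not require a gradient. A typical case where parts of the mask are off would be a frozen LayerNorm (weight and bias with `requires_grad=False`) inside a model whose other layers are still trained.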