grad_input_mask in layer norm


What is the grad_input_mask that is passed in the layer norm kernel call (pytorch/ at master · pytorch/pytorch · GitHub)? I notice that a bunch of kernels like LayerNormBackward are invoked whenever it is set. Can someone tell me what it actually means and when it is used or not used?


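Based on the linked code, the mask defines whether specific gradients (dgamma and dbeta) should be initialized and passed to the gradient calculation. Each entry is a boolean flag for one gradient: if the flag is set, the corresponding tensor is allocated and the kernels that fill it are launched; if it is not set, the initialization of that tensor is skipped because the gradient is not needed.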
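To make the idea concrete, here is a toy sketch in plain Python. It is not PyTorch's kernel: the function name `layer_norm_backward_sketch` is hypothetical, the inputs are plain lists instead of tensors, and the mask order (input gradient, dgamma, dbeta) is my assumption about how the flags are laid out. It only shows the pattern of allocating and computing a gradient when its flag is set.

```python
import math


def layer_norm_backward_sketch(dy, x, gamma, grad_input_mask, eps=1e-5):
    """Toy layer norm backward over the last dim of a 2-D list of lists.

    grad_input_mask is assumed to be (need_dx, need_dgamma, need_dbeta).
    A gradient whose flag is False is neither allocated nor computed and
    is returned as None.
    """
    need_dx, need_dgamma, need_dbeta = grad_input_mask
    n = len(gamma)

    # Allocate each output only if the mask asks for it.
    dx = [[0.0] * n for _ in x] if need_dx else None
    dgamma = [0.0] * n if need_dgamma else None
    dbeta = [0.0] * n if need_dbeta else None

    for i, (row, dy_row) in enumerate(zip(x, dy)):
        mean = sum(row) / n
        var = sum((v - mean) ** 2 for v in row) / n
        rstd = 1.0 / math.sqrt(var + eps)
        x_hat = [(v - mean) * rstd for v in row]

        if need_dx:
            g = [d * w for d, w in zip(dy_row, gamma)]
            sum_g = sum(g)
            sum_gx = sum(a * b for a, b in zip(g, x_hat))
            dx[i] = [
                rstd * (g[j] - sum_g / n - x_hat[j] * sum_gx / n)
                for j in range(n)
            ]
        if need_dgamma:
            for j in range(n):
                dgamma[j] += dy_row[j] * x_hat[j]
        if need_dbeta:
            for j in range(n):
                dbeta[j] += dy_row[j]

    return dx, dgamma, dbeta


# Example: the weight is frozen, so dgamma is not requested.
x = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
dy = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
gamma = [1.0, 1.0, 1.0]
dx, dgamma, dbeta = layer_norm_backward_sketch(dy, x, gamma, (True, False, True))
```

In this example `dgamma` comes back as `None` and the loop never touches it, which is the saving the mask is meant to provide. In PyTorch itself, my understanding is that the mask is derived from which inputs need gradients (for instance, a weight with `requires_grad=False` would not need dgamma), but treat that as an assumption and check the linked source.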