LayerNorm backward

Why does PyTorch use three different kernels (four when elementwise_affine is True) for the LayerNorm backward pass? NVIDIA Apex seems to use only a single kernel, or two when elementwise_affine is True. Are there some edge cases that Apex does not handle but PyTorch does?

Also, how are the scale and bias here (pytorch/layer_norm_kernel.cu at master · pytorch/pytorch · GitHub) different from gamma and beta? Aren't gamma and beta the scale and bias parameters of LayerNorm?
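
To make my understanding concrete, here is a small sketch (my own code, not the kernel source, with placeholder shapes) of what I take gamma and beta to be, i.e. the elementwise scale and shift applied after normalization:

```python
import torch

# My own illustration: gamma is the elementwise scale and beta the shift
# applied after normalization; they live in ln.weight and ln.bias.
x = torch.randn(4, 8)
ln = torch.nn.LayerNorm(8, elementwise_affine=True)
gamma, beta = ln.weight, ln.bias  # what I mean by "scale" and "bias"

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mean) * torch.rsqrt(var + ln.eps)
print(torch.allclose(x_hat * gamma + beta, ln(x), atol=1e-5))  # expect True
```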

While it could be that PyTorch’s kernels are less optimized, I do think that Apex uses three kernels if you want all gradients for the input and the parameters (cuComputePartGradGammaBeta, cuComputeGradGammaBeta, cuComputeGradInput), so the difference is not quite as stark.
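
Very roughly, what the gamma/beta kernels have to compute, however the work is split across launches, are two per-feature reductions over the rows. A plain-PyTorch sketch (variable names are just illustrative):

```python
import torch

# Sketch of the gamma/beta backward reductions: per-feature gradients
# are sums over the rows (the batch/token dimension).
N, H = 32, 8
x = torch.randn(N, H)
grad_out = torch.randn(N, H)

mean = x.mean(dim=-1, keepdim=True)
rstd = torch.rsqrt(x.var(dim=-1, unbiased=False, keepdim=True) + 1e-5)
x_hat = (x - mean) * rstd

grad_gamma = (grad_out * x_hat).sum(dim=0)  # dL/dgamma, shape (H,)
grad_beta = grad_out.sum(dim=0)             # dL/dbeta,  shape (H,)

# Quick check against autograd (default weight=1, bias=0, eps=1e-5)
ln = torch.nn.LayerNorm(H)
ln(x).backward(grad_out)
print(torch.allclose(grad_gamma, ln.weight.grad, atol=1e-5),
      torch.allclose(grad_beta, ln.bias.grad, atol=1e-5))
```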

That said, if you have benchmarks that show PyTorch being slower than Apex, I’m quite certain people will look into it more (PyTorch did adopt the Apex LayerNorm forward at some point).
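
Something along these lines would already be a useful starting point (shapes are arbitrary placeholders, it times forward + backward together, and it needs Apex installed):

```python
import torch
import torch.utils.benchmark as benchmark
from apex.normalization import FusedLayerNorm  # requires NVIDIA Apex

N, H = 8192, 1024  # placeholder sizes
x = torch.randn(N, H, device="cuda", requires_grad=True)
grad_out = torch.randn(N, H, device="cuda")

modules = {"torch": torch.nn.LayerNorm(H).cuda(),
           "apex": FusedLayerNorm(H).cuda()}

def fwd_bwd(ln):
    # rough comparison: measures forward + backward in one go
    ln(x).backward(grad_out)
    x.grad = None

for name, ln in modules.items():
    t = benchmark.Timer(stmt="fwd_bwd(ln)",
                        globals={"fwd_bwd": fwd_bwd, "ln": ln})
    print(name, t.timeit(100))
```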

Best regards

Thomas

I did a high-level mapping. Apex’s cuComputePartGradGammaBeta and cuComputeGradGammaBeta are together equivalent to PyTorch’s GammaBetaBackwardKernel, which computes the gradients for gamma and beta when those parameters are trainable. Despite PyTorch doing this in a single kernel, the performance of this gradient computation is actually much worse than Apex’s, because the GammaBetaBackward kernel does not use the GPU effectively: both occupancy and memory-throughput utilization are low. I can post some benchmark results, but we have seen this in many cases.

Other than that, what Apex does in cuComputeGradInput maps to a set of three kernels in PyTorch (ComputeInternalGradients, ComputeGradientFusedParamsCUDAKernel and LayerNormBackwardKernel). It’s not clear why this part is not fused, though the GPU utilization is fairly good for these kernels.
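
For reference, the grad-input math those kernels compute, regardless of how it is split across launches, is roughly the following; layer_norm_grad_input and the variable names are my own, and the exact assignment of reductions to kernels above is just my reading of the code:

```python
import torch

# Reference for the grad-input part of LayerNorm backward:
# dL/dx = rstd * (g - mean(g) - x_hat * mean(g * x_hat)), with g = grad_out * gamma.
def layer_norm_grad_input(grad_out, x, gamma, eps=1e-5):
    H = x.shape[-1]
    mean = x.mean(dim=-1, keepdim=True)
    rstd = torch.rsqrt(x.var(dim=-1, unbiased=False, keepdim=True) + eps)
    x_hat = (x - mean) * rstd
    g = grad_out * gamma                         # dL/dx_hat
    c1 = g.sum(dim=-1, keepdim=True)             # per-row reductions
    c2 = (g * x_hat).sum(dim=-1, keepdim=True)
    return (g - (c1 + x_hat * c2) / H) * rstd    # dL/dx

# Quick check against autograd
x = torch.randn(16, 64, requires_grad=True)
ln = torch.nn.LayerNorm(64)
grad_out = torch.randn(16, 64)
ln(x).backward(grad_out)
print(torch.allclose(layer_norm_grad_input(grad_out, x, ln.weight),
                     x.grad, atol=1e-5))
```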