Could you try writing the normalization manually and letting nvFuser code-gen the kernel for you, similar to this use case?
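To make the suggestion concrete, here is a minimal sketch of what "manually written" could look like. It assumes layer normalization over the last dimension, since the thread does not say which normalization is meant, and `manual_layer_norm` is a hypothetical name. The point is only that the op is spelled out as plain reductions and pointwise math, which is the form a fuser can generate a kernel from. How nvFuser gets enabled depends on your PyTorch version, so that part is left out.

```python
import torch


def manual_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Assumption: layer norm over the last dimension, picked for illustration.
    # Only primitive reductions and pointwise ops are used, so a fuser sees
    # the individual steps instead of one opaque library call.
    mean = x.mean(dim=[-1], keepdim=True)
    centered = x - mean
    # Biased (population) variance, matching torch.nn.functional.layer_norm.
    var = (centered * centered).mean(dim=[-1], keepdim=True)
    return centered * torch.rsqrt(var + eps) * weight + bias
```

A quick sanity check is to compare the result against `torch.nn.functional.layer_norm` on a random input before handing the function to a fuser, for example after wrapping it with `torch.jit.script`.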