Explicit casting inside autocast context

Can we use explicit casts inside an autocast-enabled context, e.g. input = input.to(torch.float32) or input = input.to(torch.bfloat16)? Based on the documentation here: Automatic Mixed Precision package - torch.amp — PyTorch 2.6 documentation, it seems we can't call .half() or .bfloat16() on the inputs/models inside the enabled context. No reasoning for that is provided; my guess is that .half() and .bfloat16() create new tensors that are not tracked by autograd, which could be a problem during the backward pass because the wrong dtype might be used.

Based on my understanding, autocast works on a per-op basis, so if I want a specific op that is part of the autocast policy to execute in a different dtype, that call must be made in an autocast-disabled context. However, if an op is not part of the policy, then it is fine to perform explicit casting with .to(). Could anyone confirm whether this understanding is correct? There is very little documentation on the effects of explicit casting inside an autocast context.
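To make the question concrete, here is a minimal sketch of the per-op behavior I mean (assuming a CUDA device; torch.mm is listed in the CUDA float16 autocast policy, while torch.relu is not):

```python
import torch

device = "cuda"
x = torch.randn(4, 4, device=device)
w = torch.randn(4, 4, device=device)

with torch.autocast(device_type=device, dtype=torch.float16):
    # mm is on the autocast cast list, so it runs in float16 even though
    # the inputs were explicitly cast to float32 right before the call
    y = torch.mm(x.to(torch.float32), w.to(torch.float32))
    print(y.dtype)  # torch.float16

    # relu is not on the cast list, so it simply runs in the dtype of its
    # input and the explicit cast is respected
    z = torch.relu(x.to(torch.bfloat16))
    print(z.dtype)  # torch.bfloat16
```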

If you want to explicitly change the dtype, disable autocast via a nested context as described in the docs:

autocast(enabled=False) subregions can be nested in autocast-enabled regions. Locally disabling autocast can be useful, for example, if you want to force a subregion to run in a particular dtype. Disabling autocast gives you explicit control over the execution type.
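As a minimal sketch of that pattern (assuming a CUDA device):

```python
import torch

device = "cuda"
a = torch.randn(8, 8, device=device)
b = torch.randn(8, 8, device=device)

with torch.autocast(device_type=device, dtype=torch.float16):
    # runs in float16 under the autocast policy
    c = torch.mm(a, b)

    # locally disable autocast to force this subregion to run in float32
    with torch.autocast(device_type=device, enabled=False):
        # c was produced as float16 inside the autocast region, so cast the
        # inputs explicitly before use, as the docs recommend
        d = torch.mm(c.float(), b.float())

print(c.dtype, d.dtype)  # torch.float16 torch.float32
```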

So should we disable the autocast context even if the op is not part of the autocast policy? I understand that the context needs to be disabled when an op is part of the policy and needs a different dtype.

I was looking at some code bases where autocast is used and I do not see this happening. For example, in NVIDIA NeMo: NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py at main · NVIDIA/NeMo · GitHub, ParallelMLP uses RowParallelLinear (RPL) and ColumnParallelLinear (CPL) layers. These layers internally cast all of their inputs to BF16 if autocast is enabled, but this casting is done inside the enabled context, as shown here:
apex/apex/transformer/tensor_parallel/layers.py at master · NVIDIA/apex · GitHub
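
For reference, the pattern I am describing looks roughly like the following sketch (a hedged approximation of the idea, not the actual apex code; the helper name is made up):

```python
import torch

def maybe_cast_to_autocast_dtype(*args):
    # Hypothetical helper illustrating the pattern: when autocast is enabled,
    # cast floating-point tensor arguments to the current autocast dtype
    # before calling into the custom kernel / autograd function.
    if not torch.is_autocast_enabled():
        return args
    dtype = torch.get_autocast_gpu_dtype()  # e.g. torch.bfloat16
    return tuple(
        a.to(dtype) if torch.is_tensor(a) and a.is_floating_point() else a
        for a in args
    )
```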

So, is this a bug in NeMo? Or can we explicitly cast inside an enabled autocast context, as long as we are careful about it?