Can `autocast` handle networks with layers having different dtypes?

Hi,

torch version = 2.5.0

I am wondering whether torch.autocast can handle neural networks with layers having different dtypes.

The following code suggests it cannot:

import torch
net = torch.nn.Sequential(
    torch.nn.Linear(2, 10, dtype=torch.float16),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 10, dtype=torch.float32),
    torch.nn.ReLU(),
)
with torch.autocast("cuda"):
    net(torch.as_tensor([[1., 2.]], dtype=torch.float16))

=> it raises RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float

But at some point recently I had started to believe that autocast was able to handle such cases.
I loaded a pretrained model from Hugging Face and changed just the dtype of one layer. Without autocast, the same RuntimeError was raised, but with autocast the error disappeared.
Here is a minimal reproducible example:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.float16,
    device_map="cuda"
)
pipe_model_named_parameters = dict(pipe.model.named_parameters())
for name, param in pipe_model_named_parameters.items():
    if "score" in name:  # convert the last trainable layer to float32 for stability during training
        param.data = param.data.to(dtype=torch.float32)
with torch.autocast("cuda"):  # context manager needed in case not all layers have the same dtype
    print(pipe(["a", "b", "z"]))
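For comparison (as noted above), calling the pipeline outside the autocast context reproduces the original error, since the float32 score head then receives float16 activations:

print(pipe(["a", "b", "z"]))  # raises RuntimeError: mat1 and mat2 must have the same dtype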

Any insight on what autocast allows and does not allow?


Seems to work if you use CUDA tensors with torch.autocast("cuda"):

import torch
net = torch.nn.Sequential(
    torch.nn.Linear(2, 10, dtype=torch.float16),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 10, dtype=torch.float32),
    torch.nn.ReLU(),
).to("cuda")
with torch.autocast("cuda"):
    net(torch.tensor([[1., 2.]], dtype=torch.float16, device="cuda"))
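As a quick check (same setup as above), the output dtype shows that autocast ran the linear ops in float16, while the parameters keep the dtypes they were created with:

with torch.autocast("cuda"):
    out = net(torch.tensor([[1., 2.]], dtype=torch.float16, device="cuda"))
print(out.dtype)            # torch.float16 (autocast's default dtype on CUDA)
print(net[0].weight.dtype)  # torch.float16 - parameter storage is unchanged
print(net[2].weight.dtype)  # torch.float32 - autocast casts per op, not in place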

I'm not sure autocast will have any effect if you cast all layers manually:

Autocast will respect types assigned manually:

Ops called with an explicit dtype=... argument are not eligible, and will produce output that respects the dtype argument.

The dtype argument here for nn.Linear should just be used to initialize the dtype of the weight (see torch/nn/modules/linear.py).
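A small sketch of that distinction, assuming a CUDA device (torch.softmax is one of the ops autocast runs in float32):

import torch

x = torch.randn(1, 4, device="cuda", dtype=torch.float16)
with torch.autocast("cuda"):
    a = torch.softmax(x, dim=-1)                       # eligible: autocast runs it in float32
    b = torch.softmax(x, dim=-1, dtype=torch.float16)  # explicit dtype= -> not eligible, output respects the argument
print(a.dtype, b.dtype)  # torch.float32 torch.float16

By contrast, the dtype= passed to nn.Linear only sets the parameters' storage dtype; the matmul in its forward is still eligible for autocasting.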


Thanks for your reply @soulitzer.
Indeed, it seems the problem was caused by net not being on the CUDA device while "cuda" was specified in torch.autocast.
Both options below work fine:

  1. adding .to("cuda") to net and specifying device="cuda" for the input tensor, as @soulitzer pointed out
  2. replacing torch.autocast("cuda") with torch.autocast("cpu")

In case 1 the returned tensor has dtype torch.float16, while in case 2 it has dtype torch.bfloat16. Not sure why there is such a difference.
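For reference, the difference seems to come from autocast's device-specific default dtype (torch.float16 on CUDA, torch.bfloat16 on CPU, per the torch.autocast documentation). A minimal sketch, assuming a CUDA device that supports bfloat16 and reusing the net on "cuda" from the reply above, showing how to pin the dtype explicitly:

with torch.autocast("cuda", dtype=torch.bfloat16):  # override the float16 default on CUDA
    out = net(torch.tensor([[1., 2.]], dtype=torch.float16, device="cuda"))
print(out.dtype)  # torch.bfloat16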