Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

For example:

import torch
import torch.nn as nn


@torch.jit.interface
class ModuleInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=200, out_features=1, bias=True)
        )
        self.net1 = nn.Parameter(torch.Tensor(1, 200))

    def forward(self, feat):
        feat.requires_grad_()
        m = torch.rand(1, 200).to("cuda")
        b = feat * m
        return b


class SANNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=200, out_features=1, bias=True)
        )
        self.net1 = nn.Parameter(torch.Tensor(1, 200))
        self.net2 = nn.ModuleList()
        self.net2.append(Net())

    def forward(self, feat):
        feat.requires_grad_()
        m = torch.rand(1, 200).to("cuda")
        # index into the Sequential through the TorchScript interface annotation
        net: ModuleInterface = self.net[0]
        c = net(m)
        b = feat * c
        return b


torch.set_default_dtype(torch.float64)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SANNet()
model = model.to(device)
model.train()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
f = nn.MSELoss()
aa = torch.rand(1, 200)
data = torch.utils.data.DataLoader(aa, batch_size=1, shuffle=True, num_workers=1)

for i, x in enumerate(data):
    x = torch.rand(1, 200).to(device)
    y = model(x)
    loss = f(x, y) / 10.0
    loss.backward()
    opt.step()  # the error is raised here
    opt.zero_grad()

When I set torch.set_default_dtype(torch.float32), it works, but with torch.set_default_dtype(torch.float64) it reports this error:

Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

I can narrow the problem down to these lines:

net: ModuleInterface = self.net[0]
c = net(m)

Thanks for the report, this seems like a bug. I’ve simplified your repro and filed an issue here: Foreach optimizers don't work with torch.set_default_dtype(torch.float64) · Issue #111671 · pytorch/pytorch · GitHub.

Is there any particular reason you need to use torch.float64 for your case?
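For reference, a minimal sketch of the kind of simplified repro described above (my own reconstruction, not the exact code in the linked issue; it assumes a CUDA device and an affected PyTorch version):

import torch

torch.set_default_dtype(torch.float64)

# Any foreach-capable optimizer stepping over CUDA parameters hits the same check.
p = torch.nn.Parameter(torch.rand(10, device="cuda"))
opt = torch.optim.Adam([p], lr=0.001, foreach=True)
p.sum().backward()
opt.step()  # raises: Tensors of the same index must be on the same device and the same dtype ...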


Hello friend,
I’m having the same problem. Do you have any idea what is causing the bug?
RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32 notwithstanding

Thanks!

Hi,
Looks like someone has signed up to work on it, so there should be a fix in a nightly soon. In the meantime, some workarounds are:

  • use foreach=False with the optimizer (you’d miss out on some performance here; see the sketch after this list)
  • avoid torch.set_default_dtype(torch.float64)
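A minimal sketch of the first workaround, assuming the model and training loop from the repro above:

# Passing foreach=False makes Adam fall back to the single-tensor
# implementation, which sidesteps the multi-tensor dtype/device check.
opt = torch.optim.Adam(model.parameters(), lr=0.001, foreach=False)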

Thank you very much for your attention to this issue

One possible workaround is to avoid torch.set_default_dtype(torch.float64) and instead specify the dtype explicitly in the network:

model.double()
m = torch.rand(1, 200).to(device).to(torch.float64)

It gets verbose, but at least you can train in double precision.
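Put together, a hedged sketch of this workaround (using a small stand-in model rather than the SANNet repro, since every tensor created inside forward would also need the explicit cast shown above):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Default dtype stays float32; the model and inputs are cast to float64 explicitly.
model = nn.Linear(200, 200).to(device).double()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
f = nn.MSELoss()

x = torch.rand(1, 200, device=device, dtype=torch.float64)
y = model(x)
loss = f(x, y)
loss.backward()
opt.step()  # no mismatch: the optimizer's `step` counters stay CPU float32
opt.zero_grad()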

Is this fixed yet in the new build? I am still getting an error.

A more verbose solution would be to manually and explicitly specify all dtypes and devices.
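In that spirit, a short sketch of what "specify everything" might look like (hypothetical shapes, just to illustrate the pattern of never relying on the defaults):

# Every factory call names its device and dtype instead of relying on defaults.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
w = torch.rand(1, 200, device=device, dtype=torch.float64)
feat = torch.zeros(1, 200, device=device, dtype=torch.float64)
out = feat * w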

Any news on this issue? I’m still getting this error in 2.1.1.

Same issue here, but I do have torch.set_default_dtype(torch.float32) and I’m still getting the exact same error.