For example:
```python
import torch
import torch.nn as nn


@torch.jit.interface
class ModuleInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=200, out_features=1, bias=True)
        )
        # uninitialized parameter, never used in forward
        self.net1 = nn.Parameter(torch.Tensor(1, 200))

    def forward(self, feat):
        feat.requires_grad_()
        m = torch.rand(1, 200).to(feat.device)
        b = feat * m
        return b


class SANNet(nn.Module):
    def __init__(self):
        super(SANNet, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=200, out_features=1, bias=True)
        )
        self.net1 = nn.Parameter(torch.Tensor(1, 200))
        self.net2 = nn.ModuleList()
        self.net2.append(Net())

    def forward(self, feat):
        feat.requires_grad_()
        m = torch.rand(1, 200).to(feat.device)
        # annotate the submodule with the JIT interface before calling it
        net: ModuleInterface = self.net[0]
        c = net(m)
        b = feat * c
        return b


torch.set_default_dtype(torch.float64)  # float32 works; float64 fails
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SANNet()
model = model.to(device)
model.train()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
f = nn.MSELoss()
aa = torch.rand(1, 200)
data = torch.utils.data.DataLoader(aa, batch_size=1, shuffle=True, num_workers=1)
for i, x in enumerate(data):
    # the loaded batch is replaced with fresh random input each step
    x = torch.rand(1, 200).to(device)
    y = model(x)
    loss = f(x, y) / 10.0
    loss.backward()
    opt.step()
    opt.zero_grad()
```
With `torch.set_default_dtype(torch.float32)` the script runs without problems, but with `torch.set_default_dtype(torch.float64)` it fails with the following error:

```
Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
```
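To see which parameter/gradient pair actually trips this check, one can dump the dtype and device of every parameter and its gradient right before `opt.step()` (a diagnostic sketch added for illustration, not part of the original repro):

```python
# Print dtype/device for each parameter and its gradient just before
# opt.step(), to spot the tensor pair failing the dtype/device grouping.
for name, p in model.named_parameters():
    g = p.grad
    print(name,
          "param:", p.dtype, p.device,
          "grad:", None if g is None else (g.dtype, g.device))
```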
I have narrowed the problem down to these two lines:

```python
net: ModuleInterface = self.net[0]
c = net(m)
```
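If the mismatch indeed comes from Adam's multi-tensor (foreach) path grouping these tensors together, forcing the single-tensor implementation might sidestep the check; this is an untested guess, not a confirmed fix:

```python
# Untested workaround sketch: force Adam's single-tensor implementation
# instead of the foreach path that performs the device/dtype grouping.
opt = torch.optim.Adam(model.parameters(), lr=0.001, foreach=False)
```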