Some tensors getting left on CPU despite calling model.to("cuda")

Hello,

We’d like to politely request another set of eyes on a problem we’re running into while trying to train our model.

Background: We’re using complex64 tensors generated with torchaudio’s STFT. We’re using a Git repo, complexPyTorch, which provides complex-valued versions of common layers. We’re hoping to train the model on the GPU, and we have successfully trained other models on this same GPU/CUDA/PyTorch setup.
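
For reference, the inputs are complex spectrograms produced roughly like this (just a sketch; the STFT parameters here are placeholders, not our real ones):

import torch

waveform = torch.randn(1, 16000)  # placeholder audio
spec = torch.stft(waveform, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
print(spec.dtype)  # torch.complex64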

Here’s some relevant code from our training file:

import torch
import torch.nn as nn
import torch.optim as optim
from time import time

# model, params, dataLoader, and learning_rate are defined earlier in the file

device = torch.device("cuda")
net = model(params)
net = net.to(device)

criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

# Training

start_time = round(time())

for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0

    for i, data in enumerate(dataLoader, 0):

        # get the inputs; data is a dict holding the 'song' and 'vocal' tensors
        inputs = data['song']
        labels = data['vocal']

        # move the batch to the GPU
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

print('Finished Training', str(round(time()) - start_time))

PATH = './PhATPUSSE.pth'
torch.save(net.state_dict(), PATH)

So, we attempt to move the model and all data to the GPU.

When we try to train we get this error:

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

We followed the traceback and figured out that it’s the weight tensors of our nn.Linear layers that don’t make it onto the GPU. Many other tensors do make it to the GPU successfully, such as our data and the tensors belonging to the Conv2d layers. Any tips are appreciated, thanks so much! If you would like to see any of the architecture code etc., let me know.
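
In case it helps, a quick way to see which registered tensors end up where (a sketch, using net and dataLoader from the code above):

# print the device of every parameter the model actually registers
for name, p in net.named_parameters():
    print(name, p.device)

# and the device of one batch after moving it
batch = next(iter(dataLoader))['song'].to(device)
print(batch.device)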

Could you post the model definition and in particular how the problematic nn.Linear layer is initialized?

Yes, definitely. I’ll try to copy in only the relevant parts for brevity.

Definition of complex layers.

root/complex/complexLayers.py:

import torch
from torch.nn import Module, Parameter, init, Conv2d, Linear, BatchNorm1d, BatchNorm2d, LayerNorm, ConvTranspose2d

def apply_complex(fr, fi, input):
    # complex linear map built from two real layers:
    # (fr + 1j*fi)(real + 1j*imag) = (fr(real) - fi(imag)) + 1j*(fr(imag) + fi(real))
    return (fr(input.real)-fi(input.imag)).type(torch.complex64) \
            + 1j*(fr(input.imag)+fi(input.real)).type(torch.complex64)

class ComplexLinear(Module):

    def __init__(self, in_features, out_features, bias=True):
        super(ComplexLinear, self).__init__()
        self.fc_r = Linear(in_features, out_features, bias=bias)
        self.fc_i = Linear(in_features, out_features, bias=bias)

    def forward(self, input):
        return apply_complex(self.fc_r, self.fc_i, input)
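
A quick standalone sanity check of the layer (just a sketch to show what it returns):

# hypothetical check of ComplexLinear on a random complex batch
x = torch.randn(8, 16, dtype=torch.complex64)
fc = ComplexLinear(16, 32)
out = fc(x)
print(out.shape, out.dtype)  # torch.Size([8, 32]) torch.complex64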

Here’s where our model lives. The model holds the Encoder, which holds the TransformerBlocks, each of which holds a multi-headed self-attention (MSA) module, which holds the AttentionHeads. The ComplexLinear layers are in the AttentionHead, and the complex layers are imported from another folder in the root directory.

root/models/TransUNet.py

import torch
import torch.nn as nn
from collections import OrderedDict
from complex.complexLayers import ComplexConv2d, ComplexLinear, ComplexReLU, ComplexBatchNorm1d, ComplexDropout, NaiveComplexLayerNorm

# config, complex_matmul, Embedding, Reshape, Decoder, and MLP are defined elsewhere and omitted for brevity

class model(nn.Module):

    def __init__(self, params):
        super(model, self).__init__()

        self.encoder = Encoder(params)
        self.reshaper = Reshape()
        self.decoder = Decoder()

    def forward(self, x):

        # Converting x to complex
        x = torch.view_as_complex(x)
        x = x.permute(1, 0, 2, 3)

        x = self.encoder(x)
        x = self.reshaper(x)
        x = self.decoder(x)

        # Converting output to real
        x = torch.view_as_real(x)

        return x

      
class Encoder(nn.Module):

    def __init__(self, params):
        super(Encoder, self).__init__()

        self.embedding = Embedding(params=params)

        self.transformers = nn.Sequential(OrderedDict([("Block " + str(i), TransformerBlock(params)) for i in range(config.num_transformers)]))

    def forward(self, x):
        x = self.embedding(x)
        x = self.transformers(x)
        return x


class TransformerBlock(nn.Module):

    def __init__(self, params):
        super(TransformerBlock, self).__init__()

        self.attn_norm = NaiveComplexLayerNorm((params["num_patches"], config.encoding_size), eps=config.norm_eps)
        self.attn = MSA()

        self.ffn_norm = NaiveComplexLayerNorm((params["num_patches"], config.encoding_size), eps=config.norm_eps)
        self.ffn = MLP()


    def forward(self, x):
        h = x
        x = self.attn_norm(x)
        x = self.attn(x)
        x = x + h

        h = x
        x = self.ffn_norm(x)
        x = self.ffn(x)
        x = x + h

        return x

        
class MSA(nn.Module):
    def __init__(self):
        super(MSA, self).__init__()

        self.heads = [AttentionHead(2, 1) for _ in range(config.num_heads)]

        self.w = ComplexLinear(config.attention_size * config.num_heads, config.encoding_size, bias=config.attention_bias)

        self.dropout = ComplexDropout(config.dropout_rate)

    def forward(self, x):
        all_head_size = x.shape[-1] * config.num_heads
        multi_head_shape = list(x.shape)
        multi_head_shape[-1] = all_head_size

        multi_head = torch.zeros(multi_head_shape, dtype=torch.complex64)
        for i, head in enumerate(self.heads):
            multi_head[:, :, :, (i * config.attention_size):((i+1) * config.attention_size)] = head(x)

        x = self.w(multi_head)
        x = self.dropout(x)

        return x

class AttentionHead(nn.Module):

    def __init__(self, in_channels=2, out_channels=1):
        super(AttentionHead, self).__init__()

        #self.num_heads = config.num_heads
        
        self.keys = ComplexLinear(config.encoding_size, config.attention_size, bias=config.attention_bias)
        self.queries = ComplexLinear(config.encoding_size, config.attention_size, bias=config.attention_bias)
        self.values = ComplexLinear(config.encoding_size, config.attention_size, bias=config.attention_bias)

        self.complex_map = nn.Conv2d(in_channels, out_channels, 3, padding=1)

        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, x):

        keys = self.keys(x)
        queries = self.queries(x)
        values = self.values(x)

        scores = complex_matmul(queries, keys.transpose(-1, -2))
        scores /= config.attention_size ** 0.5

        scores = torch.view_as_real(scores)
        scores = scores[:, 0, :, :]
        scores = scores.permute(0, 3, 1, 2)
        scores = self.complex_map(scores)
        scores = nn.Softmax(dim=-1)(scores)

        scores = self.dropout(scores)

        scores = torch.complex(scores, torch.zeros_like(scores))

        return complex_matmul(scores, values)

Thanks so much for your time, man, it helps a ton!!

Thanks for the code!
It looks generally good, but this initialization is likely the problem:

self.heads = [AttentionHead(2, 1) for _ in range(config.num_heads)]

Creating layers in a plain Python list will not properly register their parameters in the parent module. As a result, model.parameters() will not return them (i.e. the optimizer will not update them) and model.to('cuda') will not move them (they will stay on their original device).
Use nn.ModuleList instead and rerun the code.
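Something along these lines (a minimal sketch of the change in MSA.__init__; everything else in the module can stay as posted):

# register the heads so that .parameters() and .to() can see them
self.heads = nn.ModuleList([AttentionHead(2, 1) for _ in range(config.num_heads)])
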
Let me know if this works.


Yes, that was the problem! Really appreciate the help. We also ended up moving to the nightly build for better complex support in the optimizer, but from the looks of it our training is working!

Gabe

Is it the case that any Python data structure (lists, dicts, etc.) is stored on the CPU by default, even if the elements of the data structure are tensors on the GPU?

Yes, Python data structures themselves are stored on the host, but the error here was raised because a plain Python list containing trainable PyTorch parameters is not recognized and registered inside the nn.Module. The issue would also be visible if all PyTorch tensors were stored on the CPU, so it is unrelated to device usage.
Internally the nn.Module will register trainable parameters and buffers in its internal ._parameters and ._buffers attributes as seen here:

bn = nn.BatchNorm2d(3)

# internal attributes
print(bn._parameters)
# OrderedDict([('weight', Parameter containing:
# tensor([1., 1., 1.], requires_grad=True)), ('bias', Parameter containing:
# tensor([0., 0., 0.], requires_grad=True))])

print(bn._buffers)
# OrderedDict([('running_mean', tensor([0., 0., 0.])), ('running_var', tensor([1., 1., 1.])), ('num_batches_tracked', tensor(0))])

these internal OrderedDicts are then used to return the parameters and buffers via:

print(list(bn.parameters()))
# [Parameter containing:
# tensor([1., 1., 1.], requires_grad=True), Parameter containing:
# tensor([0., 0., 0.], requires_grad=True)]

which is also how the parameters are commonly passed to the optimizer or moved to the GPU.
Using a plain Python list skips this registration step, so .parameters() and .buffers() will not return these tensors and the .to() operation will not move them.
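
A small standalone comparison to illustrate the difference (hypothetical module names, just for demonstration):

import torch.nn as nn

class ListHolder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4)]  # plain Python list: parameters are NOT registered

class ModuleListHolder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4)])  # registered as submodules

print(len(list(ListHolder().parameters())))        # 0
print(len(list(ModuleListHolder().parameters())))  # 2 (weight and bias)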
