Optimiser expects CPU when adding custom layer, runs fine without it

Hello, I understand that most errors of the form “expected device cpu but got cuda” (or vice versa) arise from not moving tensors or models to the right device. However, I seem to have run into something different. I am trying to add a custom layer to a simple GCN net.

I have checked that all tensors and model params are on the GPU. However, whenever I add the layer to the network, the optimizer breaks with the following error:

File "driver_crf_gcn.py", line 35, in train
  File "/home/cmb-05/qbio/raktimmi/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/cmb-05/qbio/raktimmi/anaconda3/lib/python3.7/site-packages/torch/optim/adam.py", line 96, in step
    grad = grad.add(p, alpha=group['weight_decay'])
RuntimeError: expected device cpu but got device cuda:0
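The failing line in adam.py adds the parameter `p` to its gradient `grad`, so the mismatch seems to be between a parameter and its own gradient rather than between the inputs and the model. A minimal sketch of a check for that, meant to be called between `loss.backward()` and `optimizer.step()`, is below. The helper name `device_mismatches` and the toy `nn.Linear` model are hypothetical and only there to make the snippet runnable.

```python
import torch
import torch.nn as nn


def device_mismatches(model: nn.Module):
    """Return (name, param device, grad device) for every parameter whose
    gradient lives on a different device than the parameter itself."""
    mismatches = []
    for name, p in model.named_parameters():
        if p.grad is not None and p.grad.device != p.device:
            mismatches.append((name, str(p.device), str(p.grad.device)))
    return mismatches


# Hypothetical toy model, used only to exercise the helper.
model = nn.Linear(3, 2)
model(torch.ones(1, 3)).sum().backward()
mismatches = device_mismatches(model)
print(mismatches)
```

An empty list means every gradient sits on the same device as its parameter; any entry names a parameter worth inspecting.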

I am quite helpless at this point as I am not very experienced; if anyone can tell me what’s happening, it would be a great help.

Here’s what my custom layer looks like:

class CRFLayer(MessagePassing):
    def __init__(self):
        super(CRFLayer, self).__init__(aggr='add')          # "Add" aggregation.
        self.log_alpha = nn.Parameter(torch.rand(1),requires_grad = True)
        self.log_beta = nn.Parameter(torch.rand(1),requires_grad = True)
        self.logsigmasq = nn.Parameter(torch.rand(1),requires_grad = True)
        self.sigmasq = torch.exp(self.logsigmasq).to(device)           # Using same process for > 0  constraint as \alpha and \beta for consistency
        self.alpha = torch.exp(self.log_alpha).to(device)
        self.beta = torch.exp(self.log_beta).to(device)
        self.gij = None                                     # temporary memorisation of gij for each protein to avoid repeat computation
                                                            # will be of size (E x 1) when assigned in the first call to propagate 

    def forward(self, x, edge_index):                       # edge_index has shape [2, E]
        b = x.clone()
        x = self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x, b=b)
        self.gij = None                                     # set gij None for the next train graph
        return x

    def message(self, x_j, b_i, b_j):
        # For each edge (i,j) \in E, compute gij.
        if self.gij is None:                                # size (E x 1)     ## Calculate g_ij only at the start 
            self.gij = torch.exp( torch.nn.functional.cosine_similarity(b_i, b_j, dim=-1)/(self.sigmasq)).to(device)
        # For each edge (i,j) \in E, compute gij*xj.
        gijxj = self.gij.view(-1,1)*x_j                     # size (E x 2)      ## compute g_ijxj for each edge (i,j)
        ret = torch.cat((self.gij.view(-1,1),gijxj),dim=1)  # size (E x 3) ## message (gij, gijxj) for each edge (i,j)
        return ret

    def update(self, aggr_out, b, x):                       # aggr_out has size (V x 3)
        # For each vertex i \in V, gij aggregated over its neighbours j; required in the denominator of the update equation.
        gij_aggregated_over_j = aggr_out[:,0]               # size (V,) # sum_{j \in N(i)}g_ij  \forall{i}
        # For each vertex i \in V, gij*xj aggregated over its neighbours j; required in the numerator of the update equation.
        gijxj_aggregated_over_j = aggr_out[:,1:]            # size (V x 2) # sum_{j \in N(i)}g_ijxj \forall{i}

        x = (self.alpha*b + self.beta*gijxj_aggregated_over_j)
        x = x/((self.alpha + self.beta*gij_aggregated_over_j).view(-1,1) + EPS)
        return x

Here’s the network. It runs perfectly if I remove the new layer, and it also runs perfectly on CPU with or without the layer.

class GCNNet(nn.Module):
    def __init__(self, dataset):
        super(GCNNet, self).__init__()
        self.conv1 = GCNConv(dataset.num_features, 2)
        self.crf = CRFLayer()
        self.conv2 = GCNConv(2, dataset.num_classes)
        self.crf_loss = CGNF_Loss()

    def forward(self, data, y, A, E):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)
        x = self.crf(x, edge_index)
        x = self.conv2(x, edge_index)
        x = F.log_softmax(x, dim=1)
        loss = self.crf_loss(x,y,edge_index, E)
        return x, loss

OK, so I found the problem. It was the way I had been transforming the parameters in the __init__ step of the custom layer. Using the registered log_alpha, log_beta, and logsigmasq params directly, instead of exponentiating them in the __init__ step, solved the issue.
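A likely explanation (my reading, not something I have verified in the PyTorch source) is that the `torch.exp(...)` calls in `__init__` run once, while the parameters are still on the CPU, so the resulting tensors stay tied to that old graph even after the model is moved to the GPU. A minimal sketch of the fix is below, assuming the exponentiation is moved to the point of use. The class `PositiveScale` is a hypothetical stand-in for `CRFLayer` that shows only the parameter handling, not the message passing.

```python
import torch
import torch.nn as nn


class PositiveScale(nn.Module):
    """Hypothetical stand-in for CRFLayer: only the parameter handling is shown."""

    def __init__(self):
        super().__init__()
        # Register only the unconstrained parameter; model.to(device) moves it.
        self.log_alpha = nn.Parameter(torch.rand(1))

    @property
    def alpha(self):
        # Recomputed on every access, so it is on log_alpha's current device
        # and belongs to the autograd graph of the current forward pass.
        return torch.exp(self.log_alpha)

    def forward(self, x):
        return self.alpha * x


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = PositiveScale().to(device)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.01, weight_decay=5e-4)

x = torch.ones(4, 2, device=device)
before = layer.log_alpha.detach().clone()

optimizer.zero_grad()
loss = layer(x).sum()
loss.backward()
optimizer.step()  # weight decay adds p to grad; both are on the same device

grad_device = layer.log_alpha.grad.device
param_device = layer.log_alpha.device
```

With this pattern there is no need to call `.to(device)` inside the layer at all, since the derived values follow the registered parameters wherever the model is moved.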