CUDA error when using register_buffer instead of nn.Parameter

I’m trying to implement a Linear layer with an extra constant weight, weight_fb, that is used to compute grad_input in the backward pass.

So I implemented a MyLinearFunc Function similar to the one here. In the forward function of the MyLinear Module, I return

MyLinearFunc.apply(input, self.weight, self.weight_fb, self.bias)

and the return of backward function in MyLinearFunc is

grad_input, grad_weight, None, grad_bias

When I initialize weight_fb as

self.weight_fb = nn.Parameter(torch.Tensor(out_features, in_features), requires_grad=False)

everything works fine.

But as weight_fb is not a learnable parameter, I thought it would be better to use

self.weight_fb = torch.Tensor(out_features, in_features)
self.register_buffer('feedback_weight', self.weight_fb)

But on the GPU this gives: CUDA error: an illegal memory access was encountered

Just wondering whether an nn.Parameter and a Tensor registered with register_buffer() give the same training results, and whether I did something wrong when using register_buffer().

Any help would be appreciated!
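
For reference, here is a minimal sketch of how the two registrations differ in bookkeeping. The module name Demo and the attribute names frozen and const are made up for illustration and are not from the code in this thread. Both tensors end up in state_dict, but only the nn.Parameter is returned by parameters(), and neither one receives gradients.

```python
import torch
import torch.nn as nn


class Demo(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Frozen parameter: listed by parameters() and saved in state_dict.
        self.frozen = nn.Parameter(torch.zeros(2, 3), requires_grad=False)
        # Buffer: not listed by parameters(), but still saved in state_dict.
        self.register_buffer("const", torch.zeros(2, 3))


demo = Demo()
param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]
state_keys = sorted(demo.state_dict().keys())
print(param_names, buffer_names, state_keys)
```

Since neither tensor accumulates a gradient, an optimizer built from model.parameters() would be expected to leave both unchanged; the difference is mainly in how the module lists them.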

Could you post a minimal code snippet to reproduce this issue?
Also, are you using a non-default device (i.e. not cuda:0)? If so, this error might be unrelated to your usage of parameters and buffers, but might be a bug in the extensions, which we are currently debugging.

I am using a default device. Here is a code snippet:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function
from torch import Tensor
import math

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class MyLinearFunc(Function):

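    @staticmethod
    def forward(ctx, input, weight, weight_fb, bias=None):
        ctx.save_for_backward(input, weight, weight_fb, bias)
        # Standard linear forward pass (the expression is missing from the post; assumed).
        output = input.mm(weight.t())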
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

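    @staticmethod
    def backward(ctx, grad_output):
        input, weight, weight_fb, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_weight_fb = grad_bias = None

        if ctx.needs_input_grad[0]:
            # Use the constant feedback weight instead of weight
            # (the expression is missing from the post; assumed).
            grad_input = grad_output.mm(weight_fb)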
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[3]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_weight_fb, grad_bias

class MyLinear(nn.Module):
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super(MyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.Tensor(out_features, in_features))
        #self.weight_fb = nn.Parameter(torch.Tensor(out_features, in_features), requires_grad=False)
        self.weight_fb = torch.Tensor(out_features, in_features) # feedback weight
        self.register_buffer('feedback_weight', self.weight_fb)
        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)

    def reset_parameters(self) -> None:
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.weight_fb, a=math.sqrt(5)) # feedback weight
        if self.bias is not None:
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, input: Tensor) -> Tensor:
        return MyLinearFunc.apply(input, self.weight, self.weight_fb, self.bias)

    def extra_repr(self) -> str:
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )

N, D_in, H, D_out = 64, 1000, 100, 10

model = nn.Sequential(
    MyLinear(D_in, H),
    MyLinear(H, D_out)
).to(device)

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(3):
    x = torch.randn(N, D_in).to(device=device)
    y = torch.randn(N, D_out).to(device=device)
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)
    # Compute and print loss.
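    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    # Backward pass and parameter update (assumed; the post is cut off here).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
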

Thanks a lot!

Thanks for the code.
weight_fb needs to be on the device, as it’s not an nn.Parameter:

self.weight_fb = torch.Tensor(out_features, in_features).to(device)

However, the error message shouldn’t be an illegal memory access, so thanks for reporting it.

Thanks a lot for your reply! That solves the problem.

I thought register_buffer would move weight_fb to the device when the network is moved to the device. So does register_buffer only make weight_fb appear in the state_dict so that it can be saved? Does that mean I don’t need to register it if I don’t care about its value?

The feedback_weight buffer will be registered and pushed to the device.
However, you are not using it (via self.feedback_weight), but instead self.weight_fb, which is not the buffer.
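
A small sketch of that distinction, using a hypothetical module named Holder: Module.to() replaces the registered buffer with a converted copy, while the plain attribute keeps pointing at the original tensor. So that it also runs on a CPU-only machine, a dtype cast stands in for the move to the GPU; the same mechanism applies to .to(device).

```python
import torch
import torch.nn as nn


class Holder(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: the module does not track it.
        self.weight_fb = torch.zeros(2, 3, dtype=torch.float32)
        # Buffer registered under a different name: the module tracks this one.
        self.register_buffer("feedback_weight", self.weight_fb)


holder = Holder()
# Module.to() converts parameters and registered buffers only.
holder.to(torch.float64)
attr_dtype = holder.weight_fb.dtype
buffer_dtype = holder.feedback_weight.dtype
print(attr_dtype, buffer_dtype)
```

After the call, self.feedback_weight has been converted but self.weight_fb has not, which mirrors a module living on the GPU while weight_fb stays on the CPU.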

I see. So instead of

self.weight_fb = torch.Tensor(out_features, in_features).to(device)
self.register_buffer('feedback_weight', self.weight_fb)

where self.feedback_weight is not actually used,

I can directly do

self.register_buffer('weight_fb', torch.Tensor(out_features, in_features))

Does that sound correct?

Yes, the second approach looks good. With it, you can still use self.weight_fb in the forward.
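
Putting the second approach together, a minimal sketch could look like the following. The names FeedbackLinearFunc and FeedbackLinear are hypothetical, and the forward and grad_input expressions are assumptions based on the description at the top of the thread (a linear forward pass, with weight_fb used for grad_input), not the original poster's exact code.

```python
import math

import torch
import torch.nn as nn
from torch.autograd import Function


class FeedbackLinearFunc(Function):  # hypothetical name
    @staticmethod
    def forward(ctx, input, weight, weight_fb, bias=None):
        ctx.save_for_backward(input, weight_fb)
        ctx.has_bias = bias is not None
        output = input.mm(weight.t())
        if bias is not None:
            output = output + bias
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight_fb = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            # Assumed: the constant feedback weight replaces weight here.
            grad_input = grad_output.mm(weight_fb)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if ctx.has_bias and ctx.needs_input_grad[3]:
            grad_bias = grad_output.sum(0)
        # weight_fb is a constant, so it gets no gradient.
        return grad_input, grad_weight, None, grad_bias


class FeedbackLinear(nn.Module):  # hypothetical name
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Registered under the same name that forward() uses.
        self.register_buffer("weight_fb", torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.weight_fb, a=math.sqrt(5))

    def forward(self, input):
        return FeedbackLinearFunc.apply(input, self.weight, self.weight_fb, self.bias)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
layer = FeedbackLinear(4, 3).to(device)  # moves weight, bias and weight_fb
x = torch.randn(5, 4, device=device, requires_grad=True)
layer(x).sum().backward()
# With a sum() loss, grad_output is all ones, so x.grad should equal ones @ weight_fb.
expected = torch.ones(5, 3, device=device).mm(layer.weight_fb)
print(torch.allclose(x.grad, expected))
```

Because the buffer is registered as weight_fb, self.weight_fb refers to the tensor the module manages, so no explicit .to(device) on the tensor is needed.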
