Defining a custom activation function

I created a custom activation function MyReLU;
however, when I use it in a two-layer model I get the error

MyReLU.apply is not a Module subclass

MyReLU is a subclass of torch.autograd.Function

You don’t use a Function in places where a Module is expected, i.e. in the __init__ of your main module. You just invoke MyReLU.apply in forward(). If you want to use a Function in containers like nn.Sequential, you must wrap it in a Module, as sketched below.
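For example, a minimal wrapper might look like this (MyReLUModule is just an illustrative name, not anything from the original posts):

import torch
import torch.nn as nn

class MyReLUModule(nn.Module):
    """Thin nn.Module wrapper so MyReLU.apply can live in containers."""
    def forward(self, input):
        # delegate to the autograd.Function; .apply() builds the backward graph
        return MyReLU.apply(input)

# the custom activation can now be used like any other layer:
model = nn.Sequential(nn.Linear(10, 10), MyReLUModule())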


When I use MyReLU.apply in the forward method of a module, it does not work either:

import torch
import torch.nn as nn
from torch.autograd import Variable

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)  # keep the input for the backward pass
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0  # ReLU passes no gradient where the input was negative
        return grad_input


class weightedReLU(nn.Module):
    def __init__(self, weights = 1):
        super().__init__()
        self.weights = weights* nn.Parameter(torch.ones(1))
        self.weights = Variable(self.weights.data, requires_grad=True)
        
    def forward(self, input):
        ex = self.weights.cuda()*MyReLU.apply(input)
        return ex
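As an aside, a custom backward like MyReLU's can be sanity-checked with gradcheck (a quick sketch; gradcheck wants double-precision inputs):

from torch.autograd import gradcheck

x = torch.randn(8, dtype=torch.double, requires_grad=True)
print(gradcheck(MyReLU.apply, (x,)))  # True if analytic and numeric gradients match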

Maybe this can help!

As Alex mentioned, they do not work in containers like nn.Sequential.

IDK, your module works for me. If you use JIT, use an additional wrapper:

@torch.jit.ignore
def my_relu(x):  # renamed so the wrapper doesn't shadow the Function itself
    return MyReLU.apply(x)

Does it also work when you wrap the model inside nn.Sequential?

What’s the difference between defining a class as an autograd Function or as an nn.Module?

Yea, I think your error comes from some other place…

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.s = nn.Sequential(nn.ELU(), weightedReLU())

    def forward(self, x):
        return self.s(x)

m = M()
y = m(torch.ones(5).requires_grad_())  # call the module itself, not forward() directly
y.sum().backward()

(your code untouched, except I removed cuda())

Function is not related to nn.Module; you don’t even create instances of it explicitly, it is just a way to provide backward(). The Module base class provides all the usual facilities for submodules, parameters, module tree enumeration etc.
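A minimal sketch of the contrast (ScaledReLU is an illustrative name): a Function only defines the forward/backward math and is never instantiated, while a Module registers state that optimizers and .cuda()/.to() will see.

# the Function is used only through .apply():
out = MyReLU.apply(torch.randn(3))

# a Module, by contrast, registers state such as parameters:
class ScaledReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))  # tracked by the module

    def forward(self, x):
        return self.scale * MyReLU.apply(x)

print(list(ScaledReLU().parameters()))  # the scale parameter shows up here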


Thanks for your answer.

Suppose I need to include a learnable parameter in an autograd Function,
similar to the module below:

class CustAct(torch.nn.Module):
    def __init__(self, alphas=1):
        super(CustAct, self).__init__()
        self.alphas = alphas*torch.nn.Parameter(torch.ones(1))
        self.alphas.requiresGrad = True

    def forward(self, x):
        val = self.alphas.cuda() * x
        return val

How can I include this in an autograd Function?

Is there a way that I can remove cuda() so the module automatically works on both CPU and CUDA? Even when I use CustAct.cuda(), I still need to explicitly move self.alphas to the GPU.

class CustAct(torch.nn.Module):
    def __init__(self, alphas=1):
        super(CustAct, self).__init__()
        self.alphas = torch.nn.Parameter(torch.full((1,), float(alphas)))
        # if you want a separate non-trainable constant, multiply it in forward():
        self.register_buffer("scale", torch.tensor([float(alphas)]))
        # self.scale = alphas  # a plain attribute works too

    def forward(self, x):
        # val = customAutogradFunction(x, self.alphas)  # only if you must provide custom gradients
        val = x * self.alphas  # no need for autograd.Function
        # val = x.clamp(min=0)  # your MyReLU as is also needs no autograd.Function
        return val

That’s because of the way you used multiplication: your self.alphas became a Tensor, not an nn.Parameter, and so it was not moved. Another idiom is tensorA2 = tensorA.to(tensorB); this changes device and dtype, but it is not usually needed.
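A quick sketch of both points, using the CustAct above (the CUDA branch assumes a GPU is available):

m = CustAct(alphas=2)
if torch.cuda.is_available():
    m.cuda()
    print(m.alphas.device)  # cuda:0 -- a registered Parameter moves with the module

# the .to(other) idiom: match another tensor's device and dtype in one call
a = torch.ones(3)
b = torch.zeros(3, dtype=torch.float64)
a2 = a.to(b)  # a2 is float64, on b's device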


What is the effect of register_buffer in your code?
I am loading some checkpoints, and with this method the loader asks for the values of alphas. Is there any workaround for this problem?
How could I make a function that returns the gradients with respect to alphas separately, not coupled with the other weights?

Is there a way that I can have a separate alpha for each neuron? It seems that in this code you used the same self.alphas for all the elements of the n-dimensional array x.

That code was purely illustrative; register_buffer is just a way to store non-trainable tensors in a module (sometimes it is more convenient even for scalars).
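A small sketch of the difference, reusing the CustAct above: buffers appear in the state_dict and move with the module, but optimizers ignore them.

m = CustAct(alphas=3)
print(m.state_dict().keys())                 # odict_keys(['alphas', 'scale'])
print([n for n, _ in m.named_parameters()])  # ['alphas'] -- 'scale' is not trainable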

Why do you want to manually return gradients in the first place? You usually only do this for external (thus non-differentiable) calculations, e.g. C++ extensions, or if you have simplified/more efficient gradient formulas. In other cases you only write forward() for an nn.Module, and most operations on tensors know how to calculate their gradients.
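For instance, a per-neuron learnable alpha (as asked about above) needs no custom backward at all; PerNeuronAct and num_features are illustrative names, and autograd supplies the gradient for alphas:

class PerNeuronAct(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # one alpha per feature; broadcasting applies them element-wise
        self.alphas = nn.Parameter(torch.ones(num_features))

    def forward(self, x):
        return self.alphas * x.clamp(min=0)

m = PerNeuronAct(5)
y = m(torch.randn(2, 5))
y.sum().backward()
print(m.alphas.grad.shape)  # torch.Size([5]) -- one gradient per alpha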

For this code, does it mean that the weights are no longer trainable, since they changed to a Tensor?

class weightedReLU(nn.Module):
    def __init__(self, weights=1):
        super().__init__()
        self.weights = weights * nn.Parameter(torch.ones(1))
        self.weights = Variable(self.weights.data, requires_grad=True)

    def forward(self, input):
        ex = self.weights.cuda() * MyReLU.apply(input)
        return ex

I don’t know what effect such a reassignment has, because that’s not idiomatic PyTorch code, to put it mildly. Variable is obsolete, and that line shouldn’t be there at all. As for the first assignment: yes, it may train, but your parameter won’t be registered.

I added the line with Variable, since otherwise backward() asked for retain_graph=True.

It seems Variable will register the gradient of weights in the backward method.

This code and yours produce completely different results.

No, that’s probably the error that you get if you call backward() twice.

Actually, weights * nn.Parameter(torch.ones(1)) won’t train, as the optimizer won’t find this parameter, so there is that. Oh, and then you may get that error. What a mess 🙂

My code with nn.ReLU works fine, and with this custom activation it works fine as well after introducing Variable (if I called backward twice, it should be problematic with nn.ReLU too).

If you look at this post, it is suggested to do multiplication for the trainable variable.

This creates an untracked parameter: it is not in the module, but it is in the backward graph. The optimizer doesn’t process it, so you get the error on the second iteration.

self.weights = Variable(self.weights.data, requires_grad=True) 

This probably just makes the first assignment have no effect, and is equivalent to

self.weights = nn.Parameter(torch.ones(1) * weights)

or torch.full((1,), weights)

list(LearnedSwish().parameters())
[]

good luck optimizing that 🙂
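For contrast, a quick sketch with the nn.Parameter idiom above (FixedWeightedReLU is an illustrative rewrite, not code from the thread):

class FixedWeightedReLU(nn.Module):
    def __init__(self, weights=1):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(1) * weights)  # properly registered

    def forward(self, input):
        return self.weights * input.clamp(min=0)

print(list(FixedWeightedReLU().parameters()))
# [Parameter containing: tensor([1.], requires_grad=True)] -- the optimizer will find it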

I tried this and printed the weights. It seems that they are not training and stay constant. Even when I load the checkpoints, it does not ask for the values of those parameters across the layers.

For this one

self.weights = nn.Parameter(torch.ones(1) * weights)

is there a way to make the loader skip these values when loading from a checkpoint saved with the conventional nn.ReLU activation?

Not sure; there is nn.Module.load_state_dict(torch.load(PATH), strict=False). If you’re using training-loop frameworks, check their docs…
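A minimal sketch of that workaround ("model.pt" is an illustrative path; strict=False skips mismatched keys and reports them):

model = M()  # the model that contains the extra alphas/weights parameters
state = torch.load("model.pt")  # checkpoint trained with plain nn.ReLU
missing, unexpected = model.load_state_dict(state, strict=False)
print(missing)     # keys present in the model but absent from the checkpoint (e.g. the alphas)
print(unexpected)  # keys in the checkpoint that the model doesn't have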