Beginner: Should ReLU/sigmoid be called in the __init__ method?

(S) #1

I am trying to rebuild a Keras architecture in PyTorch, which looks like this:

    rnn_layer1 = GRU(25) (emb_seq_title_description)
    # [...]
    main_l = Dropout(0.1)(Dense(512,activation='relu') (main_l))
    main_l = Dropout(0.1)(Dense(64,activation='relu') (main_l))
    output = Dense(1,activation="sigmoid") (main_l)

So I tried to adapt the basic RNN example in PyTorch and add ReLUs to the Linear layers. However, I am not sure whether I can call ReLU directly in the forward method or whether it should be set up in the __init__ method.

My first try looks like this:

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = nn.ReLU(self.i2h(combined))
        output = nn.ReLU(self.i2o(combined))
        output = nn.sigmoid(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
n_letters = 26
n_categories = 2
rnn = RNN(n_letters, n_hidden, n_categories)

However, when I print the rnn object in Python, I don’t see the ReLUs, so maybe it’s not right to call nn.ReLU directly in the forward method…

  (i2h): Linear(in_features=154, out_features=128, bias=True)
  (i2o): Linear(in_features=154, out_features=2, bias=True)


Since nn.ReLU is a class, you have to instantiate it first. This can be done in the __init__ method or, if you prefer, directly in forward:

hidden = nn.ReLU()(self.i2h(combined))

However, I would create an instance in __init__ and just call it in the forward method.

Alternatively, you don’t have to create an instance, because it’s stateless, and could directly use the functional API in forward:

hidden = F.relu(...)
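
A small sketch of the three options mentioned above, assuming `F` is `torch.nn.functional`. The tensor `x` and the variable names are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.0, 2.0]])

relu = nn.ReLU()            # instantiate once (typically in __init__) ...
out_module = relu(x)        # ... and call the instance in forward

out_inline = nn.ReLU()(x)   # instantiate and call in one expression
out_functional = F.relu(x)  # functional API, no instance needed

# All three compute the same result.
assert torch.equal(out_module, out_inline)
assert torch.equal(out_module, out_functional)
```

By contrast, `nn.ReLU(tensor)` as written in the first post hands the tensor to the constructor, so what comes back is a module object rather than an activated tensor.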

(torcher) #3

Let me repeat the question differently: what is the PyTorch-idiomatic way to use relu(), and WHY? I think the answer is to use F.relu() in the forward() function. The WHY part is important here and I’d love to hear a full answer. Hope I’m not complicating things more than necessary.


My personal preference is to use the functional API in the forward for stateless objects, e.g. F.relu.
Since nn.ReLU doesn’t store any parameters it’s not really necessary to define it using the module way.
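
To make "doesn’t store any parameters" concrete, here is a minimal sketch that counts trainable values. `count_params` is a helper written for this example, not a PyTorch function:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Number of trainable scalar values registered on the module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

conv = nn.Conv2d(3, 6, kernel_size=3)  # weight: 6*3*3*3 = 162, bias: 6
relu = nn.ReLU()

print(count_params(conv))       # 168
print(count_params(relu))       # 0
print(len(relu.state_dict()))   # 0, so there is nothing to save or load
```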

However, there is one exception when I prefer the module init and that’s when I know I will try out different activation functions in the forward pass.
So instead of changing the F.relu to another non-linearity repeatedly, I use a definition like in this example:

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.act = nn.ReLU()
    def forward(self, x):
        x = self.act(self.conv1(x))
        return x

and just switch self.act for another module.
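
One way to make that switch without editing the class each time is to pass the activation in. This is only a sketch: the `act` constructor argument is an assumption added for illustration and is not part of the model above.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, act=None):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        # `act` is a hypothetical argument; fall back to ReLU if none is given.
        self.act = act if act is not None else nn.ReLU()

    def forward(self, x):
        return self.act(self.conv1(x))

x = torch.randn(1, 3, 8, 8)
relu_model = MyModel()
tanh_model = MyModel(act=nn.Tanh())  # swapped without touching forward
print(tanh_model)                    # the activation is listed as a submodule
```

Because the activation is registered as a submodule here, it also shows up when the model is printed, which a functional call inside forward would not.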

I think it just comes down to personal preference, since neither of the two coding approaches is bad in any way.

(torcher) #5

My point isn’t really about coding style per se, but more about correct PyTorch semantics. The motivation for my follow-up question is the same as the o.p.’s: what goes into __init__() and what goes into forward()? I found that confusing as well, especially coming from Keras. I think it boils down to whether a layer (module?) holds state or not, as you put it. Maybe there are other reasons I have not learned yet. For example, you can even add new layers in forward(), e.g. F.conv2d(…), but that layer’s weights are lost once forward() exits, so it can’t be used as part of the trained model, for example at prediction time. So the semantics, not the syntax, of what goes into __init__() vs. what should go into forward() is the main point of the question.


OK, I see.
Let me try to boil it down to the standard approach and possible advantages of changing this approach.

In the standard use case, you are writing some kind of model with some layers. The layers most likely hold some parameters which should be trained; nn.Conv2d would be an example. On the other hand, some layers, like nn.MaxPool2d, don’t have any trainable parameters. However, these are usually also created in __init__. Other “layers”, like nn.ReLU, don’t have parameters and can also be seen as simple functions instead of proper layers (at least in my opinion).
In your forward method you are creating the logic of your forward (and thus also backward) pass.
In a very simple use case you would just call all created layers one by one passing the output of one layer to the other. However, since the computation graph is created dynamically, you can also create some crazy forward passes, e.g. using a random number to select a repetition of some layers, split your input into different parts and call different “sub modules” with each part, use some statistics of your input to select a specific part of your model, etc. You are not bound to any convention during this as long as all shapes etc. match.
For me this is one of the most beautiful parts of PyTorch. Basically you can let your imagination flow without worrying too much about some limited API which can only call layers in a sequential manner.
And this is also the reason to break some of these conventions I’ve mentioned before.
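
As a rough illustration of such a dynamic forward pass, the sketch below repeats one layer a random number of times per call. `DynamicNet`, its layer sizes, and `max_repeats` are all invented for this example:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicNet(nn.Module):
    def __init__(self, features=16, max_repeats=3):
        super().__init__()
        self.inp = nn.Linear(8, features)
        self.middle = nn.Linear(features, features)  # reused several times
        self.out = nn.Linear(features, 1)
        self.max_repeats = max_repeats

    def forward(self, x):
        x = F.relu(self.inp(x))
        # Plain Python control flow: the repeat count is chosen on every call.
        for _ in range(random.randint(0, self.max_repeats)):
            x = F.relu(self.middle(x))
        return self.out(x)

model = DynamicNet()
y = model(torch.randn(4, 8))
y.sum().backward()  # autograd follows whichever path was actually taken
```

The layers with parameters are still created in __init__; only the order and number of calls is decided in forward.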

Think about a specific use case where you would like to use a conv layer, but for whatever reason you need to access and maybe manipulate its weight often. The first approach of creating the layer in __init__ and applying it in forward would certainly work. However, the weight access might be a bit cumbersome.
So how about we just store the filter weights as nn.Parameters in __init__ and just use the functional API (F.conv2d) in the forward method. Would that work at all? Sure! Since you’ve properly registered the filter weights in __init__, they will be trained as long as they are used somewhere in the computation during your forward pass.
As you can see, these are somewhat advanced use cases, and I wouldn’t say they break any kind of PyTorch semantics. Using the functional API is totally fine for advanced use cases. I would not recommend using the functional API for every layer from now on, though. It’s much easier to use nn.Modules in most use cases.
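
A minimal sketch of that idea, with the filter weights stored as nn.Parameter and applied through F.conv2d. The class name, shapes, initialization, and the single SGD step are assumptions chosen only to show that the weights get trained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionalConv(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered parameters: visible in .parameters() and state_dict().
        self.weight = nn.Parameter(torch.randn(6, 3, 3, 3) * 0.1)
        self.bias = nn.Parameter(torch.zeros(6))

    def forward(self, x):
        # self.weight is directly accessible here for inspection or changes.
        return F.relu(F.conv2d(x, self.weight, self.bias, stride=1, padding=1))

model = FunctionalConv()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

before = model.weight.detach().clone()
loss = model(torch.randn(2, 3, 8, 8)).pow(2).mean()
loss.backward()
optimizer.step()  # the registered weights are updated like any layer's
```

If the weight tensor were instead created inside forward, it would not be registered on the module and the optimizer would never see it.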

Have a look at the implementations of torchvision.models.
You will see all kinds of different coding styles depending on the complexity of the problem. While simpler models might be implemented using some nn.Sequential blocks in __init__ and just calling them in forward (e.g. AlexNet), other models will be implemented in a different manner using more functional calls (e.g. Inception, since it’s a bit more complicated to split and merge the activations as well as getting the aux loss).
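
For reference, the nn.Sequential style looks roughly like the toy model below. This is not the torchvision code; `SmallSequentialNet` and its sizes are hypothetical and only mirror the general pattern:

```python
import torch
import torch.nn as nn

class SmallSequentialNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # All layers, including the stateless ones, are declared up front.
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(8 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)      # 8x8 input -> 8 channels of 4x4
        x = torch.flatten(x, 1)   # keep the batch dimension
        return self.classifier(x)

model = SmallSequentialNet()
logits = model(torch.randn(2, 3, 8, 8))
```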

To sum it up, my two cents are: use whatever feels good and easy for you. Although there are some “standard” approaches for some use cases, I have to say that even after working with PyTorch for a while now, I probably change one coding style or another every few weeks (because I suddenly have the feeling the code logic is easier to follow using the new approach). If you ask 10 devs to implement Inception, you’ll probably get 10 different but all beautiful and useful implementations.

(torcher) #7

Thank you for the time and effort in providing a great answer. Now that I understand your advice more deeply I think I gained a lot more intuition into PyTorch. Much appreciated.