A more elegant way of creating nets in PyTorch?

PyTorch lets you create nets in a more dynamic way. With the current code, you need to define the sub-modules or their weights in the __init__ call of the class and use them later in the forward call. This creates two separate pieces of code that have to be kept in sync whenever the structure of the net changes. Ideally, I want a single line of code that both declares all the needed structures and uses them for computation. This would be easy to implement by creating wrappers around the existing classes, and the new code would be easier to read and manage. The idea is very simple, and the three classes below illustrate it:

# The standard way of doing things now in pytorch examples:
class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.net1 = nn.Linear(100, 200)
        self.net2 = nn.Linear(200, 1)
        self.sigmoid = nn.Sigmoid()
        self.ReLU = nn.ReLU(inplace=False)
        self.drop = nn.Dropout(0.5)
    def forward(self, V):
        return self.sigmoid(self.net2(self.drop(self.ReLU(self.net1(V))))).squeeze()

What I don't like about the standard approach is that whenever I change the net, I have to manually adjust the __init__ call to match the forward call, keeping track of layer sizes and structures, which is inconvenient and error-prone for bigger networks.

So, I create my nets in a more dynamic way, where both the structure and the computation are defined in the same place, the forward call, while a sample of data is passed to the __init__ call. I then call forward from __init__, so the net is created with layers that match the input sizes, or with more complex sub-modules:

class DynamicTestNet(nn.Module):
    def __init__(self, datasample):
        super(DynamicTestNet, self).__init__()
        self.ml = nn.ModuleList()
        self.Make = True
        self.forward(datasample)
        self.Make = False

    def forward(self, V):
        # note: this whole net-creation block could be hidden from the user with small API changes
        self.mlindex = 0
        if self.Make:  # called from __init__: create the net based on the sizes of the data sample and intermediate results
            self.ml.append(nn.Linear(V.size(1), 200))
            self.ml.append(nn.ReLU(inplace=False))
            self.ml.append(nn.Dropout(0.5))
        net1 = self.ml[self.mlindex]
        self.mlindex += 1
        ReLU = self.ml[self.mlindex]
        self.mlindex += 1
        drop = self.ml[self.mlindex]
        self.mlindex += 1
        # end of net-creation block

        result = drop(ReLU(net1(V)))

        # another sub-net creation block that could be hidden
        if self.Make:
            self.ml.append(nn.Linear(result.size(1), 1))
            self.ml.append(nn.Sigmoid())
        net2 = self.ml[self.mlindex]
        self.mlindex += 1
        sigmoid = self.ml[self.mlindex]
        self.mlindex += 1
        # end

        return sigmoid(net2(result)).squeeze()

While it looks like this second, dynamic net uses more code, it is actually a more convenient and manageable way of doing things for big nets that change their structure often and need to adapt to new types of data.

The idea: we could remove most of this extra code if we had another set of subclasses/wrappers of the nn classes, call it “autonet”, that does the net memorization/creation/auto-sizing for us and removes the need for the extra code in the DynamicTestNet class. Its __init__ call would take a new parameter, AutoCreate:

import torch.autonet as ann

class DynamicTestNet(ann.Module):
    def __init__(self, datasample):
        # AutoCreate just calls forward with the datasample and remembers all of the
        # nets created in a hidden ml list, as illustrated in the class above
        super(DynamicTestNet, self).__init__(datasample, AutoCreate=True)

    def forward(self, V):
        return ann.Sigmoid(ann.Linear(ann.Dropout(0.5, ann.ReLU(ann.Linear(V, 200))), 1)).squeeze()

Notice that we now have far fewer lines than in the original TestNet class. When forward is called from the __init__ call, it would create the hidden ModuleList ml as in the dynamic example above, initialize/create each sub-module of the net so that it matches the output size of the previous layer, and remember them in the internal ml list automatically. In subsequent forward calls it would just use the ann.* sub-modules created before from the hidden ml list.
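To make this a bit more concrete, here is a minimal sketch of how such a wrapper could be built on top of today's nn.Module, using the same ModuleList replay trick as DynamicTestNet above. The names AutoModule, auto_linear and AutoTestNet are made up for illustration, this is not a real or proposed PyTorch API, and dropout is left out for brevity:

import torch
import torch.nn as nn

class AutoModule(nn.Module):
    def __init__(self, datasample):
        super().__init__()
        self.ml = nn.ModuleList()
        self.making = True        # first pass: create and record the sub-modules
        self.forward(datasample)
        self.making = False

    def _auto(self, factory):
        # create the sub-module on the first pass, replay it by index afterwards
        if self.making:
            self.ml.append(factory())
        module = self.ml[self.mlindex]
        self.mlindex += 1
        return module

    def auto_linear(self, x, out_features):
        # the input size is taken from the data, only the output size is specified
        return self._auto(lambda: nn.Linear(x.size(1), out_features))(x)

class AutoTestNet(AutoModule):
    def forward(self, V):
        self.mlindex = 0  # replay the recorded modules in the same fixed order
        h = torch.relu(self.auto_linear(V, 200))
        return torch.sigmoid(self.auto_linear(h, 1)).squeeze()

net = AutoTestNet(torch.randn(16, 100))  # layer sizes are inferred from the sample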

I think it could make PyTorch nets even more readable and easier to manage, while further strengthening the main PyTorch advantage: removing the decoupling of the declaration and computation parts that is present in TensorFlow, Theano, and other frameworks. PyTorch wins in my use cases because it produces much more readable and debuggable code that runs at the same speed as, or faster than, the compiled/decoupled frameworks.


You might not be aware of the functional versions of the basic modules. For example, instead of the nn.Linear class you use in your example, you could use the functional variant, which requires no initialisation step. You just call it when you need it, with data and weights.

output = torch.nn.functional.linear(input, weight, bias)

Thank you for the suggestion. With the functional variant you would still need to declare and keep the weights somewhere in the class separately. If the weights are declared in the __init__ call, we still have the decoupling of declaration and computation, which is inconvenient in large code bases. If I declared the weight initialization inside forward, it would slightly reduce the number of lines needed for initialization/declaration compared to my code above, but not eliminate them. Ideally I want no extra code for declaring and using a network, as in my last “autonet” example above, where a single line of code does all of the declaration and computation.
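For illustration, here is a rough sketch of what declaring the weights inside forward could look like with the functional API; the attribute names w1/b1 are made up, and parameters created this lazily must of course exist before the optimizer is constructed:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LazyFunctionalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = None  # created lazily on the first forward pass

    def forward(self, V):
        if self.w1 is None:
            # the sizes come from the data, but the declaration code is still here
            self.w1 = nn.Parameter(torch.randn(200, V.size(1)) * 0.01)
            self.b1 = nn.Parameter(torch.zeros(200))
        return F.relu(F.linear(V, self.w1, self.b1))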

Inheriting from nn.Sequential is a nice way to write less of the forward pass yourself.

class TestModule(nn.Sequential):

    def __init__(self):
        super().__init__()
        self.add_module("net1", nn.Linear(100, 200))
        self.add_module("ReLU", nn.ReLU())
        self.add_module("drop", nn.Dropout(0.5))
        self.add_module("net2", nn.Linear(200, 1))
        self.add_module("sigmoid", nn.Sigmoid())

    def forward(self, V):
        return super().forward(V).squeeze()

The modules added via self.add_module are called by super().forward in the order they are added.

You can then reduce this even more by creating higher level wrappers of common NN functions:

class TestModule(nn.Sequential):

    def __init__(self):
        super().__init__()
        self.add_module(
            "layer1", my_nn.Linear(100, 200, activation=nn.ReLU(), p=0.5)
        )
        self.add_module("layer2", my_nn.Linear(200, 1, activation=nn.Sigmoid()))

    def forward(self, V):
        return super().forward(V).squeeze()
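The my_nn package above is hypothetical; a minimal sketch of such a higher-level wrapper, here called LinearBlock, could be a small nn.Sequential subclass:

import torch.nn as nn

class LinearBlock(nn.Sequential):
    # bundles a linear layer with an optional activation and dropout
    def __init__(self, in_features, out_features, activation=None, p=0.0):
        layers = [nn.Linear(in_features, out_features)]
        if activation is not None:
            layers.append(activation)
        if p > 0:
            layers.append(nn.Dropout(p))
        super().__init__(*layers)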

Thank you, this does reduce the number of lines needed to write a class with sequential processing. However, it does not eliminate the decoupling of the declaration of the network in __init__ from its invocation in forward, with the potential for mismatches and code-management issues in larger, more complex networks whose structure may depend on data samples. For example, if “layer1” were a more complex non-sequential network with a graph structure produced by a non-trivial Python function, its output size would be hard to guess or hard-code for “layer2” without actually running “layer1” on the data first; “layer2” therefore cannot be declared with a static input size of 200, because you do not know that size before running “layer1”.

Also, debugging this code would be harder, because you would not be able to examine the intermediate results of a forward call that invokes the whole stack of sub-modules. Ideally I want to be able to declare the whole network in the forward call without having to write a bunch of if statements, and to be able to examine all of the intermediate results line by line in forward.

Do you mean nets that must adapt their structure to the input data after they have been trained?

@deepbrain your proposal looks pretty clean and nice, but it’s limited to settings where structure is not dynamic, and to settings where input size is fixed (otherwise Linear layers’ dimensions are not fixed). I like the proposal quite a bit, let me discuss this with the other devs.


These are the use cases I want to improve, and it seems I am not alone (see, for example, this recent post):

  1. The network is fairly complex, so to know and specify the input dimensions of layer N+1 you really need to pass a sample of data through layers 1…N. This happens before training, at the time the network structure is specified/created.

  2. You have several training data sets with different properties/dimensions and don't want to write separate code for each of them, but rather create the net from a sample of the data. This also happens before training.

  3. You want to make the code in the __init__ and forward calls more elegant and manageable by completely eliminating the separation between the declaration of the network and its actual invocation. PyTorch is already the best framework in this regard compared to TensorFlow, Theano and others, where the code is simply much harder to understand and debug because of the huge separation between declaration and actual computation.

While it's certainly an interesting proposal, my concern is that it's pretty much impossible to identify which modules someone wants to reuse. I'd imagine the rule for re-using auto-created modules would be something like: keep a dict of buckets where keys are module types and values are lists of instantiated modules; upon the i-th use of some kind of module (let's say Linear), go to the Linear instance list, pick the i-th module and use it to compute the function. There might be a better way that I didn't think of; that's just what I assume in the remaining part of this post (a rough sketch of this bucketing rule follows the two cases below). Now, consider these two cases:

  1. “Stochastic depth”
def forward(self, x):
    if bool(random.randrange(2)):
        y = ann.Linear(ann.Linear(x, 200), 200)
    else:
        y = ann.Linear(x, 200)
    return ann.Linear(y, 200)

In this case, taking the first branch will use Linear instances 0 and 1 to compute y, and instance 2 to compute the result. If the second branch is taken, instance 0 will be used to compute y and the result will be computed using instance 1. I can imagine myself writing such code, and that certainly wouldn't be my intention.

  2. Reusing a module

Since modules are identified by their use, it's impossible to use the same module twice (which is quite important in many models). This could be alleviated by naming the modules, but I feel that this makes the API unnatural compared to just using self.my_module(x).
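For concreteness, here is a rough sketch of the bucketing rule described above; it is purely illustrative and not a real API:

import collections

class AutoRegistry:
    # auto-created modules are grouped by type and matched to call sites
    # purely by the order in which they are used
    def __init__(self):
        self.buckets = collections.defaultdict(list)  # module type -> created instances
        self.counters = collections.defaultdict(int)  # module type -> uses so far this pass
        self.recording = True                         # the first pass creates the modules

    def get(self, kind, factory):
        # e.g. get(nn.Linear, lambda: nn.Linear(x.size(1), 200)) at every ann.Linear call site
        i = self.counters[kind]
        self.counters[kind] += 1
        if self.recording:
            self.buckets[kind].append(factory())
        return self.buckets[kind][i]  # the i-th use picks the i-th recorded instance

    def start_pass(self):
        self.counters.clear()

With control flow as in the “Stochastic depth” case, the i-th Linear of one pass is not necessarily the i-th Linear of the next pass, so a different set of weights gets silently picked up, which is exactly the problem described above.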


Thanks for the proposal and let me know what you think!

@smth, yes, the structure would be fixed during training; I think that covers most use cases (see my post above). In cases where users want something more dynamic, they would either need to use the current classes with if statements, or a dynamic network consisting of several sub-classes where the structure is fixed within each sub-class but the switching between sub-classes happens at the top level.

@apaszke, I was not very clear in explaining how it might work. The idea is to create a network in exactly the same way we do now, but in the forward call instead of the __init__ call. When we create the network in the forward call, we remember the exact and static order of computation and the modules used in a ModuleList. The order of module invocation does not change during training, and all items in the ModuleList remain the same. You can reuse any sub-module and optionally share its weights (this could be specified as an additional parameter) to point to the same or a different instance in the ModuleList from an ann module. So each instance of an ann module is basically an integer index pointing to an instance in the static ModuleList, and that index does not change during training. This integer is used internally just like the names of modules that are declared in the __init__ call now. As long as the order of computation does not change, just as the JIT graph does not change during training, the ModuleList lets you reference the sub-modules line by line in the forward call and allows easy debugging during training.
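For example, the optional weight-sharing parameter mentioned above could look something like this; the argument name "share" is made up purely for illustration:

h1 = ann.Linear(x, 200, share='encoder')
h2 = ann.Linear(y, 200, share='encoder')  # resolves to the same ModuleList entry as h1, so the weights are shared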

It is just a simple idea; I think there is room for improvement that would probably give even more expressive power, and possibly something more dynamic without the need to pre-declare the structures.

A few random thoughts…

  1. If the module order and layer sizes are fixed, then the only uncertainty is the input size, since if we know how many units there are in layer1, we know what the input size of layer2 should be. Keras only requires you to specify the input size of the first layer; the rest are inferred at model creation time. That said, there will be cases where you might want a layer whose size is a function of its input size, for example a layer with input_size*10 units.

  2. nn.LSTM can initialize its own hidden state at runtime, but you can’t specify a custom initialiser function. Maybe we could do something like that for the weight arrays.

  3. Stochastic depth could be implemented with a random skip class that consumes a class from the ModuleList at each run but chooses at random whether to pass the input data straight through or to pass it through the module. A form of layer dropout, if you like (a sketch follows below).
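A minimal sketch of such a skip wrapper; the class name RandomSkip is made up, and it assumes the wrapped module preserves the input shape:

import random
import torch.nn as nn

class RandomSkip(nn.Module):
    # with probability p the wrapped module is bypassed entirely (a form of layer dropout)
    def __init__(self, module, p=0.5):
        super().__init__()
        self.module = module
        self.p = p

    def forward(self, x):
        if self.training and random.random() < self.p:
            return x               # skip: pass the input straight through
        return self.module(x)      # otherwise run the wrapped module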

Reusing a module with no weight sharing would be simple - just define a function:

def my_module(x):
    ...  # build and apply the sub-net here with ann.* calls; each call site gets its own instances
If weight sharing is required then, as Adam said, it would be unnatural to pass an ID; instead we could use something like this when we define a module with shared weights:

@torch.ann.shared('my_module')
def my_module(x):
    ...  # same body as above, but all call sites reuse the weights created on the first call

where @torch.ann.shared would do code instrumentation similar to what the JIT does now, to save and restore the mlindex variable from the previous invocation of 'my_module'. The 'my_module' parameter could default to a line number obtained from Python's inspect module.
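A very rough sketch of what such a decorator could do, reusing the ml/mlindex/Make bookkeeping from the DynamicTestNet example above and assuming the decorated function takes the owning net as its first argument; none of this is a real torch API:

import functools

_shared_slots = {}  # shared block name -> first mlindex that block occupied

def shared(name):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(net, x):
            if name not in _shared_slots:
                _shared_slots[name] = net.mlindex   # first (owning) call site: remember its slots
                return fn(net, x)
            if net.mlindex == _shared_slots[name]:
                return fn(net, x)                   # the owning call site on later passes: run as usual
            # any other call site: rewind to the shared slots, replay them, then restore the counter
            saved_index, saved_make = net.mlindex, net.Make
            net.mlindex, net.Make = _shared_slots[name], False
            out = fn(net, x)
            net.mlindex, net.Make = saved_index, saved_make
            return out
        return wrapper
    return decorate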

Dynamic and stochastic nets should probably not be handled by ann. To avoid bad cases like the one Adam presented in the “Stochastic depth” example, we could set an ann.debug flag, which would check the execution context and raise errors like this:

import inspect

# look up the mlindex expected at this exact source location; findIndex here stands
# for a hypothetical lookup keyed by file name and line number
previous_frame = inspect.currentframe().f_back
frame = inspect.getframeinfo(previous_frame)
assert mlindex == findIndex(frame.filename, frame.lineno)

Though I could imagine that, with a bit more effort, code similar to the above could search for the appropriate mlindex based on the execution context and, if it is not found, dynamically create the missing instance in the ModuleList. So we could handle dynamic nets too, at the small expense of executing a few extra Python lines inside the ann module.

-Art

@deepbrain I think it's a terrible, TensorFlow-like idea, where later you will be “calling modules by name” from some abstract global table of possible module names. Big no-no. Currently the modules you can use are declared in the constructor, and that's way cleaner.

Simply put, these phases ARE different: creating variables is not the same as performing computations, and we should keep it this way.

I partially agree.

The net structure should be declared in __init__, but it would be nice if some of the sizing details could be fixed during the first forward run.

Agree with that. The most elegant convenience feature would be to fixate nn.Modules after an auto-trace.
AFAIK tracing is already working well for ONNX export, isn't it? Auto-tracing should work with not-yet-dimensioned nn.Modules (with the usual caveats regarding stochastic control flow).