You can just extract the common operation out of operation1 and operation2 and pass its result in as an input.
That would be the simplest (and most straightforward) thing you can do.
Let me clarify a couple of things.
Autograd accumulates gradients over multiple calls to backward().
Keep in mind that if you call a submodule twice, you are effectively creating a Siamese-like network: the weights are shared and both calls contribute to its gradients.
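A quick sketch of both points (the nn.Linear layer and the toy inputs below are just for illustration, not taken from your code):

import torch
import torch.nn as nn

shared = nn.Linear(4, 1)                  # one submodule used twice -> Siamese-like weight sharing
x1, x2 = torch.rand(4), torch.rand(4)

loss1 = shared(x1).sum()                  # first branch
loss2 = shared(x2).sum()                  # second branch, same weights

loss1.backward()
print(shared.weight.grad)                 # gradient from the first call only
loss2.backward()
print(shared.weight.grad)                 # gradients from both calls, summed (accumulation)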
You can do a trick like this:
import torch
import torch.nn as nn


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.flag = False

    def common_operation(self, x):
        print(f'Common operation done? {self.flag}')
        if self.flag:
            # Reuse the cached tensor and reset the flag for the next fresh input.
            self.flag = False
            return self.tmp
        else:
            # Compute and cache the result for the next call.
            self.tmp = x ** 2
            self.flag = True
            return self.tmp

    def operation1(self, x):
        tmp = self.common_operation(x)
        print(tmp)
        print(tmp._version)   # version counter stays 0: tmp is never modified in-place
        print(id(tmp))        # same object is reused by the second call
        val1 = x - tmp
        return val1

    def operation2(self, x):
        tmp = self.common_operation(x)
        print(tmp)
        print(tmp._version)
        print(id(tmp))
        val2 = tmp * x.mean()
        return val2
model = MyModel()
x1 = torch.rand(10).requires_grad_()
x2 = torch.rand(10).requires_grad_()

o1 = model.operation1(x1)   # computes and caches x1 ** 2
o2 = model.operation2(x2)   # reuses the cached tensor from the first call

s2 = o2.mean()
s1 = o1.mean()

# s1 and s2 share the cached sub-graph, so the first backward must retain it.
s1.backward(retain_graph=True)
print(f'X1 gradient {x1.grad} before X2')
s2.backward()
print(f'X1 gradients {x1.grad}')
print(f'X2 gradients {x2.grad}')
Common operation done? False
tensor([0.2605, 0.1031, 0.8378, 0.8229, 0.9469, 0.3439, 0.9053, 0.2633, 0.5918,
0.9358], grad_fn=<PowBackward0>)
0
139697179178544
Common operation done? True
tensor([0.2605, 0.1031, 0.8378, 0.8229, 0.9469, 0.3439, 0.9053, 0.2633, 0.5918,
0.9358], grad_fn=<PowBackward0>)
0
139697179178544
X1 gradient tensor([-0.0021, 0.0358, -0.0831, -0.0814, -0.0946, -0.0173, -0.0903, -0.0026,
-0.0539, -0.0935]) before X2
X1 gradients tensor([0.0572, 0.0731, 0.0232, 0.0239, 0.0184, 0.0508, 0.0202, 0.0570, 0.0355,
0.0189])
X2 gradients tensor([0.0601, 0.0601, 0.0601, 0.0601, 0.0601, 0.0601, 0.0601, 0.0601, 0.0601,
0.0601])
As you can see, the version of the tensor is the same and it is never modified; even the object ID stays the same.
However, the contributions to the gradients depend on both operations.
It really depends on what kind of gradients you are looking for.
Anyway, your case sounds like you should remove common_op from op1 and op2, and pass the result of common_op to op1 and op2 as an input.
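A minimal sketch of that refactor (the class name MyModelRefactored and the method signatures here are just illustrative, not from your code):

import torch
import torch.nn as nn


class MyModelRefactored(nn.Module):
    def common_operation(self, x):
        return x ** 2

    def operation1(self, x, common):      # receives the precomputed result
        return x - common

    def operation2(self, x, common):
        return common * x.mean()


model = MyModelRefactored()
x1 = torch.rand(10).requires_grad_()
x2 = torch.rand(10).requires_grad_()

common = model.common_operation(x1)       # computed exactly once
o1 = model.operation1(x1, common)
o2 = model.operation2(x2, common)

# One backward over the combined loss; no retain_graph needed.
(o1.mean() + o2.mean()).backward()
print(x1.grad)
print(x2.grad)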