Porting code from Theano/Lasagne to PyTorch

Is there anything in PyTorch similar to theano.function()?

import theano
x = theano.tensor.dscalar()
f = theano.function([x], 2*x)
f(4)
array(8.0)

>>> import torch
>>> def f(x):
...     return 2 * torch.DoubleTensor([x])
... 
>>> f(4)
tensor([8.], dtype=torch.float64)

@Tony-Y thank you for answering my question, but your answer is not the one I was expecting, maybe because I used too simple an example to present theano.function(). A more detailed explanation follows:

def function(inputs, outputs=None, mode=None, updates=None, …):
    """
    Return a callable object that will calculate outputs from inputs.

    Parameters
    ----------
    inputs : list of either Variable or In instances.
        Function parameters, these are not allowed to be shared variables.
    outputs : list or dict of Variables or Out instances.
        If it is a dict, the keys must be strings. Expressions to compute.
    mode : string or Mode instance.
        Compilation mode.
    updates : iterable over pairs (shared_variable, new_expression). List, tuple or OrderedDict.
        Updates the values for Shared Variable inputs according to these expressions.
    """
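For example, a typical call (a sketch from memory, not from my actual code) that uses inputs, a list of outputs, and an update of a shared variable looks roughly like this:

import numpy
import theano
import theano.tensor as T

x = T.dscalar()
w = theano.shared(numpy.array(1.0), name='w')  # a shared variable

# one compiled function: two outputs, plus a side-effect update of w
f = theano.function(inputs=[x],
                    outputs=[2 * x, x ** 2],
                    updates=[(w, w + x)])

print(f(3.0))         # [array(6.0), array(9.0)]
print(w.get_value())  # 4.0, because the update w := w + x ran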

>>> import torch
>>> import torch.nn as nn
>>> class F(nn.Module):
...   def __init__(self):
...     super(F, self).__init__()
...   def forward(self, x):
...     return 2 * x
... 
>>> f = F()
>>> a = torch.DoubleTensor([4])
>>> f(a)
tensor([8.], dtype=torch.float64)

@Tony-Y thank you again for the answer, but what you give is a static solution, while theano.function() is dynamic, in the following sense:

we have data and a list of expressions ['expression_01', 'expression_02', 'expression_03', …]

for example:

expression_01 = 2*x
expression_02 = 3^x
expression_03 = 3x^2 - 5

and when there is given the following:

x = theano.tensor.dscalar()
list_of_expression = ['expression_01']
f = theano.function([x], list_of_expression, …)
f(2)
[array(4.0)]

but in another scenario we have:

x = theano.tensor.dscalar()
list_of_expression = ['expression_01', 'expression_02', 'expression_03']
f = theano.function([x], list_of_expression, …)
f(2)
[array(4.0), array(9.0), array(7.0)]

This is how I understand theano.function(): you just adjust the 'list_of_expression' according to your needs and theano.function() does the rest for you dynamically. I hope it is a little bit clearer now.

>>> expressions = {
...   'expression_01': lambda x: 2*x,
...   'expression_02': lambda x: 3**x,
...   'expression_03': lambda x: 3*x**2 - 5}
>>> 
>>> import torch
>>> import torch.nn as nn
>>> class F(nn.Module):
...   def __init__(self, expression):
...     super(F, self).__init__()
...     self.expression = expression
...   def forward(self, x):
...     return self.expression(x)
...
>>> a = torch.DoubleTensor([4]) 
>>> f = F(expressions['expression_01'])
>>> f(a)
tensor([8.], dtype=torch.float64)
>>> f = F(expressions['expression_02'])
>>> f(a)
tensor([81.], dtype=torch.float64)
>>> f = F(expressions['expression_03'])
>>> f(a)
tensor([43.], dtype=torch.float64)

@Tony-Y thank you very much for your time and help. This last answer is the closest to what I was expecting. The only requirement not fulfilled is that the expressions have to be fed one by one and not en bloc, but I believe that could be managed somehow in the __init__ part with a 'for' loop, looping over all the expressions and assigning them in the following fashion:

for i, name in enumerate(list_of_expressions):
    setattr(self, name, expressions[name])

and then at the return part we have:

     return self.expression_0i(x)

I mean something like this in general. Do you think it is doable?
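Roughly, what I have in mind is something like the following (just an untested sketch, building on your expressions dict):

import torch
import torch.nn as nn

class F(nn.Module):
    def __init__(self, expressions):
        super(F, self).__init__()
        # keep every expression around as an attribute, like in the loop above
        self.names = list(expressions.keys())
        for name, expression in expressions.items():
            setattr(self, name, expression)
    def forward(self, x):
        # evaluate all stored expressions at once, like theano.function with a list of outputs
        return [getattr(self, name)(x) for name in self.names]

expressions = {
    'expression_01': lambda x: 2 * x,
    'expression_02': lambda x: 3 ** x,
    'expression_03': lambda x: 3 * x ** 2 - 5}
f = F(expressions)
print(f(torch.tensor([2.0])))  # [tensor([4.]), tensor([9.]), tensor([7.])]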

The core difference between PyTorch and Theano you’re wondering about here is that in Theano you create a symbolic graph that you then feed into function to have it compiled into a function you can call, while in PyTorch you write your calculation and PyTorch runs it as you write it.

Modules are decidedly only there to hold learnable parameters / state - see Jeremy Howard’s recently added tutorial.

Now you could assemble lines of Python and then eval them to form your function, but quite likely you’re not making the best use of PyTorch that way. One of the things people like about PyTorch is that you don’t have the create graph -> compile -> run workflow.
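To make the contrast concrete, this is roughly all there is to it on the PyTorch side (a minimal sketch with made-up numbers):

import torch

# no symbolic graph, no compile step: the ops run the moment the function is called
def f(x):
    return 2 * x + 3 * x ** 2

x = torch.tensor(2.0, requires_grad=True)
y = f(x)       # y == tensor(16.)
y.backward()   # gradients are available right afterwards
print(x.grad)  # tensor(14.), i.e. d/dx (2x + 3x^2) at x = 2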

Best regards

Thomas


@tom hi Thomas and thank you very much for reinforcing my knowledge about the differences between Theano/Lasagne and PyTorch.
I am an engineer and my main goal is to find a solution for my project. It is not that I am writing code from scratch in PyTorch; as you have noticed, my problem now is porting code from Theano/Lasagne to PyTorch. I am doing this because PyTorch is easier to debug and there is more support, and I am experiencing this myself even communicating with you right now. In this porting procedure I would prefer to change the original code as little as possible, just enough that I can debug it more easily.
I understand that the workflow philosophy behind PyTorch is different from that of Theano/Lasagne. To tell you the truth, the PyTorch workflow is the one I am used to, and it took me a while until I understood the workflow of Theano/Lasagne. Sometimes (I mean most of the time :-)) people do not have the luxury of being picky; they just have to go with the flow. In my case I cannot afford to object that something is not exactly PyTorch, or that it is half Theano and half PyTorch. I mean no offense to anyone who is stricter about crossing the borders between the two libraries. My main goal is to complete my project by any reasonable means possible.
I am trying to explain that I know this is not the best way to write code.
Please know that I very much appreciate the help and advice offered by all of you.
Cheers.
Ergnoor

Oh, sorry, I don’t want to give the impression of telling you how and how not to use PyTorch. It was my impression that maybe you were looking for something more elaborate, because the typical PyTorch transposition of that type of code can look suspiciously simple.
When I last did similar things, I tried to just write all the steps between the definition of the input variables (later specified in the Theano function call) and the output as one regular Python function using PyTorch arithmetic. This looks a lot like Tony’s first example (except that the DoubleTensor constructor is probably not a good idea and you’d just use x there). In a way this should be very similar to what Theano does, except that Python’s function declaration takes the place of Theano’s function call.
If there is code you find particularly difficult to translate, I’m sure we’ll try to help you out.

Best regards

Thomas

@tom Hi Thomas and thank you very much for your understanding.
Cheers.
Ergnoor

@tom Hi Thomas, I hope you are doing alright.

I have a couple of questions in regard to the porting of code.

  1. There is an optimizer that I have ported, and I would like to have your opinion on it.

the Theano/Lasagne version:

def geoSGD(loss_or_grads, params, learning_rate):
    """Geodesic Stochastic Gradient Descent (geoSGD) updates.
    Generates update expressions of the form:
    * param := param - learning_rate * gradient

    Parameters
    ----------
    loss_or_grads : symbolic expression or list of expressions
        A scalar loss expression, or a list of gradient expressions.
    params : list of shared variables
        The variables to generate update expressions for (in our case: hh_W_u and hh_W_v).
    learning_rate : float or symbolic scalar
        The learning rate controlling the size of update steps.

    Returns
    -------
    OrderedDict
        A dictionary mapping each parameter to its update expression.
    """
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    lr = learning_rate

    for param, grad in zip(params, grads):
        W = param.get_value(borrow=True)
        G = grad
        A = T.dot(G, W.T) - T.dot(W, G.T)  # A = G * W.T - W * G.T
        I = T.identity_like(A)
        cayley = T.dot(T.nlinalg.matrix_inverse(I + (lr / 2.) * A), I - (lr / 2.) * A)  # (I + eta/2 * A)**(-1) * (I - eta/2 * A)
        updates[param] = T.dot(cayley, W)  # new param = (I + eta/2 * A)**(-1) * (I - eta/2 * A) * W
    return updates
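(In formula form, if I read the code right, the update above is the Cayley step W ← (I + (lr/2)·A)⁻¹ · (I − (lr/2)·A) · W with A = G·Wᵀ − W·Gᵀ; applying it to an orthogonal W with a skew-symmetric A keeps W orthogonal, which is the point of the geodesic step.)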

my PyTorch version:

def geoSGD(outputs, params, learning_rate):
    # had to add 'retain_graph=True'
    grads_ = torch.autograd.grad(outputs, params, retain_graph=True, allow_unused=True)

    updates = OrderedDict()
    lr = learning_rate

    for param, grad_ in zip(params, grads_):
        W = param.double()
        G = grad_.double()
        A = torch.mm(G, W.transpose(0, 1)) - torch.mm(W, G.transpose(0, 1))  # A = G * W.T - W * G.T

        if torch.all(torch.eq(A.transpose(0, 1), -A)):
            if torch.sum(abs(A.transpose(0, 1) + A)) == 0:
                print('The matrix A is skew symmetric')
        else:
            print('The matrix A is NOT skew symmetric')

        I = torch.eye(A.size()[0], A.size()[1]).double()
        cayley = torch.mm(torch.inverse(I + (lr / 2.) * A), I - (lr / 2.) * A)  # (I + eta/2 * A)**(-1) * (I - eta/2 * A)
        updates[param] = torch.mm(cayley, W)  # new param = (I + eta/2 * A)**(-1) * (I - eta/2 * A) * W
    return updates

What I do not fully understand is why 'loss_or_grads' has to be replaced by 'outputs' and, assuming my porting is correct, how the grad() function in the PyTorch version knows which 'loss' to use for the calculations, and what exactly 'outputs' is in the PyTorch case.

  2. In this code, among other things, I have to deal with factorization of parameters and their gradients. Maybe I am wrong, but for example if I have a situation like this:
    W = U * S * V
    F(W) = A * W + B
    dF(W)/dU is not possible

It is only possible if I do it like the following:
F(U, S, V) = A * (U * S * V) + B
then
dF(U, S, V)/dU is possible

Is there any way to have dF/dW, dF/dU, dF/dS, dF/dV simultaneously? ('d' always means partial derivative here.)

Thank you very much in advance for your time, help and understanding.

Cheers.

Ergnoor

Hello Ergnoor,

great.

  1. Ah, are you implementing orthogonal / unitary RNN by chance?
    I did that once, but I didn’t put it in an optimizer.
    I would recommend looking at torch.optim.SGD to see how you would implement optimizers in PyTorch (I put a rough sketch of what such an optimizer could look like at the end of this post).
    If you want to do it manually, don’t calculate the grads_ but just take the param.grad from the params after someone called backward.
    So you’d drop grad_ = ..., then you replace
for param, grad_ in zip(params, grads_):
    W = param.double()
    G = grad_.double()
    ...
    updates[...] = ...

with

with torch.no_grad():  # we don't actually want autograd in the gradient step
    for param in params:
        W = param.double()
        G = param.grad.double()
        ...
        param.copy_(torch.mm(cayley, W))  # instead of updates[param] = ...
  2. I’m not sure why one would be possible but not the other. Torch will track your ops all right. You can pass multiple things to differentiate by to grad at the same time, or using backward will do the right thing, too. Note that you can only take derivatives of scalar functions. (There is a small sketch right below this list.)
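For illustration, a minimal sketch of getting all of those derivatives in one call (made-up shapes, and F is reduced to a scalar with sum so it can be differentiated):

import torch

# made-up 3x3 factors, just to show the mechanics
U = torch.randn(3, 3, requires_grad=True)
S = torch.randn(3, 3, requires_grad=True)
V = torch.randn(3, 3, requires_grad=True)
A = torch.randn(3, 3)
B = torch.randn(3, 3)

W = U @ S @ V          # intermediate result, still part of the autograd graph
F = (A @ W + B).sum()  # reduce to a scalar so it can be differentiated

# derivatives w.r.t. the intermediate W and all three factors, in one call
dW, dU, dS, dV = torch.autograd.grad(F, [W, U, S, V])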

Best regards

Thomas
P.S.: If you use triple backticks ``` before and after your code, you’ll get all your code formatted. That makes it much nicer to look at.
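P.P.S.: Since I mentioned torch.optim.SGD: a rough, untested sketch of how your Cayley update could be packaged as an Optimizer subclass might look like this (the name GeoSGD and the details are just for illustration, not a definitive implementation; it assumes the parameters are 2-d matrices):

import torch

class GeoSGD(torch.optim.Optimizer):
    """Sketch: geodesic SGD step via the Cayley transform (untested)."""

    def __init__(self, params, lr):
        super(GeoSGD, self).__init__(params, dict(lr=lr))

    def step(self, closure=None):
        with torch.no_grad():  # the update itself should not be tracked by autograd
            for group in self.param_groups:
                lr = group['lr']
                for p in group['params']:
                    if p.grad is None:
                        continue
                    W = p.double()
                    G = p.grad.double()
                    A = G @ W.t() - W @ G.t()  # A is skew-symmetric by construction
                    I = torch.eye(A.size(0), dtype=A.dtype, device=A.device)
                    cayley = torch.inverse(I + (lr / 2.) * A) @ (I - (lr / 2.) * A)
                    p.copy_(cayley @ W)  # in-place parameter update

# usage would be the usual pattern: loss.backward(), then optimizer.step()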

@tom Hi Thomas, and YES I am trying to implement Orthogonal / Unitary RNN.

  1. Wow, such an elegant solution. Only one thing: what do you mean by "after someone called backward"? Do you mean that before calling this optimizer the code should have already called backward, so that each param has its .grad calculated? And by the way, now I understand why a PyTorch optimizer only needs to be passed the params.
  2. I will discuss this point with you further because I am in the middle of some testing, but it is good to have your opinion, because now I at least know that there is a way to do it and I just have to search to find it.

I thank you very much again for your support.

Cheers.

Ergnoor

@tom Hi Thomas, I have another question. When I tried torch.nn.init.orthogonal_() like the following:


# Python code to check  
# whether a matrix is  
# orthogonal or not 
  
def isOrthogonalN(a, m, n) : 
    if (m != n) : 
        return False
      
    # Multiply A*A^t 
    for i in range(0, n) : 
        for j in range(0, n) :  
            sum = 0
            for k in range(0, n) : 
          
                # Since we are multiplying  
                # with transpose of itself. 
                # We use a[j][k] instead 
                # of a[k][j] 
                sum = sum + (a[i][k] *
                             a[j][k]) 
          
        if (i == j and sum != 1) : 
            return False
        if (i != j and sum != 0) : 
            return False
  
    return True

a = torch.empty(3, 3)
a=torch.nn.init.orthogonal_(a)

if (isOrthogonalN(a, len(a), len(a[0]))) :
    print ("Yes") 
else : 
    print ("No") ```

I got "No" as the answer.

When I tried with:

a = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]

the answer was "Yes".

What do you think I am doing wrong here? I did not write the isOrthogonalN code myself; it is from the GeeksforGeeks site.

Is there any function in PyTorch that I can use to check for orthogonality of matrices? 

Thank you in advance for your help.

Cheers.

Ergnoor

@tom sorry Thomas, the code I ran is the same as above, except that the initialization line reads

torch.nn.init.orthogonal_(a)

instead of a = torch.nn.init.orthogonal_(a).

I think you’re seeing numerical precision:

a = torch.empty(3, 3)
a = torch.nn.init.orthogonal_(a)
almost_eye = torch.mm(a, a.t())
print((almost_eye - torch.eye(3)).abs().max().item())

gives something < 1e-6 or so.
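If you want that as a reusable check, a small helper along these lines should do (just a sketch using torch.allclose, not a built-in PyTorch function):

import torch

def is_orthogonal(a, tol=1e-6):
    # True if a @ a.T equals the identity up to the given tolerance
    if a.size(0) != a.size(1):
        return False
    eye = torch.eye(a.size(0), dtype=a.dtype, device=a.device)
    return torch.allclose(torch.mm(a, a.t()), eye, atol=tol)

a = torch.nn.init.orthogonal_(torch.empty(3, 3))
print(is_orthogonal(a))  # True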

Yes, I think so too, because when I did this test:

print(torch.mm(a, a.t()))

I got the following output:

tensor([[ 1.0000e+00,  3.7639e-08, -7.4154e-08],
        [ 3.7639e-08,  1.0000e+00, -2.8114e-08],
        [-7.4154e-08, -2.8114e-08,  1.0000e+00]])

@tom I replaced the line

if (i != j and sum != 0) :

with

if (i != j and abs(sum) > 1e-6) :

and it seems to work alright.

Thank you very much Thomas.

Cheers.

Ergnoor