Is it Possible for PyTorch to Provide Convolution Functions for Pure Tensors (Not Variables)? Important for Training Inference-Based Unsupervised Learning Models

Here is my code:

import torch

filters = torch.randn(8,4,3,3).cuda()
inputs = torch.randn(1,4,5,5).cuda()
torch.nn.functional.conv2d(inputs, filters, padding=1)

Error: TypeError: argument 0 is not a Variable.

I don’t want to wrap my tensors in autograd.Variable and have each computation treated as a graph node, even if it is a very thin wrapper, since I might loop each layer 1000 times. Why is a Variable necessary for the convolution operation?


@richard @smth

It seems there is no way in PyTorch to do convolution without using the Variable class, which is a little unfortunate. I’m going to set the flag requires_grad=False. Is there any big overhead if I loop a convolution and a deconvolution 1000 times to solve the coefficients for each layer, with 10 layers in total? My understanding is that PyTorch treats each operation as a node, so there would be 1000 nodes per layer. Maybe I’m wrong. I hope PyTorch doesn’t treat this 10-layer network as a 10,000-layer one…

After all, I just need to loop the convolution and transpose convolution to solve an ISTA optimization problem for each layer. Only the convolution results matter, and I will take care of the gradients myself for each layer.

Further, is it possible for PyTorch to provide a set of convolution and transpose convolution functions for pure tensors (not Variables), so that people working on inference-based models don’t have to worry about autograd and graph nodes, which are not very appropriate concepts in these models? This would be much friendlier for developing inference-based unsupervised learning methods, e.g. Matthew Zeiler’s adaptive deconvolutional networks. The convolution functions for tensors seem quite easy to provide. Or, if the current PyTorch mechanism is already sufficient, I feel it deserves a slightly more detailed description of how to implement inference-based layers without introducing big overhead.

Hi,
From the next release onward, Tensors and Variables are going to be merged into a single object.
If you want to run a forward pass through a model (or just a convolution here) with minimal overhead when you will not use backward, you should create your Variables with volatile=True.
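For example, to run the convolution from the first post without recording any history (a rough sketch using the current pre-0.4 API):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

# volatile=True tells autograd not to record any history for these computations
filters = Variable(torch.randn(8, 4, 3, 3).cuda(), volatile=True)
inputs = Variable(torch.randn(1, 4, 5, 5).cuda(), volatile=True)
out = F.conv2d(inputs, filters, padding=1)  # out.size() is (1, 8, 5, 5)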


From the next release onward, Tensors and Variables are going to be merged into a single object.

Is there any reason for this? When will the next release come out?

The reason is that the difference between Variable and Tensor creates a lot of misunderstanding and errors for newcomers.
I don’t have the exact time for the release, but this change is mostly done in the master branch already.

Oh, sorry to keep pestering you, but wouldn’t that require changing the docs and all the main tutorials on the PyTorch website? I would like a brief overview of the change, if that isn’t too much to ask.

Yes, that is going to be a major change, and a lot of documentation will be heavily simplified (this will be done at the same time as the release itself).
I am not actively working on it, so I just have a high-level point of view:
Breaking changes are going to be kept to a minimum, so current code should keep working as is.
If you look at the current PRs on the GitHub repo, a lot of them are working on this change.
The main idea is that Variables are going to disappear from the Python world. Everything is going to be a Tensor, or in nn a Parameter. The rest of the code is going to work as before; you won’t need to wrap things in Variables anymore.
For more details, I would advise looking at the current PRs being made and asking here again so that people who know the details better can answer.


Thanks for the comments!

If I create my Variables with the following code, can I assume this is effectively a clean tensor in the current version?

import torch
from torch.autograd import Variable

filters = Variable(torch.randn(8,4,3,3).cuda(), requires_grad=False, volatile=True)
x = Variable(torch.randn(2,4,5,5).cuda(), requires_grad=False, volatile=True)

I would like to loop the following computation without creating an increasingly large subgraph:

for j in xrange(100000):
    y = torch.nn.functional.conv2d(x, filters, padding=1)
    x = torch.nn.functional.conv_transpose2d(y, filters, padding = 1)

I checked the GPU memory and it does not seem to grow while the code is running, so I guess it’s doing what I want. But I just want to double-check, since a lot of these operations will be used in my code. Also, in an inference-based model there usually is no ‘forward’ direction, since these models follow an analysis-by-inference principle. I hope the ‘forward’ in the previous reply doesn’t mean that the optimization is unrolled into many layers; I guess it won’t be in the above setting, but can anyone confirm this?

In TensorFlow this can be done by using a Parameter to build a one-layer graph and then using Python control flow to loop the computation, though implementing a fully inference-based model is still quite hard in TensorFlow. But since PyTorch builds the graph dynamically, I’m a little confused about how to achieve this safely.

Comment: Putting Variable under the autograd package is apparently a choice made with the pre-assumption that PyTorch will focus on error-BP-based models. This assumption makes PyTorch a little narrow-minded. If PyTorch can provide clean convolutions for Tensors, all the people working on inference-based vision models (RBMs or general graphical models, adaptive deconvNets, sparse coding, etc.) will probably benefit from it. Further, it will make PyTorch look much more general (as also a mathematical library), and yet this step is so easy to make.

It’s not a clean tensor. It’s a Variable that doesn’t track history. But it basically satisfies what you need.

Moreover, since you are using the functional interface, if your input and filters don’t require grad, then no history will be tracked anyway.
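A quick sanity check along those lines (just an illustrative snippet on the current API):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(1, 4, 5, 5), requires_grad=False)
w = Variable(torch.randn(8, 4, 3, 3), requires_grad=False)
y = F.conv2d(x, w, padding=1)
print(y.requires_grad)  # False: the result carries no history to backprop through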

I’m not sure what you mean by optimization being unrolled into many layers. Optimization (you probably mean model optimization) is not even applied here. There is no history, no gradients, only forward results.

Why does dynamic graph imply being not safe?

You have some valid points, but it’s not really that we are narrow-minded. It was a design choice to separate things that track history for BP (which became Variable) from things that don’t (which became Tensor). And the former naturally lives in the .autograd namespace. Due to popularity reasons and code structure, things like conv layers are only directly supported on Variables (you can also make them work directly on tensors with a bit of work). Furthermore, the volatile=True option (you don’t need requires_grad=False if you set volatile=True) already gives you an experience very similar to working directly on tensors. Moreover, as @albanD mentioned above, we have merged the two classes together. So I don’t really get the reason for this complaint.


Thanks a lot for this very detailed explanation, Simon! @SimonW

I’m not sure what you mean by optimization being unrolled into many layers. Optimization (you probably mean model optimization) is not even applied here. There is no history, no gradients, only forward results.

‘Optimization’ was a typo; I actually meant computation. For inference-based models, e.g. the adaptive deconvNet, the coefficients in each layer are computed by solving an optimization problem, which requires many iterations of convolution and transpose convolution. The coefficients are then sent to the next layer. In BP-based networks, the coefficients are obtained by a single convolution, which is the forward pass. My previous concern was that if I manually implement the inference optimization, similar to the for loop I provided earlier, the underlying PyTorch implementation might build a graph or other structures. Building such a graph for an inference optimization is like unrolling the optimization into many forward layers. According to what you said, it seems that in this case (requires_grad=False or volatile=True) PyTorch won’t do anything heavy.
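To make the per-layer inference concrete, here is roughly the kind of loop I mean, written as an ISTA-style sketch (the step size, penalty weight, and tensor sizes are made up for illustration, and it uses the volatile=True trick discussed above):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

def soft_threshold(z, thr):
    # proximal operator of the L1 penalty
    return torch.sign(z) * torch.clamp(torch.abs(z) - thr, min=0)

x = Variable(torch.randn(1, 4, 32, 32).cuda(), volatile=True)   # input to this layer
W = Variable(torch.randn(8, 4, 3, 3).cuda(), volatile=True)     # layer filters
z = Variable(torch.zeros(1, 8, 32, 32).cuda(), volatile=True)   # feature maps to infer
step, lam = 0.1, 0.05

for it in xrange(1000):
    # gradient step on the reconstruction term, followed by shrinkage
    residual = F.conv_transpose2d(z, W, padding=1) - x
    z = soft_threshold(z - step * F.conv2d(residual, W, padding=1), step * lam)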

Why does dynamic graph imply being not safe?

As in the explanation above, by ‘safe’ I really mean that there are no additional expensive behaviors beyond the tensor computation itself.

You have some valid points, but it’s not really that we are narrow-minded. It was a design choice to separate things that track history for BP (which became Variable) from things that don’t (which became Tensor). And the former naturally lives in the .autograd namespace.

Separating Variable and Tensor was great! That simplifies a lot of code migration from CPU NumPy code to GPU Tensor code. This was one of the reasons a few peers switched from TensorFlow to PyTorch. Another big reason is that this forum is awesome!

Due to popularity reasons and code structure, things like conv layers are only directly supported on Variables (you can also make them work directly on tensors with a bit of work).

That’s why I said ‘yet this step is so easy to make’. Actually, I was thinking about writing a set of convolution functionals for Tensors; if you can give me some hints or pointers, that would be great! When you use a Variable from the autograd package to build an inference-based model, it generally makes people worry about the graph or other overhead. I actually discussed this with a few other users of PyTorch and TensorFlow. Our general consensus is that this design makes things much less straightforward when you try to build an inference model. And overall, only Tensors and convolutions (for Tensors) are needed.

Furthermore, the volatile=True option (you don’t need requires_grad=False if you set volatile=True) already gives you an experience very similar to working directly on tensors.

Okay, then I will use this while waiting for the next version.

Moreover, as @albanD mentioned above, we have merged the two classes together.

I didn’t completely get this. Does that mean that when they are merged, there will be a set of convolution functions provided for clean tensors? If not, I would like to start implementing these functions, though I might need some minimal instructions.

So I don’t really get the reason for this complaint.

Maybe ‘narrow-minded’ was a bit too strong. (We actually discussed what the right word was to express the feeling. It’s really just a wish that PyTorch becomes a better framework for theoretical modeling development.) We actually think PyTorch did a great job on the Tensor part. If a set of convolution functions is provided for clean Tensors, the framework will be in much better shape to support many inference-based models. If my complaint was too strong, I apologize, since PyTorch really did a good job.

Hi,

I think the main worry you have with using Variables everywhere is the overhead it could imply compared to using pure Tensors directly? This has been looked into in detail, and in the current master branch the overhead of using a Variable (with requires_grad=True or torch.no_grad()) is negligible. So you should not worry too much about the fact that you use Variables (or the merged version soon :smiley: ).
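Concretely, on the current master branch (roughly what the merged API will look like; details may still change before the release), your loop becomes something like:

import torch
import torch.nn.functional as F

filters = torch.randn(8, 4, 3, 3).cuda()
x = torch.randn(1, 4, 5, 5).cuda()

with torch.no_grad():  # nothing inside this block records history
    for j in range(1000):
        y = F.conv2d(x, filters, padding=1)
        x = F.conv_transpose2d(y, filters, padding=1)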
I guess you want to look at the new Tensor not only as an n-d array, but as an n-d array that can optionally keep track of operations (so that you can use it with an autograd engine).
Did I miss another worry you had?

I think the main worry you have with using Variables everywhere is the overhead it could imply compared to using pure Tensors directly? This has been looked into in detail, and in the current master branch the overhead of using a Variable (with requires_grad=True or torch.no_grad()) is negligible.

Actually, I was only worried about the requires_grad=False case, since the inference-based models I’m working on only need tensors and convolutions. I just want to make sure that when I turn off the requires_grad option, I can treat the Variables as tensors and loop as many times as I want without thinking about additional structures. Maybe “premature optimization is the root of all evil.” :upside_down_face:

Did I miss another worry you had?

Thanks a lot for all of these comments and explanations. Thanks to @SimonW, too. I think I have a much better idea and turning off the gradient is a solution for now. The new release of PyTorch sounds really exciting!


Hey,

I know this thread is already solved, but I stumbled here and thought this page in the PyTorch docs belonged here. I am actually quite surprised nobody has posted it as a solution already!

It clearly explains the role of Variable and how you can disable the tracking of history with volatile=True.
Always wanted to be able to tell someone to RTFM. :wink:

Have a great day!

@0phoff

Thanks for posting this link here. That page in the docs helps, but it doesn’t automatically solve my initial question. Propagation of the flags wasn’t my concern, since all of my Variables will have the gradient flag off. My initial question was different from what is described in the docs, and it’s a common problem when using nearly all of the popular deep learning frameworks to implement an inference-based model, since these frameworks were developed with a bias towards error-BP-based models (for an obvious reason after AlexNet). In TensorFlow we need other hacks to get around this.

But inference-based models are still important, given that many key innovations in even error-BP-based deep learning models came from them. These two kinds of models are deeply related beyond some superficial differences. My post was also a suggestion to build better support for inference models. If you work on these models, I think you already know what I’m talking about. If not, Matthew Zeiler’s adaptive deconvNet is one of my favorite papers on this track, in case you are interested.

The current design of PyTorch is not bad at all. (Except that it’s a little confusing that running GPU convolution on Tensors requires a Variable from the autograd module.) Turning off the gradient seems to have solved the problem nicely, and my early implementation went well. However, the new release sounds a lot more straightforward.

Hey,

I guess I must have misinterpreted your question then.
I thought you were looking to implement convolutional networks that work without saving the computations needed for the BP algorithm (i.e. executing them on tensors and not Variables).

The documentation states that if you use volatile=True it won’t save the computational graph needed for BP, which is basically the same as executing the convolution on tensors…

Anyway, thanks for the paper. It’s always nice to read about adjacent fields of research to broaden my own understanding of ML. :slight_smile:
