Some confusions regarding PyTorch


I have a few points of confusion where I am not understanding PyTorch.

  1. What needs to be wrapped in a Variable? Why do inputs and outputs have to be wrapped in a Variable? Is the wrapping still needed when using nn.Modules?

  2. When using nn.Modules, we can just pass input tensors to the model and use it, right? Is there still anything we need to do to ensure that optimization and backprop work?

  3. LSTM implementations do not have any state, right (tutorial by @apaszke)? Their state is maintained by passing the hidden state as an input. So, to use an LSTM I need to create a hidden variable which will be passed as input to the LSTM. Why does the hidden state have to be wrapped in a Variable? What steps do I need to take to maintain the hidden state, i.e. does it need to be initialized or detached? I have seen many examples where they initialize and detach hidden variables, and I am not getting why it needs to be initialized (isn’t zero init already done in PyTorch?) and why detaching is done. When should detaching be done?

  4. Before each new training sample, why do we set the gradients to zero?

  5. nn.Module’s LSTM has an API where it takes a 3D tensor. How does it accept the hidden state input? How do I define and use the hidden variable for the LSTM in an nn.Module?

  6. I am trying to understand batching. Suppose I want to use a dataset of size ~100 MB. I can load the dataset into memory, but I am getting a message from PyTorch along the lines of: you are trying to allocate 6 GB of RAM, buy more RAM. I assume this happens because the PyTorch model has to pass data around, and hence even a small dataset of ~100 MB can take up a large amount of memory? How do I resolve this issue? How do I do batching in this case, and why would it solve the issue?

Sorry if the questions seem too naive. Links to any relevant resources are welcome!!


I’ll try to answer some questions, but feel free to dig deeper, if anything is still unclear.

  1. Since PyTorch 0.4.0 you don’t need to use Variable anymore. It was used to define tensors which require gradients, i.e. for which autograd can automatically calculate gradients. In the current stable and preview releases you can just use tensors directly. There are some special cases, e.g. where you require gradients for the input. In these cases, just pass requires_grad=True to the constructor of the tensor.
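A minimal sketch of the point above: plain tensors carry autograd state directly, so no Variable wrapper is needed; `requires_grad=True` is only passed when you want gradients for the tensor itself.

```python
import torch

# Since PyTorch 0.4.0, a plain tensor can take part in autograd directly.
x = torch.randn(3, requires_grad=True)  # we want dL/dx for this input
y = (x * 2).sum()                       # build the graph with PyTorch ops
y.backward()                            # autograd fills x.grad

print(x.grad)  # each element's gradient is 2.0, since dL/dx_i = 2
```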

  2. The gradient calculation and optimization can be performed as long as you use PyTorch functions throughout the forward pass. If you call any numpy functions etc. you would need to implement the backward function yourself. So, in the usual use case, your claim is right. You just need to pass the tensors and it’ll work.

  3. Regarding the Variable question, see 1. If you detach a tensor, you make sure that autograd won’t calculate the gradients beyond this tensor. This is useful if you want to calculate the gradients for a certain number of (time) steps, but don’t want to go further than e.g. 10 steps into the past.
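A sketch of this truncated-backprop pattern, assuming a toy nn.LSTM with made-up sizes: detaching the hidden state at each chunk boundary stops backward from reaching into earlier chunks.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8)
hidden = None  # nn.LSTM zero-initializes the state when you pass None

for chunk in range(3):  # e.g. consecutive chunks of one long sequence
    x = torch.randn(10, 1, 4)  # (seq_len, batch, feature)
    out, hidden = lstm(x, hidden)
    loss = out.sum()
    loss.backward()
    # Detach so the next backward pass stops at this chunk boundary
    # instead of backpropagating through all previous chunks.
    hidden = tuple(h.detach() for h in hidden)
```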

  4. Gradients are accumulated in PyTorch. If you don’t zero out the gradients after each optimizer step, they would be accumulated and potentially grow. You can use this behavior to e.g. simulate larger batch sizes by accumulating the gradients over a certain number of smaller batches.
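A sketch of both sides of this point, with a made-up model and `accum_steps`: gradients add up across `backward()` calls, so `zero_grad()` resets them after each step, and skipping the reset for a few batches simulates a larger batch.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # simulate a 4x larger batch

for i in range(8):
    x = torch.randn(2, 4)
    # Scale the loss so the accumulated gradient matches the large-batch mean.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()  # reset, otherwise the next batches add on top
```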

  5. I think it might be worth looking into this tutorial.

  6. If you pass the data to your model, the intermediate tensors will be calculated and stored, as they are needed for the backward pass. You could use a smaller batch size if you run out of memory, or use torch.utils.checkpoint to trade compute for memory.
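A sketch of the checkpointing option with a toy block (sizes are made up; `use_reentrant=False` is the mode recommended in recent PyTorch versions): the activations inside the checkpointed block are not stored, but recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)

# Intermediate activations inside `block` are recomputed on backward
# instead of being kept in memory during the forward pass.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```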


Thanks a lot for the detailed reply!!

In 3, I am still a bit confused as to why I need to separately maintain the hidden tensor. Isn’t the LSTM module supposed to do that, since it takes the whole sequence as input (the 3D tensor: seq_len, batch_size, feature_size)?

So, in what use case (I suspect a single LSTM cell) should an LSTM be passed a hidden variable? Is it the same LSTM as nn.Module’s, or a different API?

Also, in 6, say I have a large tensor for my entire input. How do I construct batches from that input? Does PyTorch have any APIs to do that?


You might want, e.g., to pass the hidden state from one sequence to the next. The tutorial gives you a nice introduction and shows how to handle the hidden state.
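A sketch of how nn.LSTM accepts the hidden state, with made-up sizes: the state is a tuple `(h_0, c_0)` of shape `(num_layers, batch, hidden_size)`, it defaults to zeros when omitted, and you can feed the returned state into the next call to carry it across sequences.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=2)
x1 = torch.randn(5, 3, 4)  # (seq_len, batch, feature)
x2 = torch.randn(5, 3, 4)  # the following sequence

out1, (h, c) = lstm(x1)        # hidden state defaults to zeros when omitted
out2, _ = lstm(x2, (h, c))     # carry the state over to the next sequence
```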

You might want to use a Dataset and DataLoader to create batches with the option of using multiple workers. Have a look at this tutorial for more information.
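A minimal sketch of this with random stand-in data: TensorDataset wraps in-memory tensors, and DataLoader slices them into shuffled batches so only one batch at a time flows through the model.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

data = torch.randn(1000, 4)            # stand-in for your large input tensor
targets = torch.randint(0, 2, (1000,))

dataset = TensorDataset(data, targets)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for x, y in loader:
    # x has shape (32, 4), except possibly the last, smaller batch
    pass
```

Passing `num_workers=N` to the DataLoader loads batches in background worker processes.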
