[solved] Why is torch.Tensor._version 1 for class attributes (e.g. self.X)?

While debugging an in-place operation problem with autograd, I ran into something like the code below.


import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, *args):
        super(MyModel, self).__init__()
        self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.X = torch.ones(200).long().to(self.device)

    def forward(self, *args1):
        print(self.X._version)  # prints out 1 not 0
        print(torch.ones(1).long().to(self.device)._version)  # prints out 0

        # what's happening?

1. Why is this happening?
2. Is it good practice (or even desirable) to keep torch.Tensor._version unchanged during the forward pass? For example, if it starts at 1, should it stay at 1 until loss.backward() is called in train.py?


Why do you worry about the version being changed?
I would say it’s most likely changed when doing the weight initialization?

@albanD You mean that _version changing during the forward pass rarely happens in most situations, right?

Actually, I want to use something similar to tf.placeholder in my model for performance reasons, and the slicing and indexing I use to change its values are bumping torch.Tensor._version.

Reason for using a placeholder even with PyTorch
The problem I’m dealing with is a seq2seq-based language model, but it attaches random variables to each token, and their intermediate vector representations are used at every timestep, so I cannot just run an RNN with packed_sequence or anything similar.

Thus, I’m running it token by token, but in minibatches of examples (a batch of tokens at each step). So my model.py looks something like this:

def __init__(self):
    # init placeholder
    self.predicted_tokens_placeholder = torch.zeros(batchsize, maxlen, vocabsize).to(device)

def forward(self, *args):
    # updating placeholder with predicted token
    ph = self.predicted_tokens_placeholder.clone()
    ph[:, tstep] = b_tokens_predicted
    # Other self.* variables (with requires_grad=True) are used here similarly,
    # even without .clone(), to carry intermediate representations.
    # Indexing into self.* variables like the example above happens a lot.
    # Oops, those are in-place ops --> the autograd problem occurs.
    return ph

And I found out (just today) that to make autograd work correctly, I need to avoid in-place operations. That is why I wonder whether torch.Tensor._version should stay intact during the forward pass in general.


_version changes whenever you do an in-place operation on a Tensor.
There is a good reason for that: changing that Tensor’s value could lead to wrong results from autograd.
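As a rough illustration of the point above, the sketch below reads the version counter before and after a few operations. Note that `_version` is a private attribute, so this is only meant for debugging; the variable names are made up for the example.

```python
import torch

t = torch.zeros(3)
v_start = t._version

u = t + 1                  # out-of-place: allocates a new tensor, t is untouched
v_after_add = t._version

t.add_(1)                  # in-place: bumps t's version counter
v_after_inplace = t._version

t[0] = 5.0                 # indexed assignment is also an in-place write
v_after_setitem = t._version

print(v_start, v_after_add, v_after_inplace, v_after_setitem)
```

The counter stays the same across the out-of-place `t + 1` and increases after each in-place write, which is what autograd checks when it needs a saved tensor during backward.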

Why can’t you append the predicted tokens to a Python list and then cat them at the end?

Thanks. I think I wrote the code without considering autograd’s in-place restrictions (right now it looks more like NumPy or TF than torch). I hope I don’t need to redesign the code and can just replace the in-place ops with torch-supported equivalents for now.

Yes it is a bit tricky.
To be clearer the change I propose is the following, hope this is easy enough to do in your code.

# before
placeholder = torch.zeros(1000, 5, 5)
for idx in range(1000):
    some_tensor = torch.rand(5, 5)
    placeholder[idx] = some_tensor  # in-place write into the preallocated tensor
return placeholder

# after
results = []
for idx in range(1000):
    results.append(torch.rand(5, 5))
return torch.stack(results, 0)

Note that because of the custom allocator used by PyTorch, the second version won’t be significantly slower than the first, even if you create the big tensor only once and reuse it in the first case!
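To make the list-and-stack pattern concrete for a case where gradients are needed, here is a minimal sketch. The function `run_steps` and the parameter `w` are hypothetical stand-ins for a per-timestep computation, not code from the model discussed in this thread.

```python
import torch

def run_steps(w, num_steps):
    """Collect one output per step in a list, then stack once at the end."""
    outputs = []
    for step in range(num_steps):
        # Each step creates a fresh tensor; nothing is written in place.
        outputs.append(w * (step + 1))
    return torch.stack(outputs, dim=0)  # shape: (num_steps, w.numel())

w = torch.ones(4, requires_grad=True)
stacked = run_steps(w, 3)
stacked.sum().backward()
print(stacked.shape, w.grad)
```

Because no tensor is modified after it is created, backward runs without version-counter complaints, and each element of `w` accumulates the contribution of every step.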

Thanks @albanD, I didn’t know there was a custom allocator working behind the scenes.

In my case, assuming the code executes line by line even for GPU computation, I just used placeholder.data[:, position] = some_tensor instead of placeholder[:, position] = some_tensor. That let me circumvent the autograd problem.

I know I still need to check that autograd works as expected, but I learned something new from your comments! Thanks a lot!

Do not use .data! It breaks autograd!
What .data was used for before is now done with .detach() or with torch.no_grad().
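A small sketch of why this matters, using sigmoid because its backward pass reuses its own output. The tensors `a`, `b`, `out`, and `out_b` are made up for the example. Writing in place through .detach() is still seen by the version counter, so backward raises an error; writing through .data is not seen, so backward runs and silently returns a wrong gradient.

```python
import torch

# Case 1: in-place write through .detach(). The detached tensor shares the
# version counter with `out`, so autograd notices the change and raises.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.detach().zero_()
detach_raised = False
try:
    out.sum().backward()
except RuntimeError:
    detach_raised = True

# Case 2: the same write through .data. The change is invisible to autograd,
# so backward succeeds but uses the zeroed output to compute the gradient.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out_b = b.sigmoid()
out_b.data.zero_()
out_b.sum().backward()

print(detach_raised, b.grad)
```

The correct gradient of sigmoid at these inputs is nonzero, so an all-zero `b.grad` in the second case is exactly the kind of silent error the warning above is about.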

OMG @albanD, thanks a lot! I will follow that. Thank you so much again!
