I wanted to add two data tensors to the optimizer (data_words and data_phrases), but only data_phrases is optimized, not data_words. Is there any mistake in the way I added them to the optimizer, as in the sample code below (from inside the training loop of a softmax model)?
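The original snippet is not shown here, but as a hedged reconstruction, adding two independent tensors to one optimizer might look like the following sketch (the names match the discussion; the shapes, learning rate, and use of SGD are assumptions):

```python
import torch

# Hypothetical shapes; both tensors are leaves that require gradients.
data_words = torch.randn(100, 16, requires_grad=True)
data_phrases = torch.randn(50, 16, requires_grad=True)

# One tensor at construction time, the second via add_param_group,
# mirroring the pattern described in the question.
optimizer = torch.optim.SGD([data_words], lr=0.1)
optimizer.add_param_group({"params": data_phrases})

print(len(optimizer.param_groups))  # two parameter groups registered
```

Registering a tensor this way only tells the optimizer to apply updates from its `.grad`; it does not by itself cause any gradient to be computed for it.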
I see your point, but I guess we can simply calculate the gradient of any tensor, right? It does not need to be directly involved in calculating the loss, right?
To check this point further, I inspected the .grad attribute of the tensors data_phrases and data_words and found that the gradient of data_phrases is updated, while that of data_words is always None.
I guess the matter is related to a problem in passing the tensor data_words to the optimizer. For some reason only data_phrases is passed correctly to the optimizer, but not data_words. To check further, instead of passing the two tensors (data_phrases and data_words) one after the other, I passed only data_words alone, and in that case it worked well and I saw that gradients were calculated.
So my question turns out to be: how can I pass multiple tensors together to the optimizer? I have tried many times, but it does not work.
> I see your point, but I guess we can simply calculate the gradient of any tensor, right? It does not need to be directly involved in calculating the loss, right?
Well, if the tensor is not involved in the loss computation, its gradient will be zeros everywhere (or None, which amounts to the same thing).
So to get meaningful gradients, you need to actually use the tensor in the loss; otherwise the gradients will always be 0 and you won't be able to learn anything.
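The point above can be demonstrated with a minimal sketch: two leaf tensors, only one of which participates in the loss. The tensor names here are illustrative, not from the original code.

```python
import torch

used = torch.randn(4, requires_grad=True)    # participates in the loss
unused = torch.randn(4, requires_grad=True)  # registered nowhere in the graph

loss = (used ** 2).sum()  # only `used` contributes
loss.backward()

print(used.grad is None)    # False: a gradient was computed
print(unused.grad is None)  # True: autograd never touched it
```

An optimizer stepping over `unused` would be a no-op, since `optimizer.step()` skips parameters whose `.grad` is None.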
Actually, I computed data_phrases from data_words, but that happened outside the training loop. What I intended to do is pass to the optimizer the same batch of data_words that produced data_phrases, so the relationship is indirect: the optimizer would update data_phrases (directly used to calculate the loss) and data_words (indirectly related).
The reason I passed the tensor data_words into the training loop, even though it is not directly involved in the softmax model, is that I needed to add it to the same optimizer, so I did it this way.
Does it still not make sense?
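One concrete problem with computing data_phrases from data_words once, outside the loop, is that the graph connecting them is freed after the first backward pass, so the indirect relationship cannot drive updates across iterations. A sketch (the word-to-phrase mapping here is a made-up placeholder):

```python
import torch

data_words = torch.randn(6, 4, requires_grad=True)
# Phrases derived once, OUTSIDE the training loop (hypothetical mapping).
data_phrases = data_words[:3] * data_words[3:]

loss = (data_phrases ** 2).sum()
loss.backward()                      # first backward: grads do reach data_words
print(data_words.grad is not None)   # True

loss2 = (data_phrases ** 2).sum()
try:
    loss2.backward()                 # the words->phrases graph was already freed
    second_ok = True
except RuntimeError:
    second_ok = False
print(second_ok)                     # False: cannot backward a second time
```

This is why the derivation normally has to be repeated inside the loop: each iteration needs a fresh graph from data_words to the loss.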
I see your point, but then why, if I pass only the tensor data_words to the optimizer with self.optimizer.add_param_group({"params": data_words}), do I find that the gradients are calculated successfully (and are not None)?
Most likely because you also removed other code, such as the nn.Parameter() wrapping for data_phrases, which disconnects data_phrases from data_words.
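The suspected disconnect can be sketched as follows: wrapping a derived value in nn.Parameter creates a brand-new leaf tensor, so the autograd graph back to data_words is severed and data_words gets no gradient. (Names and the mean-pooling step are illustrative assumptions.)

```python
import torch
import torch.nn as nn

data_words = torch.randn(3, 4, requires_grad=True)
phrases_from_graph = data_words.mean(dim=0)               # still connected to data_words
data_phrases = nn.Parameter(phrases_from_graph.detach())  # new leaf: connection is cut

loss = (data_phrases ** 2).sum()
loss.backward()
print(data_words.grad is None)       # True: the nn.Parameter wrapping severed the graph
print(data_phrases.grad is not None) # True: only the new leaf receives gradients
```

So the loss can happily train data_phrases while leaving data_words untouched, which matches the symptom described above.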
> Actually, I computed data_phrases from data_words, but that happened outside the training loop.
If data_phrases is computed from data_words, why should it be in the optimizer at all? Shouldn't you just update data_words and recompute data_phrases every time data_words is updated?
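That suggestion can be sketched as follows: only data_words goes into the optimizer, and data_phrases is recomputed inside the loop so the loss stays connected to data_words. The word-to-phrase mapping and shapes below are placeholder assumptions.

```python
import torch

data_words = torch.randn(10, 8, requires_grad=True)
optimizer = torch.optim.SGD([data_words], lr=0.1)  # only the words are optimized
targets = torch.randn(5, 8)

for step in range(3):
    optimizer.zero_grad()
    # Recompute phrases from words EVERY iteration (hypothetical mapping:
    # a phrase is the average of two word vectors).
    data_phrases = (data_words[:5] + data_words[5:]) / 2
    loss = ((data_phrases - targets) ** 2).mean()
    loss.backward()
    optimizer.step()

print(data_words.grad is not None)  # True: gradients flow back to the words
```

Because data_phrases is rebuilt from the freshly updated data_words each step, it never needs to be a parameter itself.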
I could not understand this point ("Most likely because you also removed other code, such as the nn.Parameter() wrapping for data_phrases, which disconnects data_phrases from data_words."). I don't remove anything; it is exactly the same code, but instead of passing:
only data_words is updated and the .grad of data_phrases is None, unlike the first case. That made me doubt these two lines and how they are passed to the optimizer. I feel confused!
Actually, it is possible to ignore data_phrases and not pass it to the optimizer, as you mentioned. But the point is that the softmax is over phrases (not words), i.e., I could try to update only the tensor data_words and use the updated words to produce the phrases. However, the loss function in that case takes data_phrases as input, not data_words, which brings us back to your argument about optimizing a tensor not involved in the loss calculation.
I have done all this but am still unable to resolve the issue! The .grad of the words is still not calculated, so I need to pass the phrase and word tensors to the optimizer together.