Should custom Dataset always return tensors?

I have an integer to be returned, along with some other stuff. Should I just return the integer, or return something like torch.LongTensor(num)?

If you are returning something that requires grad, you shouldn't convert it, since it will be used for backprop. But if that value is just for visualization, you can convert it after the backprop step.

This is data, so it is neither a gradient nor something for visualization. The data will be used in the forward pass of the model.

Also, if it's data for the forward pass, you should keep it as a Tensor: backprop needs grad, and PyTorch's backprop operations are only defined for Tensors.
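
For instance, a quick illustration of that point (the numbers here are arbitrary):

```python
import torch

# Autograd only tracks Tensors: a Tensor with requires_grad=True records the
# operations applied to it and can call backward(); a plain Python number cannot.
w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3 - 1) ** 2
loss.backward()
print(w.grad)  # tensor(30.)
```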

Have you tried returning the integer itself? It will also be converted to a tensor by the default collate_fn…
The question then is whether we want to convert it explicitly ourselves.
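
For example (a toy sketch; the dataset here is made up, and I'm assuming the DataLoader's default collate_fn):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(4, 3)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        length = idx + 1               # a plain Python int
        return self.data[idx], length

x, lengths = next(iter(DataLoader(ToyDataset(), batch_size=2)))
print(type(lengths), lengths.dtype)    # <class 'torch.Tensor'> torch.int64
```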


I have an integer to be returned, along with some other stuff.

Why do you want to return this integer? What does this integer contain? Is it the model output? Is the model input type Tensor?

Have you ever tried to return an integer itself?

Yes. If you try to backprop with that returned int, it will throw an error because it doesn't have grad.

That will also be converted to a tensor

Why are you converting back and forth between int and Tensor?

Please give more details so that we can help you.

Let's say you want to build an RNN model: you will need the padded sequence and its original length (that's where the integer comes from). I had no issue when simply returning an integer, so I am not sure why you are getting the error.
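
For reference, that length is exactly what pack_padded_sequence expects; a minimal sketch (shapes and values are made up):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

padded = torch.randn(2, 4, 8)    # (batch, max_len, feature)
lengths = torch.tensor([4, 2])   # original lengths before padding

# `packed` can be fed to an nn.RNN / nn.LSTM so the padded steps are ignored
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
```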

I had no issue when simply returning an integer, so I am not sure why you are getting the error.

Maybe this post will make the backprop problem I mentioned clearer.

I did not run into this issue. I think we were making the problem more complicated than it needs to be, so let me rephrase. The fundamental question is: when writing a custom Dataset, what is the best way to return a variable-length sequence together with its original length?

For example, it would be something like

input: ["some string", "some other string" ...., "final string"], with a max length max_len
embedding mapping: {"some": 0, "string": 1, ...}
task: return embedded sequences, with their original lengths if doing padding.
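
In code, something along these lines is what I have in mind (the vocabulary, class names, and shapes are made up); the open question is whether the length should stay a plain int:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

vocab = {"some": 0, "string": 1, "other": 2, "final": 3}

class TextDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx].split()
        indices = torch.tensor([vocab[t] for t in tokens], dtype=torch.long)
        return indices, len(indices)          # length returned as a plain int

def collate(batch):
    seqs, lengths = zip(*batch)
    padded = pad_sequence(list(seqs), batch_first=True)   # pad to the batch max length
    return padded, torch.tensor(lengths)                  # lengths become a tensor here

loader = DataLoader(
    TextDataset(["some string", "some other string", "final string"]),
    batch_size=3,
    collate_fn=collate,
)
padded, lengths = next(iter(loader))
print(padded.shape, lengths)   # torch.Size([3, 3]) tensor([2, 3, 2])
```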

Hopefully this is clearer.