Convert int into one-hot format

Response777 · February 15, 2017, 7:52am

Hi all.

I’m trying to convert the y labels in mnist data into one-hot format.

Since I’m not quite familiar with PyTorch yet, for each iteration I just convert y to a numpy array, build the one-hot encoding there, and then convert it back to a PyTorch tensor. Like this:

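import numpy as np
import torch

for batch_idx, (x, y) in enumerate(train_loader):
    y_onehot = y.numpy()
    # Compare every label against 0..num_labels-1 to get one row per label
    y_onehot = (np.arange(num_labels) == y_onehot[:,None]).astype(np.float32)
    y_onehot = torch.from_numpy(y_onehot)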

However, I notice that it gets slower each iteration, and I suspect that this code, which may allocate new memory on every iteration, is what makes it slower.

So my question is: is there a more idiomatic PyTorch way that would help me avoid this conversion?

Thanks!

moskomule · February 15, 2017, 7:59am

Hi, it depends on your loss function, but some of PyTorch’s loss functions take class labels as their targets (e.g. NLLLoss). So if you use them, you don’t need to convert the targets into one-hot vectors.
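
As a rough illustration of the point above, here is a minimal sketch that feeds integer class labels straight into NLLLoss and CrossEntropyLoss. The tensor shapes and values are made up for the example, and it is written against a recent PyTorch release rather than the version discussed in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model outputs: a batch of 4 samples, 10 classes
logits = torch.randn(4, 10)
# Integer class labels (dtype int64); no one-hot conversion needed
targets = torch.tensor([3, 0, 9, 1])

# NLLLoss expects log-probabilities as input
log_probs = torch.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```

Both losses should agree here, since CrossEntropyLoss combines log-softmax and NLLLoss.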

Response777 · February 15, 2017, 8:02am

Thanks for the advice. I have seen that solution in the examples branch. However, I use y as an input, so that doesn’t solve my case.

albanD · February 15, 2017, 10:58am

Hi,

You can use the scatter_ method to achieve this.
I would also advise creating the y_onehot tensor once and then just filling it:

import torch

batch_size = 5
nb_digits = 10
# Dummy input that HAS to be 2D for the scatter (you can use view(-1,1) if needed)
y = torch.LongTensor(batch_size,1).random_() % nb_digits
# One hot encoding buffer that you create out of the loop and just keep reusing
y_onehot = torch.FloatTensor(batch_size, nb_digits)

# In your for loop
y_onehot.zero_()
y_onehot.scatter_(1, y, 1)

print(y)
print(y_onehot)
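
For readers on a later PyTorch release, the same zero-then-scatter idea can be wrapped in a small helper and checked against torch.nn.functional.one_hot, which was added to PyTorch after this thread was written. The helper name one_hot_scatter is hypothetical, and this is only a sketch under those assumptions.

```python
import torch
import torch.nn.functional as F

def one_hot_scatter(labels, num_classes):
    """Hypothetical helper: one-hot encode a 1D LongTensor via scatter_."""
    labels = labels.view(-1, 1)                       # index must be 2D here
    out = torch.zeros(labels.size(0), num_classes)    # zero-filled float buffer
    out.scatter_(1, labels, 1.0)                      # write a 1 at each label's column
    return out

y = torch.tensor([2, 0, 4])
a = one_hot_scatter(y, 5)
# Built-in in newer releases; it returns an integer tensor, hence .float()
b = F.one_hot(y, num_classes=5).float()

print(a)
print(torch.equal(a, b))
```

The helper allocates a new buffer on each call; inside a training loop, the reuse pattern shown above avoids that allocation.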

Response777 · February 15, 2017, 3:00pm

Thanks, that is exactly what I need!

Nadav_Bhonker · February 22, 2017, 10:11am

Isn’t there a more efficient way to input a “sparse Tensor” or a vector of indices into the network (specifically RNNs)?
I guess something similar to torch’s sparse linear (only for RNNs).

Response777 · February 23, 2017, 2:50am

Yes, I was using the one_hot_encoding layer in TensorFlow, and it seems that there is currently no equivalent in PyTorch.

apaszke · February 24, 2017, 9:23pm

@Nadav_Bhonker we’re working on adding more and more support for sparse operations, but our fastest RNN backend (i.e. cuDNN) doesn’t support sparse inputs anyway. I’d recommend using Embedding for that.
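
A minimal sketch of that suggestion, assuming a recent PyTorch release: integer indices go through nn.Embedding and the resulting dense vectors feed an RNN, so no one-hot tensor is ever built. The sizes, the choice of nn.LSTM, and the variable names are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for the example
vocab_size, emb_dim, hidden_size = 10, 8, 16

embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)

# A batch of 2 sequences of length 3, given as integer indices
idx = torch.tensor([[1, 2, 3], [4, 5, 6]])

dense = embed(idx)          # shape (2, 3, emb_dim): one learned vector per index
out, (h, c) = rnn(dense)    # out has shape (2, 3, hidden_size)

print(dense.shape, out.shape)
```

Looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix, without materializing the one-hot vector.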

ncullen93 · February 27, 2017, 6:52pm

Just a note (from my understanding… maybe it doesn’t apply in this case): it is currently advised NOT to follow this approach of creating the variable once and filling it each time (see How to use Batch normalization in testing model).

Also, for future readers, just to reiterate what user moskomule says: the cross entropy and negative log-likelihood losses in PyTorch do NOT require one-hot encodings, so you can just use the normal target vector.

apaszke · February 27, 2017, 8:45pm

@ncullen93 the whole thread was about converting tensors I think, so it doesn’t apply there. But both of your statements are correct and should be followed. Thanks

rajarsheem · March 5, 2017, 8:58am

How about overriding the default nn.Embedding weights with torch.eye?

emb = nn.Embedding(10, 10) 
emb.weight.data = torch.eye(10)

Done! Now, pass your batch containing indices to it.
emb(Variable(torch.LongTensor([[1, 2], [3, 4]])))
will give output as:
Variable containing:
(0 ,.,.) =
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0

(1 ,.,.) =
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0

apaszke · March 5, 2017, 10:24pm

@rajarsheem that’s not a very good idea if your vector dimensionality is large. You’ll end up storing a huge weight matrix in memory, and in your code emb.weight requires gradient and it might be subject to optimization if you don’t take care.

Additionally, zeroing a buffer and scattering a few ones will be much faster than copying whole rows, of which most values are 0 anyway.
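
If the identity-embedding trick is used anyway for a small number of classes, one way to keep the weight out of optimization is sketched below. It assumes a recent PyTorch release, where nn.Embedding.from_pretrained with freeze=True is available; the memory concern for large dimensionality still applies, since the full identity matrix is stored.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10  # small on purpose: the stored matrix is num_classes x num_classes

# freeze=True sets requires_grad=False on the weight, so an optimizer
# that filters on requires_grad will not update it
emb = nn.Embedding.from_pretrained(torch.eye(num_classes), freeze=True)

idx = torch.tensor([[1, 2], [3, 4]])
onehot = emb(idx)  # shape (2, 2, 10)

print(emb.weight.requires_grad)
print(onehot[0, 0])
```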

rogetrullo · March 21, 2017, 2:52pm

Hi @albanD and @apaszke, I was trying to use the scatter function, but I am running into some trouble.
In my case I have something like this:

batch_size = 10
y = torch.LongTensor(batch_size, 5, 5).random_() % 3  # 3 classes, 5x5 img
y_onehot = torch.FloatTensor(batch_size, 3, 5, 5)  # I want the one-hot going through the channel dim
y_onehot.zero_()
ones = torch.ones(y.size())

y_onehot.scatter_(1, y, ones)

However, it gives me the following error
Index tensor must have same dimensions as output tensor at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensorMath.c:450

Could you help me with this? Thanks!

rogetrullo · March 21, 2017, 3:22pm

oh never mind, I just found that it works if I add a singleton dimension so that y and y_onehot have the same NUMBER of dimensions…
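
A minimal sketch of that fix, assuming a recent PyTorch release: unsqueeze(1) adds the singleton channel dimension to y so that the index and the output have the same number of dimensions. The sizes mirror the example above; torch.randint and torch.zeros are used here in place of the older constructors.

```python
import torch

batch_size, num_classes = 10, 3

# Integer label map, shape (batch, 5, 5), values in [0, num_classes)
y = torch.randint(0, num_classes, (batch_size, 5, 5))

# Output buffer with the one-hot axis on the channel dimension
y_onehot = torch.zeros(batch_size, num_classes, 5, 5)

# y.unsqueeze(1) has shape (batch, 1, 5, 5): same number of dims as y_onehot
y_onehot.scatter_(1, y.unsqueeze(1), 1.0)

print(y_onehot.shape)
```

Each pixel then has exactly one channel set to 1, and argmax over the channel dimension recovers the original labels.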

zeng · May 10, 2017, 2:28pm

It seems that y can’t be a Variable or a cuda.FloatTensor.
How do I solve this TypeError?

albanD · May 10, 2017, 2:31pm

There is no reason for y to be a Variable here.
And since y is used to index in a tensor, it needs to have a proper indexing type: LongTensor.
So if you use cuda, y should be a cuda.LongTensor, not a cuda.FloatTensor.

zeng · May 10, 2017, 2:33pm

Sorry, I wrote that wrong; it doesn’t work with a cuda.LongTensor either.

albanD · May 10, 2017, 2:34pm

is y_onehot a cuda tensor?

zeng · May 10, 2017, 2:38pm

Thank you, it was my mistake. I overlooked that y_onehot was not a cuda tensor.
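
To summarize the exchange above as a sketch: the index must be an integer (long) tensor, and the index and the output buffer must live on the same device. The example assumes a recent PyTorch release and falls back to the CPU when no GPU is available; the values are made up.

```python
import torch

# Use the GPU when present, otherwise run the same code on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Integer labels (int64), already shaped (batch, 1), created on the target device
y = torch.tensor([[1], [3], [0]], device=device)

# The buffer must be on the same device as the index
y_onehot = torch.zeros(3, 4, device=device)
y_onehot.scatter_(1, y, 1.0)

print(y_onehot)
```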

kakrafoon · October 9, 2017, 9:43pm

Thank you! I got my cvae implemented with this tip.