Nested list of variable length to a tensor

rustytnt · March 2, 2019, 1:32am

Hi all,
I am unable to convert my target variable which is a list of lists to a torch tensor. This is what it looks like:
target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]. In essence, each sublist is a token. I need the data in this form for the problem I am working on. I was able to pad the first list with zeros to the length of the longest list in my batch: [[[1,2,3], [2,4,5,6], 0], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]], but I am unable to convert this to a tensor. With the code below I get this error:

x = np.array(batchY)
max_length = max(len(row) for row in x)
x_padded = np.array([row + [0] * (max_length - len(row)) for row in x])
x_padded

TypeError: can’t convert np.ndarray of type numpy.object_. The only supported types are: double, float, float16, int64, int32, and uint8.

Any thoughts on how I can fix this?
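
A minimal sketch of why this error appears, assuming batchY holds the nested list from the post. Padding only the outer level leaves inner lists of different lengths, so NumPy can only store them as Python objects, and PyTorch cannot convert an object array. The explicit dtype=object is an assumption added here so the snippet also runs on recent NumPy releases, which refuse to build ragged arrays implicitly.

```python
import numpy as np
import torch

batchY = [[[1, 2, 3], [2, 4, 5, 6]],
          [[1, 2, 3], [2, 4, 5, 6], [2, 4, 6, 7, 8]]]

# Pad only the outer level, as in the post: the first sample gets a bare 0.
max_length = max(len(row) for row in batchY)
outer_padded = [row + [0] * (max_length - len(row)) for row in batchY]

# The inner lists still differ in length, so the array holds Python objects.
x_padded = np.array(outer_padded, dtype=object)
print(x_padded.dtype)  # object

# PyTorch has no tensor type for Python objects, hence the TypeError.
error_message = None
try:
    torch.from_numpy(x_padded)
except TypeError as err:
    error_message = str(err)
print(error_message)
```

The conversion only succeeds once every level of nesting is rectangular, which is what the replies below work towards.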

DoubtWang · March 2, 2019, 7:13am

>>> import torch
>>> target = [ [[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8,]]]
>>> torch.tensor(target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: expected sequence of length 3 at dim 2 (got 4)
>>> target = [ [[1,2,3, 0, 0], [2,4,5,6, 0], [0, 0, 0, 0, 0]], [[1,2,3, 0, 0], [2,4,5,6, 0], [2,4,6,7,8]]]
>>> torch.tensor(target)
tensor([[[ 1,  2,  3,  0,  0],
         [ 2,  4,  5,  6,  0],
         [ 0,  0,  0,  0,  0]],

        [[ 1,  2,  3,  0,  0],
         [ 2,  4,  5,  6,  0],
         [ 2,  4,  6,  7,  8]]])

There may be a better solution.

rustytnt · March 2, 2019, 7:30am

Hi @DoubtWang,
Thank you for your response! So I have to pad each inner list …despite it being a timestep…hmmm I’m going to give it a shot

rustytnt · March 3, 2019, 5:16am

@ptrblck any suggestions on how I can fix this issue?

ptrblck · March 3, 2019, 7:06am

This code might work:

target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)
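
The same padding can also be written as a small helper that preallocates the output tensor and copies each row into place. This is only a sketch: pad_nested and pad_value are illustrative names, not part of PyTorch, and integer (long) data is assumed.

```python
import torch


def pad_nested(batch, pad_value=0):
    """Pad a two-level nested list to a [samples, max_rows, max_cols] tensor.

    Hypothetical helper for illustration; assumes integer entries.
    """
    max_rows = max(len(sample) for sample in batch)
    max_cols = max(len(row) for sample in batch for row in sample)
    out = torch.full((len(batch), max_rows, max_cols), pad_value, dtype=torch.long)
    for i, sample in enumerate(batch):
        for j, row in enumerate(sample):
            # Copy the real values; everything else keeps pad_value.
            out[i, j, :len(row)] = torch.tensor(row, dtype=torch.long)
    return out


target = [[[1, 2, 3], [2, 4, 5, 6]], [[1, 2, 3], [2, 4, 5, 6], [2, 4, 6, 7, 8]]]
padded = pad_nested(target)
print(padded.shape)  # torch.Size([2, 3, 5])
```

Filling a preallocated tensor avoids building intermediate padded lists and keeps the original order of samples and rows.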

rustytnt · March 3, 2019, 4:03pm

@ptrblck Thank you! It worked with a few small modifications:

target = batchY
max_length = max(len(row) for row in target)
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

I also have one last question about how PyTorch embeddings work.
I often write my algorithms from scratch, but I am playing with PyTorch’s built-ins.
Let’s say I pass an input tensor of shape [2, 3, 4]
(sequence length x batch size x vocab) into an embedding layer of shape [4, 5].
Mathematically, I expect PyTorch to broadcast this over the non-matrix dimension,
which in this case is 2. Shouldn’t my output then have the shape 2 x 3 x 5?
Instead I get 2 x 3 x 4 x 5, which is weird from a matrix multiplication point of view. Do you know why this happens?

ptrblck · March 5, 2019, 1:46pm

It seems dim2 is some kind of one-hot encoded vocab?
If so, you should rather pass the vocab indices in a shape of [2, 3] to your embedding layer.
The result will then be [2, 3, 5] as expected.
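
A short sketch of the two shapes, assuming a vocabulary of 4 and an embedding size of 5 as in the question. nn.Embedding is a lookup table: it returns one vector per integer in the input, so every entry of a one-hot tensor is itself looked up as an index.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=4, embedding_dim=5)

# Vocab indices with shape [sequence length, batch size].
indices = torch.randint(0, 4, (2, 3))
from_indices = embedding(indices)
print(from_indices.shape)  # torch.Size([2, 3, 5])

# One-hot input with shape [2, 3, 4]: each 0 or 1 is treated as an index,
# which adds an extra dimension instead of performing a matrix product.
one_hot = F.one_hot(indices, num_classes=4)
from_one_hot = embedding(one_hot)
print(from_one_hot.shape)  # torch.Size([2, 3, 4, 5])
```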

rustytnt · March 6, 2019, 6:06am

@ptrblck Thank you for your help once again! Yes, dim2 is a one-hot encoded vocab. Ah… so you are basically saying not to one hot encode it prior to embedding?

ptrblck · March 6, 2019, 7:22am

Yes, instead pass the index values (4 instead of [0, 0, 0, 0, 1]), e.g. like in the target using nn.CrossEntropyLoss.
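
To illustrate, here is a sketch showing that looking up index 4 gives the same vector as multiplying the one-hot row [0, 0, 0, 0, 1] with the embedding weight matrix, and that nn.CrossEntropyLoss likewise takes class indices as targets. The sizes (5 classes, 3 embedding dimensions) are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

index = torch.tensor([4])
one_hot = F.one_hot(index, num_classes=5).float()  # [[0., 0., 0., 0., 1.]]

looked_up = embedding(index)               # row 4 of the weight matrix
multiplied = one_hot @ embedding.weight    # the explicit matrix product
same = torch.allclose(looked_up, multiplied)
print(same)  # True

# nn.CrossEntropyLoss follows the same convention: targets are class indices.
logits = torch.randn(1, 5)
loss = nn.CrossEntropyLoss()(logits, index)
```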

rustytnt · March 9, 2019, 10:12pm

I have tried padding these input sequences. However, PyTorch’s pack_padded_sequence requires sorting, and order matters for this dataset, so I am stuck on how this works.

inputs = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Any ideas on how to maintain order with nested sequences like these?
@ptrblck

ptrblck · March 9, 2019, 10:19pm

Could you explain a bit more how these values should be sorted?
The order won’t be changed, if you use my (fixed) code snippet:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

rustytnt · March 9, 2019, 11:00pm

Thanks again @ptrblck I am looking at visit data by customer. So within each list is a list of visits for a specific customer. Ex.

customer 1 (two visits): [[1, 2, 2], [1, 2, 2, 3, 4]]
customer 2 (one visit): [[8, 9, 10]]
customer 3 (two visits): [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]

Within each visit are items ordered. So customer 1 ordered items [1,2,2] on visit 1 and items [1,2,2,3,4] on visit two.

If I one-hot encode it, this is the order I expect to get. This works well, but it is manual:

tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],    # Batch 1
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],    # Batch 2
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0.]]])

Here the first visit of each customer is in the first batch, the second visit of each customer is in the second, and so on. However, if I code this up manually I would have to create a mask etc., which is fine, but I would like to try PyTorch’s pack_padded_sequence approach instead, with the intent of keeping the order of the visits. How should I handle a nested list of visits?

Here is the encoding code:

seqs = inputs
lengths = np.array([len(seq) for seq in seqs]) - 1 # remove the last list in each customer's sequence for labels
n_samples = len(lengths)
maxlen = np.max(lengths)

x = torch.zeros(maxlen, n_samples, 12) # maxlen = number of visits, n_samples = number of samples
y = torch.zeros(maxlen, n_samples, 12)
for idx, (seq,label) in enumerate(zip(seqs,labels)): # labels is defined elsewhere (not shown)
    for xi, visit in zip(x[:,idx,:], seq[:-1]):
        xi[visit] = 1.
    for yi, visit in zip(y[:,idx,:], label[1:]):
        yi[visit] = 1.

Thank you in advance!!
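
For the manual route, the mask mentioned above could be derived from the number of visits per customer. This is a small sketch that assumes the time-major layout (visits x customers) of the encoding code and uses the full visit counts rather than the label-shifted ones. The names lengths and mask are illustrative.

```python
import torch

inputs = [[[1, 2, 2], [1, 2, 2, 3, 4]],
          [[8, 9, 10]],
          [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

# Number of visits per customer, in the original customer order.
lengths = torch.tensor([len(seq) for seq in inputs])
max_len = int(lengths.max())

# mask[t, i] is True when customer i really has a visit at step t.
mask = torch.arange(max_len).unsqueeze(1) < lengths.unsqueeze(0)
print(mask.tolist())  # [[True, True, True], [True, False, True]]
```

Multiplying per-step losses or outputs by this mask lets the padded steps be ignored without reordering the customers.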

rustytnt · March 10, 2019, 12:05am

@ptrblck I ended up flattening the inner list of each sequence before padding and that seems to work.

[[1, 2, 2, 1, 2, 2, 3, 4], [8, 9, 10], [1, 2, 2, 3, 4, 1, 2, 2, 5, 6, 7]]

The order is still maintained within each customer’s sequence. Any thoughts?

ptrblck · March 10, 2019, 12:57am

While this might work, you will lose the different visits, won’t you?
I’m sure there is a better way to achieve the result you want, but this code should create your desired tensor:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]


max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

padded = padded.permute(1, 0, 2)  # permute so that batch dim is dim0

size = padded.size()
padded[padded == 0] = padded.max() + 1  # add pseudo index
res = torch.zeros(*size[:2], padded.max()+1).scatter_(2, padded, 1)
res = res[:, :, :-1]  # remove pseudo index
print(res)
> tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0.]]])

rustytnt · March 10, 2019, 1:11am

Thanks again @ptrblck, I know I would lose visits, unfortunately. However, is there no way to embed the encoded inputs without having to multi-hot encode them? Sorry for the confusion, but I already have the one-hot solution. I want to use PyTorch to directly embed the inputs before packing instead. Is there an easy way to do this?

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Are you saying that the easiest way out is to just use the one hot approach?

ptrblck · March 10, 2019, 1:23am

Sorry, I might be slow today, but I’m not sure what you mean by “embed”.
I assume you are not talking about an embedding like the nn.Embedding module.

Maybe let’s first clarify what your goal and input data are, and then we can have a look at the utils.rnn methods and see if they provide something ready-made.

rustytnt · March 10, 2019, 1:31am

No problem. My input data is a nested list of variable lengths. I want to directly use the encoded input list in the following steps:

input = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Pad these inputs:

padded = tensor([[ 1,  2,  2,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 8,  9, 10,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 1,  2,  2,  5,  6,  7]])

Embed this padded tensor: embs = nn.Embedding(vocab, embsize)
Pack: pack_padded_sequence(embs, seq_lengths.cpu().numpy())

and use it in an RNN. My question is: what is the best way to deal with data in this format? Should I just one-hot encode it and write a custom model from scratch, or can I use PyTorch’s pack_padded_sequence? The problem is the sorting step needed to use this torch util.
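
One possible sketch of this pipeline, under several assumptions: the vocabulary size (11), the embedding size (5), the GRU and its hidden size are all made up for illustration, and each visit is represented by the sum of its item embeddings, which counts repeated items twice and so is not identical to a multi-hot encoding. In PyTorch releases where pack_padded_sequence accepts enforce_sorted=False, the sorting is handled internally and the customer order is restored on unpacking.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

visits = [[[1, 2, 2], [1, 2, 2, 3, 4]],
          [[8, 9, 10]],
          [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

n_customers = len(visits)
max_visits = max(len(customer) for customer in visits)
max_items = max(len(visit) for customer in visits for visit in customer)

# Pad to [customers, visits, items]; 0 is reserved as the padding index.
padded = torch.zeros(n_customers, max_visits, max_items, dtype=torch.long)
for i, customer in enumerate(visits):
    for j, visit in enumerate(customer):
        padded[i, j, :len(visit)] = torch.tensor(visit)
visit_counts = torch.tensor([len(customer) for customer in visits])

# padding_idx=0 keeps the padding vector at zero, so it adds nothing to the sum.
embedding = nn.Embedding(num_embeddings=11, embedding_dim=5, padding_idx=0)
visit_vectors = embedding(padded).sum(dim=2)  # [customers, visits, 5]

# enforce_sorted=False: no manual sorting by length is needed.
packed = pack_padded_sequence(visit_vectors, visit_counts,
                              batch_first=True, enforce_sorted=False)
rnn = nn.GRU(input_size=5, hidden_size=8, batch_first=True)
packed_out, hidden = rnn(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)          # torch.Size([3, 2, 8])
print(out_lengths.tolist())  # [2, 1, 2]
```

The visit boundaries are kept this way, in contrast to flattening each customer's visits into one sequence.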

wangyanda · May 13, 2019, 2:04am

I have basically the same input and expected output, and I am looking forward to an excellent solution. @ptrblck, thank you very much.

ptrblck · May 13, 2019, 8:24am

Could you post some dummy input and output?
Is the proposed approach in this topic not working for some reason?

wangyanda · May 13, 2019, 10:36am

Thanks for your response.
The situation is a little different here.
For the input, I have records for a batch of users in the shape max_length*batch_size, and each element is itself a list representing the items the user chose in the corresponding time step. Each column represents the records of one user.
For example, an input could look like this:
input=[
[[3,5,4], [8,5], [3]],
[[6], [6,4,3,5], [7,5,3]],
[[6,5], [2], [2]],
[[2], [0], [0]]
]
Here max_length=4 (the first column), batch_size=3, and sequence_length=[4, 3, 3] for the three users. All elements are lists of different lengths, representing the items a user chose at one step. As you can see, they are zero-padded. Take the first user: the first time he chooses [3,5,4], the second time [6], then [6,5], and at last I put a [2] as the end-of-sequence token.

I want to use nn.Embedding to embed each element and take the average as the input of the RNN at each time step, so my expected output would look like this, with shape=max_length*batch_size*embedding_dim:
output=[
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]],
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]],
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]]
]
With sequence_length=[4,3,3], I could then employ torch.nn.utils.rnn.pack_padded_sequence to get the input for an RNN.

I can’t directly use nn.Embedding(input) since the input has an irregular shape.

So what is the most “pytorch” way to do this ?
Thank you very much.
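
One way this could be approached, sketched under assumptions: item id 0 is treated purely as padding, the vocabulary size (9) and embedding size (4) are invented for the example, and the average is a masked mean over the real items of each step. The variable names are illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

records = [
    [[3, 5, 4], [8, 5], [3]],
    [[6], [6, 4, 3, 5], [7, 5, 3]],
    [[6, 5], [2], [2]],
    [[2], [0], [0]],
]
seq_lengths = torch.tensor([4, 3, 3])  # already sorted in decreasing order

max_len, batch_size = len(records), len(records[0])
max_items = max(len(chosen) for step in records for chosen in step)

# Pad the innermost lists so the input becomes a regular [T, B, K] tensor.
items = torch.zeros(max_len, batch_size, max_items, dtype=torch.long)
for t, step in enumerate(records):
    for b, chosen in enumerate(step):
        items[t, b, :len(chosen)] = torch.tensor(chosen)

embedding = nn.Embedding(num_embeddings=9, embedding_dim=4, padding_idx=0)
vectors = embedding(items)                        # [T, B, K, E]

# Average only over real items; clamp avoids dividing by zero on padded steps.
mask = (items != 0).unsqueeze(-1).float()         # [T, B, K, 1]
counts = mask.sum(dim=2).clamp(min=1)             # [T, B, 1]
averaged = (vectors * mask).sum(dim=2) / counts   # [T, B, E]

packed = pack_padded_sequence(averaged, seq_lengths)
print(averaged.shape)      # torch.Size([4, 3, 4])
print(packed.data.shape)   # torch.Size([10, 4])
```

The packed result can then be fed to an RNN module directly. If the real data ever uses 0 as a genuine item id, a different padding value would be needed.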