Nested list of variable length to a tensor

Hi all,
I am unable to convert my target variable, which is a list of lists, to a torch tensor. This is what it looks like:
target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]. In essence, each sublist is a token. I need the data in this form for the problem I am working on. I was able to pad the outer list to the length of the longest list in my batch with zeros: [[[1,2,3], [2,4,5,6], 0], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]], but I am unable to convert this to a tensor. Instead, I get the error below:

import numpy as np

x = np.array(batchY)  # batchY is the nested target list from above
max_length = max(len(row) for row in x)
x_padded = np.array([row + [0] * (max_length - len(row)) for row in x])
x_padded

TypeError: can’t convert np.ndarray of type numpy.object_. The only supported types are: double, float, float16, int64, int32, and uint8.

Any thoughts on how I can fix this?

>>> import torch
>>> target = [ [[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8,]]]
>>> torch.tensor(target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: expected sequence of length 3 at dim 2 (got 4)
>>> target = [ [[1,2,3, 0, 0], [2,4,5,6, 0], [0, 0, 0, 0, 0]], [[1,2,3, 0, 0], [2,4,5,6, 0], [2,4,6,7,8]]]
>>> torch.tensor(target)
tensor([[[ 1,  2,  3,  0,  0],
         [ 2,  4,  5,  6,  0],
         [ 0,  0,  0,  0,  0]],

        [[ 1,  2,  3,  0,  0],
         [ 2,  4,  5,  6,  0],
         [ 2,  4,  6,  7,  8]]])

There may be a better solution.

Hi @DoubtWang,
Thank you for your response! So I have to pad each inner list… despite it being a timestep… hmm, I'm going to give it a shot.

@ptrblck any suggestions on how I can fix this issue?

This code might work:

target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_length - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

@ptrblck Thank you! it worked with a small bit of modifications:

target = batchY
max_length = max(len(row) for row in target)
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

I also have one last question about how PyTorch embeddings work.
I often write my algorithms from scratch, but I am playing with using PyTorch's built-ins.
However, let's say I pass an input tensor of shape [2, 3, 4]
(sequence length x batch size x vocab) into an embedding layer of [4, 5].
Mathematically, I expect PyTorch to broadcast this over the non-matrix dimension,
which in this case is 2. Shouldn't my output then be of shape 2 x 3 x 5?
Instead I get 2 x 3 x 4 x 5, which is weird matrix-multiplication-wise… do you know why this happens?

It seems dim2 is some kind of one-hot encoded vocab?
If so, you should rather pass the vocab indices in a shape of [2, 3] to your embedding layer.
The result will then be [2, 3, 5] as expected.
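
E.g., a minimal sketch of the difference (not from the original posts; the index values are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=4, embedding_dim=5)   # vocab size 4, embedding dim 5

# vocab indices of shape [seq_len=2, batch_size=3], not one-hot vectors
idx = torch.tensor([[0, 1, 2],
                    [3, 1, 0]])
print(emb(idx).shape)                       # torch.Size([2, 3, 5])

# if the data is already one-hot encoded ([2, 3, 4]), argmax recovers the indices
one_hot = torch.nn.functional.one_hot(idx, num_classes=4).float()
print(emb(one_hot.argmax(dim=-1)).shape)    # torch.Size([2, 3, 5])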


@ptrblck Thank you for your help once again! Yes, dim2 is a one-hot encoded vocab. Ah… so you are basically saying not to one-hot encode it prior to embedding?

Yes, instead pass the index values (4 instead of [0, 0, 0, 0, 1]), e.g. like the target you would pass to nn.CrossEntropyLoss.


I have tried padding these input sequences; however, PyTorch's pack_padded_sequence requires sorting, and order matters for this dataset, so I am stuck on how this works.

inputs = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Any ideas on how to maintain order with nested sequences like these?
@ptrblck

Could you explain a bit more how these values should be sorted?
The order won’t be changed, if you use my (fixed) code snippet:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)
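
E.g., indexing the result should show that the first customer's two visits keep their original order, just zero-padded to the longest visit:

>>> padded[0]
tensor([[1, 2, 2, 0, 0, 0],
        [1, 2, 2, 3, 4, 0]])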

Thanks again @ptrblck. I am looking at visit data by customer, so each list is a list of visits for a specific customer, e.g.:

customer 1 (two visits): [[1, 2, 2], [1, 2, 2, 3, 4]]
customer 2 (one visit):  [[8, 9, 10]]
customer 3 (two visits): [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]

Each visit contains the items ordered: customer 1 ordered items [1, 2, 2] on visit one and items [1, 2, 2, 3, 4] on visit two.

If I one-hot encode it, this is the order I expect to get, and this works well… but it is manual:

# batch 1: first visit of each customer
# batch 2: second visit of each customer
tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0.]]])

Here the first visit of each customer is in the first batch, the second visit of each customer is in the second, and so on. However, if I manually code this up I have to create a mask etc., which is fine, but I am trying out PyTorch's pack_padded_sequence approach instead, with the intent of keeping the visit order. How should I handle a nested list of visits like this?

Here is the encoding code:

import numpy as np
import torch

seqs = inputs
# drop the last visit in each customer's sequence, since it is used for the labels
lengths = np.array([len(seq) for seq in seqs]) - 1
n_samples = len(lengths)
maxlen = np.max(lengths)

x = torch.zeros(maxlen, n_samples, 12)  # maxlen = number of visits, n_samples = customers, 12 = vocab size
y = torch.zeros(maxlen, n_samples, 12)
for idx, (seq, label) in enumerate(zip(seqs, labels)):  # labels is defined elsewhere in my code
    for xi, visit in zip(x[:, idx, :], seq[:-1]):
        xi[visit] = 1.
    for yi, visit in zip(y[:, idx, :], label[1:]):
        yi[visit] = 1.

Thank you in advance!!

@ptrblck I ended up flattening the inner lists of each sequence before padding, and that seems to work, since the order is maintained within each customer's sequence:

[[1, 2, 2, 1, 2, 2, 3, 4], [8, 9, 10], [1, 2, 2, 3, 4, 1, 2, 2, 5, 6, 7]]

Any thoughts?

While this might work, you will lose the different visits, won’t you?
I’m sure there is a better way to achieve the result you want, but this code should create your desired tensor:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]


max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

padded = padded.permute(1, 0, 2)  # permute so that batch dim is dim0

size = padded.size()
padded[padded == 0] = padded.max() + 1  # add pseudo index
res = torch.zeros(*size[:2], padded.max()+1).scatter_(2, padded, 1)
res = res[:, :, :-1]  # remove pseudo index
print(res)
> tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0.]]])

Thanks again @ptrblck, I know I would lose the visits, unfortunately. However, is there no way to embed the encoded inputs without having to multi-hot encode them? Sorry for the confusion. I already have the one-hot solution; I want to instead use PyTorch to directly embed the inputs before packing… is there an easy way to do this?

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Are you saying that the easiest way out is to just use the one-hot approach?

Sorry, I might be slow today, but I’m not sure what you mean by “embed”.
I assume you are not talking about an embedding like the nn.Embedding module.

Maybe let's first clarify what your goal and input data are, and then we can have a look at the utils.rnn methods and see if they provide ready-made solutions. :wink:


No problem. My input data is a nested list of variable lengths. I want to directly use the encoded input list in the following steps:

input = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]
  1. Pad these inputs:
padded = tensor([[ 1,  2,  2,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 8,  9, 10,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 1,  2,  2,  5,  6,  7]])
  2. Embed this padded tensor: embs = nn.Embedding(vocab, embsize)

  3. Pack: pack_padded_sequence(embs, seq_lengths.cpu().numpy())

and use it in an RNN… My question is: what is the best way to deal with data in this format? Should I just one-hot encode it and build a custom model from scratch? Or can I use PyTorch's pack_padded_sequence? However, the problem is the sorting step needed to use this torch util.
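
For reference, the workaround I have seen is to sort before packing, keep the permutation, and undo it after the RNN; a rough sketch of what I mean (all names and sizes here are hypothetical), though I am not sure it is the right approach:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# hypothetical padded vocab indices [batch_size, max_seq_len] and their true lengths
padded = torch.tensor([[1, 2, 2, 0, 0, 0],
                       [8, 9, 10, 0, 0, 0],
                       [1, 2, 2, 3, 4, 0]])
lengths = torch.tensor([3, 3, 5])

emb = nn.Embedding(num_embeddings=11, embedding_dim=4)
embs = emb(padded)                                   # [batch_size, max_seq_len, 4]

# sort by length for packing and remember the permutation
sorted_lengths, sort_idx = lengths.sort(descending=True)
packed = pack_padded_sequence(embs[sort_idx], sorted_lengths, batch_first=True)

rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
packed_out, _ = rnn(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)

# restore the original batch order
_, unsort_idx = sort_idx.sort()
out = out[unsort_idx]                                # [batch_size, longest_length, 8]

As far as I know, newer PyTorch versions also accept enforce_sorted=False in pack_padded_sequence, which handles the sorting internally and keeps the original batch order.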


I have basically the same input and expected output, and I am looking forward to an excellent solution. @ptrblck, thank you very much!

Could you post some dummy input and output?
Is the proposed approach in this topic not working for some reason?

Thanks for your response.
The situation is a little different here.
For the input, I have records for a batch of users in the shape of max_length x batch_size, and each element is itself a list representing the items the user chose at the corresponding time step. Each column represents the records for one user.
For example, an input could look like this:
input = [
[[3,5,4], [8,5], [3]],
[[6], [6,4,3,5], [7,5,3]],
[[6,5], [2], [2]],
[[2], [0], [0]]
]
Here max_length = 4 (the length of the first user's column), batch_size = 3, and sequence_length = [4, 3, 3] for the three users. All elements are lists of different lengths, representing the items a user chose at once. As you can see, the shorter sequences are zero-padded. The first user takes items four times: the first time he chooses [3,5,4], the second time [6], then [6,5], and at the end I put [2] as the end-of-sequence token.

I want to use nn.Embedding to embed each element and take the average as the input to the RNN at each time step, so my expected output would look like this, with shape = max_length x batch_size x embedding_dim:
output=[
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]],
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]],
[[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2],
[0.3, 0.5, 0.6, 0.2]]
]
With sequence_length=[4, 3, 3], I could then employ torch.nn.utils.rnn.pack_padded_sequence to get the input for an RNN.

I can’t directly use nn.Embedding(input) since the input has irregular shape.
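
To make the goal concrete, here is a rough sketch of the masked average I have in mind, with index 0 reserved for padding (all names and sizes here are just for illustration):

import torch
import torch.nn as nn

# items per (time step, user), zero-padded to the longest item list
padded_items = torch.tensor([
    [[3, 5, 4, 0], [8, 5, 0, 0], [3, 0, 0, 0]],
    [[6, 0, 0, 0], [6, 4, 3, 5], [7, 5, 3, 0]],
    [[6, 5, 0, 0], [2, 0, 0, 0], [2, 0, 0, 0]],
    [[2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
])  # [max_length=4, batch_size=3, max_items=4]

emb = nn.Embedding(num_embeddings=11, embedding_dim=4, padding_idx=0)

e = emb(padded_items)                               # [4, 3, 4, emb_dim]
mask = (padded_items != 0).unsqueeze(-1).float()    # 1 for real items, 0 for padding
counts = mask.sum(dim=2).clamp(min=1)               # avoid dividing by zero
mean = (e * mask).sum(dim=2) / counts               # [max_length, batch_size, emb_dim]

# mean could then go into pack_padded_sequence with sequence_length = [4, 3, 3]

nn.EmbeddingBag with mode='mean' might also be an option, but as far as I can tell it expects flat index/offset inputs rather than this nested layout.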

So what is the most “PyTorch” way to do this?
Thank you very much.