Nested list of variable length to a tensor

This code might work:

target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

@ptrblck Thank you! It worked with a few small modifications:

target = batchY
max_length = max(len(row) for row in target)
max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

I also have one last question about how PyTorch embeddings work.
I often write my algorithms from scratch, but I am playing with using PyTorch’s built-ins.
Let’s say I pass an input tensor of shape [2, 3, 4] (sequence length x batch size x vocab) into an embedding layer of shape [4, 5]. Mathematically I expect PyTorch to broadcast over the non-matrix dimension, which in this case is 2, so shouldn’t my output have shape 2 x 3 x 5? Instead I get 2 x 3 x 4 x 5, which is strange matrix-multiplication-wise. Do you know why this happens?

It seems dim2 is some kind of one-hot encoded vocab?
If so, you should rather pass the vocab indices in a shape of [2, 3] to your embedding layer.
The result will then be [2, 3, 5] as expected.


@ptrblck Thank you for your help once again! Yes, dim2 is a one-hot encoded vocab. Ah… so you are basically saying not to one-hot encode it prior to embedding?

Yes, instead pass the index values (4 instead of [0, 0, 0, 0, 1]), e.g. like the target used with nn.CrossEntropyLoss.
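A minimal sketch of this, assuming the shapes from the question above (a [2, 3, 4] one-hot input, vocab size 4, embedding size 5); the one-hot tensor here is just dummy data:

import torch
import torch.nn as nn

one_hot = torch.eye(4)[torch.randint(0, 4, (2, 3))]  # dummy one-hot input of shape [2, 3, 4]
indices = one_hot.argmax(dim=-1)                     # vocab indices of shape [2, 3]
emb = nn.Embedding(num_embeddings=4, embedding_dim=5)
out = emb(indices)
print(out.shape)  # torch.Size([2, 3, 5])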


I have tried padding these input sequences, but PyTorch’s pack_padded_sequence requires sorting, and order matters for this dataset, so I am stuck on how this should work.

inputs = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Any ideas on how to maintain order with nested sequences like these?
@ptrblck

Could you explain a bit more how these values should be sorted?
The order won’t be changed if you use my (fixed) code snippet:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)
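For reference, on this target the snippet should produce a [3, 2, 6] tensor with the customer and visit order unchanged (the output below is written out by hand, so treat it as illustrative):

print(padded)
# tensor([[[ 1,  2,  2,  0,  0,  0],
#          [ 1,  2,  2,  3,  4,  0]],
#
#         [[ 8,  9, 10,  0,  0,  0],
#          [ 0,  0,  0,  0,  0,  0]],
#
#         [[ 1,  2,  2,  3,  4,  0],
#          [ 1,  2,  2,  5,  6,  7]]])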

Thanks again @ptrblck. I am looking at visit data by customer, so each inner list holds the visits for a specific customer. For example:

customer 1 (two visits): [[1, 2, 2], [1, 2, 2, 3, 4]]
customer 2 (one visit):  [[8, 9, 10]]
customer 3 (two visits): [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]

Each visit contains the items ordered: customer 1 ordered items [1, 2, 2] on visit 1 and items [1, 2, 2, 3, 4] on visit 2.

If I one-hot encode it, this is the order I expect to get, and it works well… but it is manual:

# batch 1: first visit of each customer
# batch 2: second visit of each customer
tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0.]]])

Here the first visit of each customer is in the first batch, the second visit of each customer is in the second, and so on. If I code this up manually I would have to create a mask etc., which is fine, but I am trying out PyTorch’s pack_padded_sequence approach instead, with the intent of keeping the visit order intact. How should I handle a nested list of visits?

Here is the encoding code:

import numpy as np
import torch

seqs = inputs
labels = seqs  # assumption for this snippet: labels are the same visit lists, shifted by one step below
lengths = np.array([len(seq) for seq in seqs]) - 1  # drop the last visit of each customer's sequence for the labels
n_samples = len(lengths)
maxlen = np.max(lengths)

x = torch.zeros(maxlen, n_samples, 12)  # maxlen = number of visits, n_samples = customers, 12 = vocab size
y = torch.zeros(maxlen, n_samples, 12)
for idx, (seq, label) in enumerate(zip(seqs, labels)):
    for xi, visit in zip(x[:, idx, :], seq[:-1]):   # inputs: all visits except the last
        xi[visit] = 1.
    for yi, visit in zip(y[:, idx, :], label[1:]):  # targets: all visits except the first
        yi[visit] = 1.

Thank you in advance!!

@ptrblck I ended up flattening the inner lists of each sequence before padding, and that seems to work:

[[1, 2, 2, 1, 2, 2, 3, 4], [8, 9, 10], [1, 2, 2, 3, 4, 1, 2, 2, 5, 6, 7]]
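For example, a flattening along these lines (a small sketch, assuming plain nested Python lists) produces that list:

inputs = [[[1, 2, 2], [1, 2, 2, 3, 4]],
          [[8, 9, 10]],
          [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

# flatten each customer's visits into one sequence; item order is preserved,
# but the visit boundaries are lost
flat = [[item for visit in customer for item in visit] for customer in inputs]
print(flat)
# [[1, 2, 2, 1, 2, 2, 3, 4], [8, 9, 10], [1, 2, 2, 3, 4, 1, 2, 2, 5, 6, 7]]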

The order is maintained within each customer’s sequence. Any thoughts?

While this might work, you will lose the different visits, won’t you?
I’m sure there is a better way to achieve the result you want, but this code should create your desired tensor:

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]


max_cols = max([len(row) for batch in target for row in batch])
max_rows = max([len(batch) for batch in target])
padded = [batch + [[0] * (max_cols)] * (max_rows - len(batch)) for batch in target]
padded = torch.tensor([row + [0] * (max_cols - len(row)) for batch in padded for row in batch])
padded = padded.view(-1, max_rows, max_cols)

padded = padded.permute(1, 0, 2)  # permute so that batch dim is dim0

size = padded.size()
padded[padded == 0] = padded.max() + 1  # add pseudo index
res = torch.zeros(*size[:2], padded.max()+1).scatter_(2, padded, 1)
res = res[:, :, :-1]  # remove pseudo index
print(res)
> tensor([[[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],
         [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.]],

        [[0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0.]]])
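As a side note, a similar multi-hot tensor could also be built with F.one_hot instead of scatter_. This is only a sketch, and it assumes padded is the permuted [visits, customers, max_cols] index tensor from the snippet above, taken before the in-place pseudo-index replacement; duplicated items (e.g. the two 2s) collapse to 1 in both approaches:

import torch.nn.functional as F

num_items = 11                                        # items 0..10 in this toy example
one_hot = F.one_hot(padded, num_classes=num_items)    # [visits, customers, max_cols, 11]
multi_hot = one_hot.sum(dim=2).clamp(max=1).float()   # [visits, customers, 11]
multi_hot[..., 0] = 0.                                # index 0 is padding, so zero that column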

Thanks again @ptrblck. I know I would lose the visits, unfortunately. However, is there no way to embed the encoded inputs without having to multi-hot encode them? Sorry for the confusion, but I already have the one-hot solution; I want to instead use PyTorch to directly embed the inputs before packing. Is there an easy way to do this?

target = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]

Are you saying that the easiest way out is to just use the one hot approach?

Sorry, I might be slow today, but I’m not sure what you mean by “embed”.
I assume you are not talking about an embedding like the nn.Embedding module.

Maybe let’s first clarify what your goal and input data are, and then we can have a look at the utils.rnn methods and see if they provide ready-made utilities. :wink:


No problem. My input data is a nested list of variable lengths. I want to directly use the encoded input list in the following steps:

input = [[[1, 2, 2], [1, 2, 2, 3, 4]],
 [[8, 9, 10]],
 [[1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]
  1. Pad these inputs:
padded = tensor([[ 1,  2,  2,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 8,  9, 10,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 1,  2,  2,  3,  4,  0],
        [ 1,  2,  2,  5,  6,  7]])
  2. Embed this padded tensor: embs = nn.Embedding(vocab, embsize)

  3. Pack it: pack_padded_sequence(embs, seq_lengths.cpu().numpy())

and use it in an RNN. My question is: what is the best way to deal with data in this format? Should I just one-hot encode it and write a custom model from scratch, or can I use PyTorch’s pack_padded_sequence? The problem is the sorting step needed to use this torch util.
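For reference, a rough sketch of steps 1-3 above, assuming a PyTorch version where pack_padded_sequence accepts enforce_sorted=False (which keeps the batch order and avoids the manual sorting). The vocabulary and embedding sizes below are arbitrary, and the all-zero placeholder row from the padded tensor above is left out, since packing zero-length sequences is problematic:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

seqs = [torch.tensor(s) for s in [[1, 2, 2], [1, 2, 2, 3, 4], [8, 9, 10],
                                  [1, 2, 2, 3, 4], [1, 2, 2, 5, 6, 7]]]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs)                          # [max_len, batch], zero-padded
emb = nn.Embedding(num_embeddings=11, embedding_dim=8, padding_idx=0)
embedded = emb(padded)                               # [max_len, batch, 8]

packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
rnn = nn.GRU(input_size=8, hidden_size=16)
out, h = rnn(packed)                                 # batch order is unchanged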


I have basically the same input and expected output. Looking forward to an excellent solution, @ptrblck. Thank you very much!

Could you post some dummy input and output?
Is the proposed approach in this topic not working for some reason?

Thanks for your response.
The situation is a little different here.
For the input, I have records for a batch of users in the shape of max_length x batch_size, and each element is itself a list representing the items the user chose in the corresponding time step. Each column holds the records for one user.
For example, an input could look like this:

input = [
    [[3, 5, 4], [8, 5], [3]],
    [[6], [6, 4, 3, 5], [7, 5, 3]],
    [[6, 5], [2], [2]],
    [[2], [0], [0]]
]

Here max_length = 4 (the length of the first column), batch_size = 3, and sequence_length = [4, 3, 3] for the three users. All elements are lists of different lengths, representing the items a user chose at once, and the shorter user sequences are zero-padded. The first user makes four visits: the first time he chooses [3, 5, 4], the second time [6], then [6, 5], and at last I put [2] as the end-of-sequence token.

I want to use nn.Embedding to embed each element and take the average as the input of the RNN at each time step, so my expected output would look like this, with shape max_length x batch_size x embedding_dim:

output = [
    [[0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2]],
    [[0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2]],
    [[0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2],
     [0.3, 0.5, 0.6, 0.2]]
]

With sequence_length = [4, 3, 3], I could then employ torch.nn.utils.rnn.pack_padded_sequence to get the input for an RNN.

I can’t directly use nn.Embedding(input) since the input has irregular shape.

So what is the most “PyTorch” way to do this?
Thank you very much.
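One possible sketch of this, assuming the nested list is first padded into a [max_length, batch, max_items] index tensor (e.g. with one of the padding snippets in this thread) with 0 as the padding index: embed with padding_idx=0 and take a masked mean over the item dimension. The values below follow the toy example above; the embedding sizes are arbitrary:

import torch
import torch.nn as nn

# [max_length=4, batch=3, max_items=4]; 0 is the padding index
padded = torch.tensor([[[3, 5, 4, 0], [8, 5, 0, 0], [3, 0, 0, 0]],
                       [[6, 0, 0, 0], [6, 4, 3, 5], [7, 5, 3, 0]],
                       [[6, 5, 0, 0], [2, 0, 0, 0], [2, 0, 0, 0]],
                       [[2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]])

emb = nn.Embedding(num_embeddings=11, embedding_dim=4, padding_idx=0)  # vocab size chosen loosely here
embedded = emb(padded)                         # [4, 3, 4, embedding_dim]

mask = (padded != 0).unsqueeze(-1).float()     # 1 for real items, 0 for padding
summed = (embedded * mask).sum(dim=2)          # [4, 3, embedding_dim]
counts = mask.sum(dim=2).clamp(min=1)          # avoid division by zero on empty time steps
averaged = summed / counts                     # [4, 3, embedding_dim], ready to pack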

I might be a bit late to the party, but after realizing that PyTorch won’t spoonfeed me anymore, I ended up writing my own function to pad a list of tensors.

The following function takes a nested list of integers and converts it into a padded tensor.

import torch

def ints_to_tensor(ints):
    """
    Converts a nested list of integers to a padded tensor.
    """
    if isinstance(ints, torch.Tensor):
        return ints
    if isinstance(ints, list):
        if isinstance(ints[0], int):
            return torch.LongTensor(ints)
        if isinstance(ints[0], torch.Tensor):
            return pad_tensors(ints)
        if isinstance(ints[0], list):
            return ints_to_tensor([ints_to_tensor(inti) for inti in ints])

This relies on another function pad_tensors described below:

def pad_tensors(tensors):
    """
    Takes a list of `N` M-dimensional tensors (M<4) and returns a padded tensor.

    The padded tensor is `M+1` dimensional with size `N, S1, S2, ..., SM`
    where `Si` is the maximum value of dimension `i` amongst all tensors.
    """
    rep = tensors[0]
    padded_dim = []
    for dim in range(rep.dim()):
        max_dim = max([tensor.size(dim) for tensor in tensors])
        padded_dim.append(max_dim)
    padded_dim = [len(tensors)] + padded_dim
    padded_tensor = torch.zeros(padded_dim)
    padded_tensor = padded_tensor.type_as(rep)
    for i, tensor in enumerate(tensors):
        size = list(tensor.size())
        if len(size) == 1:
            padded_tensor[i, :size[0]] = tensor
        elif len(size) == 2:
            padded_tensor[i, :size[0], :size[1]] = tensor
        elif len(size) == 3:
            padded_tensor[i, :size[0], :size[1], :size[2]] = tensor
        else:
            raise ValueError('Padding is supported for up to 3D tensors.')
    return padded_tensor

The pad_tensors function only supports tensors of up to 3 dimensions, but that can easily be extended (see the sketch right below). Using these functions should solve the issue.
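For instance, a generalized version could build the index with slices instead of hard-coding the number of dimensions. This is only a sketch of that extension, not part of the original post, and the helper name pad_tensors_any_dim is made up:

import torch

def pad_tensors_any_dim(tensors):
    """Same idea as pad_tensors above, but works for tensors of any dimensionality."""
    padded_dim = [len(tensors)] + [max(t.size(d) for t in tensors)
                                   for d in range(tensors[0].dim())]
    padded = torch.zeros(padded_dim).type_as(tensors[0])
    for i, tensor in enumerate(tensors):
        # index the target slab with one slice per dimension of `tensor`
        padded[(i,) + tuple(slice(0, s) for s in tensor.size())] = tensor
    return padded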

As an example, here’s @rustytnt’s input:

In [4]: target = [[[1,2,3], [2,4,5,6]], [[1,2,3], [2,4,5,6], [2,4,6,7,8]]]

In [5]: ints_to_tensor(target)
Out[5]: 
tensor([[[1, 2, 3, 0, 0],
         [2, 4, 5, 6, 0],
         [0, 0, 0, 0, 0]],

        [[1, 2, 3, 0, 0],
         [2, 4, 5, 6, 0],
         [2, 4, 6, 7, 8]]])

And here’s @wangyanda’s input:

 In [6]: target = [[[3,5,4], [8,5], [3]], [[6], [6,4,3,5], [7,5,3]], [[6,5],[2],[2]], [[2],[0],[0]]]

In [7]: ints_to_tensor(target)
Out[7]: 
tensor([[[3, 5, 4, 0],
         [8, 5, 0, 0],
         [3, 0, 0, 0]],

        [[6, 0, 0, 0],
         [6, 4, 3, 5],
         [7, 5, 3, 0]],

        [[6, 5, 0, 0],
         [2, 0, 0, 0],
         [2, 0, 0, 0]],

        [[2, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]])

Thank you very much!!! Works like a charm

Not sure if this helps, but here is a solution I found on SO:

and here are solutions if the lengths are the same:

Hi, I’m definitely late to the party, but I had this exact problem with deeply nested lists (for instance, multiple PDFs of multiple pages of multiple lines of multiple words) and came up with this solution: GitHub - aphp/foldedtensor: PyTorch extension for handling deeply nested sequences of variable length. It optimizes the conversion in C (it only handles nested Python lists, not lists of tensors, at the moment). Here is a small benchmark:

from foldedtensor import as_folded_tensor
import random
def make_nested_list(arg, *rest, value):
    size = random.randint(*arg) if isinstance(arg, tuple) else arg
    if not rest:
        return [value] * size 
    return [make_nested_list(*rest, value=value) for _ in range(size)]

Variable length nested lists

nested_list = make_nested_list(32, (50, 100), (25, 30), value=1)

%timeit ints_to_tensor(nested_list)
# 20.4 ms ± 241 µs per loop
%timeit as_folded_tensor(nested_list)
# 1.11 ms ± 33.9 µs per loop

The mask can be accessed via as_folded_tensor(nested_list).mask and the tensor can be “refolded” dynamically (flatten the second or third dimension for instance), e.g. tensor.refold(0, 2) to flatten the second dimension.

Same length nested lists (compatible with torch.tensor)

nested_list = make_nested_list(32, 100, 30, value=1)

%timeit torch.tensor(nested_list)
# 11.4 ms ± 435 µs per loop
%timeit ints_to_tensor(nested_list)
# 25.6 ms ± 1.57 ms per loop
%timeit as_folded_tensor(nested_list)
# 1.77 ms ± 37 µs per loop (faster than torch.tensor 🎉)

Simple list

nested_list = make_nested_list(10000, value=1)

%timeit torch.as_tensor(nested_list)
# 1.06 ms ± 42 µs per loop
%timeit ints_to_tensor(nested_list)
# 441 µs ± 14 µs per loop
%timeit as_folded_tensor(nested_list)
# 135 µs ± 2.86 µs per loop