How to create a DataLoader from variable-length 3D arrays without padding, then use nn.AdaptiveAvgPool3d?

I have an embed_arr_list, which is a list of 3D arrays whose third dimension varies in size:

for arr in embed_arr_list:
    print(arr.shape)

(50, 128, 331)
(50, 128, 331)
(50, 128, 331)
(50, 128, 331)
(50, 128, 201)
(50, 128, 201)
(50, 128, 532)
(50, 128, 532)
(50, 128, 532)
(50, 128, 532)

len(embed_arr_list)
10

I also have a target_list, containing the targets (floats).

len(target_list)
10

Then I want to build a DataLoader:

import torch
import numpy as np
from torch.utils.data import TensorDataset, DataLoader

tensor_x = torch.Tensor(embed_arr_list) # transform to torch tensor
tensor_y = torch.Tensor(target_list)

my_dataset = TensorDataset(tensor_x, tensor_y) # create your dataset
my_dataloader = DataLoader(my_dataset) # create your dataloader

However, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-84-f293256d52d8> in <module>
      3 from torch.utils.data import TensorDataset, DataLoader
      4 
----> 5 tensor_x = torch.Tensor(embed_arr_list) # transform to torch tensor
      6 tensor_y = torch.Tensor(target_list)
      7 

ValueError: expected sequence of length 331 at dim 3 (got 201)

How can I feed the variable-length 3D arrays to the DataLoader without padding?

I ask because I want to use nn.AdaptiveAvgPool3d, which, as I understand it, should adapt to variable input dimensions.
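
For context, here is a minimal sketch of what I mean (the target size (50, 128, 64) is an arbitrary choice for illustration):

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool3d((50, 128, 64))

# AdaptiveAvgPool3d expects (N, C, D, H, W) or (C, D, H, W) input,
# so a channel dim is added in front of each (50, 128, T) sample
a = torch.randn(50, 128, 331).unsqueeze(0)
b = torch.randn(50, 128, 201).unsqueeze(0)
print(pool(a).shape)  # torch.Size([1, 50, 128, 64])
print(pool(b).shape)  # torch.Size([1, 50, 128, 64])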

Thank you!

I assume you want one of three things: return a single sample from the __getitem__ method of your Dataset and apply the pooling layer afterwards, apply the pooling inside __getitem__ itself, or return a list of samples with different shapes from __getitem__ and apply the pooling to each sample inside the training loop.
In all cases you should write a custom Dataset and implement your logic there, since TensorDataset expects tensors as its input, which will not work here: samples with different shapes cannot be stacked/concatenated into a single tensor. (Nested tensors could work, but I think this utility is still a work in progress.) A sketch of the third option is below.
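
A minimal sketch of the third option, based on the shapes from your post (the class and collate function names are made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class VariableLengthDataset(Dataset):
    def __init__(self, embed_arr_list, target_list):
        # keep each sample as its own tensor; shapes may differ between samples
        self.data = [torch.as_tensor(arr, dtype=torch.float32) for arr in embed_arr_list]
        self.targets = torch.tensor(target_list, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

def list_collate(batch):
    # return the samples as a list instead of stacking them, since the
    # default collate_fn would fail on mismatched shapes
    data_list = [x for x, _ in batch]
    targets = torch.stack([y for _, y in batch])
    return data_list, targets

dataset = VariableLengthDataset(embed_arr_list, target_list)
loader = DataLoader(dataset, batch_size=4, collate_fn=list_collate)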

Hi @ptrblck! If I go with the third option (i.e. return a list of samples with different shapes from __getitem__ and apply the pooling to each sample inside the training loop), does that mean each training step will use only one sample? I think that would make training quite slow.

Yes, that’s correct if the pooling layer is inside the model. However, I thought you would like to apply the pooling layer to the samples first, before passing them to the model:

for data_list in loader:
    x = []
    for d in data_list:
        # pool each variable-length sample to a common fixed size
        x.append(pool(d))
    # all pooled samples now share the same shape and can be stacked
    x = torch.stack(x)
    out = model(x)
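
One caveat with this sketch: nn.AdaptiveAvgPool3d expects a 4D (C, D, H, W) or 5D (N, C, D, H, W) input, so for your (50, 128, T) samples you would need to add a channel dim first. A runnable version, assuming the list-collating loader sketched above, your model, and an arbitrary pooled size:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool3d((50, 128, 64))  # arbitrary fixed target size

for data_list, target in loader:
    x = []
    for d in data_list:
        # (50, 128, T) -> (1, 50, 128, T) so the 3d pooling layer accepts it
        x.append(pool(d.unsqueeze(0)))
    x = torch.stack(x)  # (batch_size, 1, 50, 128, 64)
    out = model(x)

Since only the last dimension varies in your data, applying nn.AdaptiveAvgPool1d(64) directly to each (50, 128, T) sample would achieve the same pooling over that dimension without the extra dim.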