Issue with example stacking in batch in DataLoader

Hi all, I am having an issue with the division of my training set into batches using torch.utils.data.DataLoader. The full error message is reported at the end of this post.
I am working with a torchvision.datasets.DatasetFolder of numpy arrays of dimension (128, 128, 3) to train a Generative Adversarial Network with progressive growing, starting from a 4x4 resolution. These are the functions I am using to create the dataset and the loader:

def npy_loader(path):
    sample = np.load(path, allow_pickle=True)
    return sample

def get_loader(image_size):
    transform = transforms.Compose([
        CustomNormalization(),  # normalization of the inputs
        NumpyToTensor3D(),      # output tensor is in the shape (C x H x W) so it works with transforms.Resize()
        transforms.Resize((image_size, image_size)),
    ])
    batch_size = 64
    dataset = datasets.DatasetFolder(
        root=config.DATASET, loader=npy_loader, extensions='.npy', transform=transform
    )
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,   # typical settings; the exact arguments are not essential here
        num_workers=2,  # the error occurs in a worker process, so num_workers > 0
    )
    return loader, dataset

The error happens in the first epoch, as soon as the loader reaches the batch containing an example with a mismatched shape, so stacking the batch fails. The loop through the batches is defined in the following way:

loop = tqdm(loader, leave=True)
for batch_idx, (real, _) in enumerate(loop):

As you can see, there are two custom-made transformation classes. I tried running the code without them and got the same error, so apparently they are not the problem. I also tried running the code without transforms.Resize() and still got the error.

I also tried changing the batch size to 32 and 128 but still got the error, always at the same example of the training set (n. 403756). I have checked the training set several times and the original shape of each file is correct.
If I set the batch size to 1 the error does not happen anymore, so I suspect there is some mistake in the definition of the DataLoader, such that stacking the examples into batches fails because of a dimension mismatch.
Also, I am running the training with approximately 1,600,000 examples in the training set, but if I reduce the training set to a smaller sample (around 70,000 examples) the training runs fine and I don't get the error. Could the dataset size be the problem? That does not seem plausible to me.
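
For reference, the failure can be reproduced in isolation: the default collate function simply stacks the samples of a batch with torch.stack, which requires equal sizes. A minimal sketch with the shapes from my error message:

import torch
from torch.utils.data import default_collate

good = torch.randn(3, 4, 4)        # the expected sample shape at the 4x4 stage
bad = torch.randn(481, 128, 4, 4)  # the odd shape reported at entry 44

print(default_collate([good, good]).shape)  # torch.Size([2, 3, 4, 4])
default_collate([good, bad])  # RuntimeError: stack expects each tensor to be equal size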

Has anyone else experienced a similar problem? Do you have any suggestions?
Thanks in advance for your help.

Traceback (most recent call last):
  File "/home/b/b382153/ProGAN_pytorch/", line 197, in <module>
    main()
  File "/home/b/b382153/ProGAN_pytorch/", line 174, in main
    tensorboard_step, alpha = train_fn(
  File "/home/b/b382153/ProGAN_pytorch/", line 70, in train_fn
    for batch_idx, (real, _) in enumerate(loop):
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/tqdm/", line 1195, in __iter__
    for obj in iterable:
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/", line 628, in __next__
    data = self._next_data()
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/", line 1333, in _next_data
    return self._process_data(data)
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/", line 1359, in _process_data
    data.reraise()
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 61, in fetch
    return self.collate_fn(data)
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 143, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 143, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 120, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/sw/spack-levante/mambaforge-22.9.0-2-Linux-x86_64-wuuo72/lib/python3.10/site-packages/torch/utils/data/_utils/", line 163, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3, 4, 4] at entry 0 and [481, 128, 4, 4] at entry 44

Do you recognize the shapes reported in the error message and, if so, which one is the expected one?

Assuming your Dataset.__getitem__ returns an image tensor, I would guess the first one is the valid one, even though the spatial size of 4x4 seems quite small.
What is each sample supposed to return, in particular which number of channels and spatial size?


Hey @ptrblck, thanks for replying.
Yes, the first shape [3, 4, 4] is the correct one. The 4x4 size is indeed small, but that is because of how the progressive growing is done during training: it starts from a low input resolution and progressively shifts to higher resolutions, so there is a 4x4 phase of training, then 8x8, then 16x16, and so on up to 128x128. This method provides several advantages during the training phase.
Each sample is supposed to be of the shape [3, 4, 4] so for a batch size of 32 the shape of the batches will be [32, 3, 4, 4].
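
A quick sanity check on a single batch confirms this (a sketch using the get_loader from my first post at the 4x4 stage):

loader, dataset = get_loader(image_size=4)
real, _ = next(iter(loader))
print(real.shape)  # torch.Size([64, 3, 4, 4]) with the batch_size of 64 hardcoded in get_loader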

Thanks for the description. Could you add debug transformations showing the shape of the input to your custom transforms, to narrow down whether any of your transforms is changing the shape?
Something like this might work:

class PrintTransform:
    def __call__(self, x):
        if isinstance(x, PIL.Image.Image):
            print("PIL.Image with size {}".format(x.size))
        elif isinstance(x, (torch.Tensor, np.ndarray)):
            print("{} with shape {}".format(type(x), x.shape))
        return x

transform = transforms.Compose([
    PrintTransform(),
    transforms.Resize((224, 224)),
    PrintTransform(),
    transforms.ToTensor(),  # added so Normalize receives a tensor
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    PrintTransform(),
])

img = transforms.ToPILImage()(torch.randint(0, 256, (3, 256, 256)).byte())
out = transform(img)

but you might need to add more checks in case you are using other input classes.


Hey @ptrblck, I ran the code with all the transformations removed and the error keeps appearing, so it seems they do not cause it. Could it be something in the definition of the dataset with datasets.DatasetFolder? I debugged the code further and the unwanted change of shape happens there.

Yes, this would be my next suggestion: check the shape of each sample returned by your Dataset. If needed, you could also create a custom Dataset deriving from ImageFolder, which would allow you to return each sample's path together with the corresponding data, or you could add debugging print statements directly into the ImageFolder class by editing the Python script. You can check the location of this file via print(torch.__path__), which points to the root folder where your current PyTorch binary is installed.
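
A minimal sketch of such a check, assuming the DatasetFolder instance from your get_loader is called dataset (DatasetFolder already stores the (path, class_index) pairs in its .samples attribute, so the offending file can be printed directly):

expected = (3, 4, 4)  # the shape every sample should have at the current stage
for idx in range(len(dataset)):
    sample, _ = dataset[idx]  # applies npy_loader and the transforms
    if tuple(sample.shape) != expected:
        path, _ = dataset.samples[idx]
        print(f"index {idx}: shape {tuple(sample.shape)} from {path}")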


Okay, thank you very much for the help. So, despite my working with numpy arrays, would you still suggest using ImageFolder rather than DatasetFolder?

I don't think it would make a difference, since ImageFolder would only additionally create the self.imgs attribute, as seen here, so you can also use the DatasetFolder class directly.
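
Either way the file list is available, so you could look up the path of the failing index directly (a quick sketch; I'm using the index 403756 from your earlier post and assuming it is 0-based):

path, target = dataset.samples[403756]  # DatasetFolder exposes (path, class_index) pairs
print(path)
# ImageFolder would merely add dataset.imgs, which holds the same list.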
