Unexpected DataLoader behaviour

I'm currently working on a custom dataset/DataLoader and it shows unexpected behaviour. If a single batch is pulled with next(enumerate(dataloader)), everything works without issue. However, as soon as the DataLoader is iterated in a loop, the problems start.
The dataset code is as follows:

import h5py
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, csv_path):
        self.paths_df = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.paths_df)

    def __getitem__(self, idx):
        id = self.paths_df.id[idx]
        y = np.float32(self.paths_df.target[idx])
        filename = f'../input/g2net-detecting-continuous-gravitational-waves/train/{id}.hdf5'
        img = np.empty((2, 360, 128), dtype=np.float32)

        with h5py.File(filename, 'r') as f:
            g = f[id]

            for ch, s in enumerate(['H1', 'L1']):
                a = g[s]['SFTs'][:, :4096] * 1e22  # Fourier coefficients, complex64

                p = a.real**2 + a.imag**2  # power
                p /= np.mean(p)  # normalize
                # print(p.shape)
                p = np.mean(p.reshape(360, 128, 32), axis=2)  # compress 4096 -> 128
                img[ch] = p

        # print(img.shape)
        return img, y

The following code section works without any issue:

dataset = CustomDataset('path_to_csv/file.csv')
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True, num_workers=4)
loop = enumerate(train_dataloader)
idx, data = next(loop)

However, if the DataLoader is iterated in a loop as follows:

train_dataloader1 = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=False, num_workers=2)
dataiter = iter(train_dataloader1)  # creating an iterator

for data, label in train_dataloader1:
    print(label)

it gives the error shown below:
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/tmp/ipykernel_27/4159678793.py", line 24, in __getitem__
    p = np.mean(p.reshape(360, 128, 32), axis=2)  # compress 4096 -> 128
ValueError: cannot reshape array of size 258480 into shape (360,128,32)

The error comes from the array reshaping and occurs only when the DataLoader is iterated in a loop; since the loop returns whole batches, that is most likely what makes the reshape fail. How can this be avoided, and why does it happen? It seems specific to the way PyTorch handles the DataLoader.
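For what it's worth, a direct loop over the dataset (a minimal diagnostic sketch; no DataLoader involved, so batching and worker processes are out of the picture) can show whether batching is really the cause:

# If this also raises the ValueError, the problem is in the data
# itself rather than in how the DataLoader batches it.
for idx in range(len(dataset)):
    try:
        img, y = dataset[idx]
    except ValueError as e:
        print(f'sample {idx} failed: {e}')
        break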

Hi Elidor,

Could you please uncomment the print(p.shape) line and post the shape of p? I'm not sure why batching would cause this error.
I would think the items are first fetched one by one via __getitem__ and only then collated into a batch before being returned. I might be wrong here, so I'll cross-check the DataLoader source code in the meantime.
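If it helps, here is a toy sketch (independent of the data above) showing that each index goes through __getitem__ on its own before the results are stacked into a batch:

import torch
from torch.utils.data import Dataset, DataLoader

class Probe(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        print('fetching index', idx)          # one call per sample
        return torch.zeros(2, 3), float(idx)  # fixed-shape dummy item

loader = DataLoader(Probe(), batch_size=4, num_workers=0)
imgs, labels = next(iter(loader))
print(imgs.shape)  # torch.Size([4, 2, 3]): items were stacked after fetching

So batching happens after the per-sample calls, which means a reshape error inside __getitem__ points at an individual sample rather than at the batch.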

It seems that the code itself is fine: the dataset contains some samples whose arrays have a different shape (258480 = 360 × 718, so the failing sample most likely has only 718 frequency columns instead of 4096), and __getitem__ doesn't account for differently sized tensors (or NumPy arrays, as used here). So in the end this is not a PyTorch issue, and the code runs well with next() simply because that single call never reaches one of the wrongly shaped samples.
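If the loader needs to survive those samples, one option (a sketch, not the only fix: zero-padding short samples is an assumption, and filtering them out of the CSV may suit the model better) is to force the SFT slice to a fixed width inside __getitem__:

# Defensive version of the per-channel block: pad or crop the slice to
# exactly 4096 columns so the reshape always succeeds. Zero-padding is
# an assumption about what the model can tolerate.
a = g[s]['SFTs'][:, :4096] * 1e22
if a.shape[1] < 4096:
    a = np.pad(a, ((0, 0), (0, 4096 - a.shape[1])), mode='constant')

p = a.real**2 + a.imag**2
p /= np.mean(p)
p = np.mean(p.reshape(360, 128, 32), axis=2)  # now always 4096 -> 128
img[ch] = p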