Hello all, I am having trouble understanding how the dataloader works internally, especially when we define the number of workers. I noticed a weird behavior and made a minimal code snippet replicating the issue. Here is the dataset class.
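import torch
from torch.utils.data import Dataset, DataLoader

class testClass(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 1000

    def __getitem__(self, item):
        # Log every access so the number of calls can be counted
        print("Accessing the __getitem__ method")
        return torch.rand(10)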
I instantiate the class and test it as follows:
dataset = testClass()
dataloader = DataLoader(dataset, batch_size=5, num_workers=5)
for _, data in enumerate(dataloader):
    print(data.shape)
    print('--------------------------------------------------')
    break
When batch_size is 5 and num_workers is 5, the output is as follows:
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
torch.Size([5, 10])
--------------------------------------------------
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
There are two things I do not understand. First, why is the __getitem__ method called so many times? Since my batch size is 5, I expect it to be called only 5 times. Second, why is it still being called after the dashed line is printed? By then I have already received my first batch of data and hit the break, so I would expect nothing to be printed after the dashed line. The behaviour is also not consistent: sometimes lines are printed after the dashed line and sometimes not.
The same thing happens when I specify num_workers=1 and batch_size=1. The output for this combination is as follows:
Accessing the __getitem__ method
Accessing the __getitem__ method
torch.Size([1, 10])
--------------------------------------------------
Again, __getitem__ is called twice.
The only time I see the expected behavior is when I do not pass the num_workers argument. For example, with batch_size=1 and num_workers not passed, the output is as follows:
Accessing the __getitem__ method
torch.Size([1, 10])
--------------------------------------------------
and for batch_size=5 the output is as follows (again without passing the num_workers argument to DataLoader):
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
Accessing the __getitem__ method
torch.Size([5, 10])
--------------------------------------------------
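For reference, the call counts in the outputs above line up with a simple prefetching model. This is only a rough sketch, assuming that with num_workers > 0 the loader queues prefetch_factor batches per worker before the first batch is returned (prefetch_factor defaults to 2 in recent PyTorch versions), and that with num_workers=0 samples are fetched one batch at a time in the main process. The helper name expected_getitem_calls is hypothetical and not part of PyTorch.

```python
def expected_getitem_calls(batch_size, num_workers=0, prefetch_factor=2):
    """Roughly estimate __getitem__ calls triggered by fetching one batch.

    Assumption (not taken from the original post): with num_workers > 0 the
    loader queues num_workers * prefetch_factor batches up front; with
    num_workers == 0 only the requested batch is loaded.
    """
    if num_workers == 0:
        # Single-process loading: exactly one batch worth of samples.
        return batch_size
    # Multi-process loading: every worker prefetches several whole batches.
    return num_workers * prefetch_factor * batch_size


# 45 lines before the dashed line + 5 after it in the first output
print(expected_getitem_calls(batch_size=5, num_workers=5))  # 50
# The num_workers=1, batch_size=1 case
print(expected_getitem_calls(batch_size=1, num_workers=1))  # 2
# num_workers not passed
print(expected_getitem_calls(batch_size=5))  # 5
print(expected_getitem_calls(batch_size=1))  # 1
```

If this model is right, the lines printed after the dashed line would simply be prefetched samples whose output arrives late from the worker processes, which would also explain why they do not appear on every run.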
When dealing with my original issue, I realized that I had been loading gibberish data all along, that is, wrong labels for the corresponding input data…
What am I doing wrong here?