DataLoader not working properly

I am using a custom dataset class and then a DataLoader to load my data.

train_loader = torch.utils.data.DataLoader(
    dataset_train,
    sampler=train_sampler,
    batch_size=args.batch_size_per_gpu,
    num_workers=args.num_workers,
    pin_memory=True,
)
print(f"Data loaded with {len(dataset_train)} train imgs.")

The length of the dataset is 85 images.

print("for training model,size of train loader", len(train_loader.dataset))

But when I iterate through the DataLoader, it only yields 2 batches of 16 and 6 images (with a batch size of 16), so only 22 images. I don’t know what happens to the rest of the images. I am using the DINO tool’s eval_linear script https://github.com/facebookresearch/dino/blob/main/eval_linear.py.
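To count what the loader actually yields, I iterate over it roughly like this (a simplified sketch of the training loop shown further below; the running total is only added here for illustration):

total = 0
for it, sample in enumerate(train_loader):
    print('batch', it, 'has', sample['image'].shape[0], 'images')
    total += sample['image'].shape[0]
print('total images seen:', total)  # 22 instead of 85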

I had to modify that code for my custom dataset as well, so it looks like this:

def train(model, linear_classifier, optimizer, train_loader, epoch, n, avgpool):
    linear_classifier.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)
    all_labels = []

    for it, sample in enumerate(metric_logger.log_every(train_loader, 10)):
        # each sample is a dict returned by my custom dataset
        # print('type of sample is ...', sample)
        images = sample['image']

        # extract the label
        index = sample['label']
        all_labels.append(index)
        print('label is ...', index)

        # move to GPU; index is the target/label value
        images = images.cuda(non_blocking=True)
        index = index.cuda(non_blocking=True)

Interestingly, the code works perfectly fine with torch 1.7.0 and torchmetrics 0.8.2, but due to GPU incompatibility I needed to upgrade both, and the upgrade introduced this weird issue.

I even provided a batch size of 85, but it still only iterates through 22 images. The length of the dataset in the DataLoader just before iterating is 85 images, so I don’t know why the rest of the images are not trained and the batches are not divided accordingly. There should be 5 batches with 16 images and a last batch with 5 images, so 6 batches in total.

I also tried changing the print_freq of the metric logger, but that did not help either.

Can anyone provide any suggestion?

I tried checking the number of batches using the code in this post.

import math
N, batch_size = 85, 16  # dataset size and batch size
print('expected number of batches {}'.format(
    math.ceil(N / batch_size)))  # prints 6

And it printed out 6 batches as well. So I am not sure where the rest of the batches go.

I also checked another example: when I load 573 images, only 144 images are trained (batch size 16, 9 batches of 16 = 144 images). Looking at the ratios, 144/573 is approximately 25%, and so is 22/85. So how can I get the remaining ~75% of the images back?
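The quick arithmetic behind that ratio comparison:

print(22 / 85)    # ~0.259
print(144 / 573)  # ~0.251
# in both runs only about a quarter of the dataset is iterated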

The length of the dataset (and thus total number of samples) is irrelevant if you provide a custom sampler.
Here is an example:

import torch
from torch.utils.data import DataLoader

dataset = torch.utils.data.TensorDataset(torch.randn(100, 1))
print(len(dataset))
# 100

loader = DataLoader(dataset, batch_size=10)
print(len(loader.dataset))
# 100
print(len(loader.sampler))
# 100
print(len(loader))
# 10

class MySampler(torch.utils.data.sampler.Sampler):
    def __init__(self, length):
        self.length = length
        
    def __iter__(self):
        return iter(range(self.length))
    
    def __len__(self):
        return self.length

sampler = MySampler(1)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)
print(len(loader.dataset))
# 100
print(len(loader.sampler))
# 1
print(len(loader))
# 1

So check how train_sampler is defined.

So I don’t think I am using a custom sampler. This is how it is defined:
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset_train)

How can I change this so that it uses all of the images?

What is the length of the sampler?
DistributedSampler splits the dataset using the world size so that each rank only loads its subset.
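Here is a small sketch of that effect with your numbers, assuming a world size of 4 based on the ~25% ratio you observed; num_replicas and rank are passed explicitly so it runs without an initialized process group, your actual world size may differ:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(85, 1))

# simulate a world size of 4: each rank is assigned ceil(85 / 4) = 22 samples
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)
print(len(sampler))  # 22

loader = DataLoader(dataset, batch_size=16, sampler=sampler)
print(len(loader))   # 2 batches on this rank (16 + 6)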

Also, if I check the length of the loader itself, it returns 2 when it should be 6 (in the case of 85 images with batch size 16). So I don’t think the sampler affects this; might it be an issue with the loader?

No, as already explained and shown in my code snippet: I explicitly defined the sampler to have a length of 1, which then also returns a single batch containing a single sample. You can copy/paste my code snippet and run it, or you could just check the len of the train_sampler, which will most likely be smaller than 85 images.
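For example, with the names from your snippet:

print(len(train_loader.dataset))  # 85, the full dataset
print(len(train_sampler))         # samples assigned to this rank
print(len(train_loader))          # batches per epoch on this rank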

So the length of the train_sampler is 1.