DataLoader not working properly

I am using a custom dataset class and then a DataLoader to load my data.

train_loader = torch.utils.data.DataLoader(
    dataset_train,
    sampler=train_sampler,
    batch_size=args.batch_size_per_gpu,
    num_workers=args.num_workers,
    pin_memory=True,
)
print(f"Data loaded with {len(dataset_train)} train imgs.")

The length of the dataset is 85 images.

print("for training model,size of train loader", len(train_loader.dataset))

But when I iterate through the DataLoader, it only yields 2 batches of 16 and 6 images (with a batch size of 16), so only 22 images. I don’t know what happens to the rest of the images. I am using the DINO tool’s eval_linear script https://github.com/facebookresearch/dino/blob/main/eval_linear.py.
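To count what the loader actually yields, I iterate over it roughly like this (a simplified sketch of the training loop shown further below; the running total is only added here for illustration):

total = 0
for it, sample in enumerate(train_loader):
    print('batch', it, 'has', sample['image'].shape[0], 'images')
    total += sample['image'].shape[0]
print('total images seen:', total)  # 22 instead of 85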

I had to modify that code for my custom dataset as well, so it looks like this:

def train(model, linear_classifier, optimizer, train_loader, epoch, n, avgpool):
    linear_classifier.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)
    all_labels = []

    for it, sample in enumerate(metric_logger.log_every(train_loader, 10)):
        # each sample is a dict returned by my custom dataset
        # print('type of sample is ...', sample)
        images = sample['image']

        # extract the label
        index = sample['label']
        all_labels.append(index)
        print('label is ...', index)

        # move to GPU; index is the target/label value
        images = images.cuda(non_blocking=True)
        index = index.cuda(non_blocking=True)

Interestingly, the code works perfectly fine with torch 1.7.0 and torchmetrics 0.8.2, but due to GPU incompatibility I needed to upgrade both, and the upgrade introduced this weird issue.

I even provided a batch size of 85, but it still only iterates through 22 images. The length of the dataset in the DataLoader just before iterating is 85 images, so I don’t know why the rest of the images are not trained and the batches are not divided accordingly. There should be 5 batches with 16 images and a last batch with 5 images, so 6 batches in total.

I also tried changing the print_freq of the metric logger, but that did not help either.

Can anyone provide any suggestion?

I tried checking the number of batches using the code in this post.

import math
N, batch_size = 85, 16  # dataset size and batch size
print('expected number of batches {}'.format(
    math.ceil(N / batch_size)))  # prints 6

And it printed out 6 batches as well. So I am not sure where the rest of the batches go.

I also checked another example: when I load 573 images, only 144 images are trained (batch size 16, 9 batches of 16 = 144 images). Looking at the ratios, 144/573 is approximately 25%, and so is 22/85. So how can I get the remaining ~75% of the images back?
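The quick arithmetic behind that ratio comparison:

print(22 / 85)    # ~0.259
print(144 / 573)  # ~0.251
# in both runs only about a quarter of the dataset is iterated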

The length of the dataset (and thus total number of samples) is irrelevant if you provide a custom sampler.
Here is an example:

import torch
from torch.utils.data import DataLoader

dataset = torch.utils.data.TensorDataset(torch.randn(100, 1))
print(len(dataset))
# 100

loader = DataLoader(dataset, batch_size=10)
print(len(loader.dataset))
# 100
print(len(loader.sampler))
# 100
print(len(loader))
# 10

class MySampler(torch.utils.data.sampler.Sampler):
    def __init__(self, length):
        self.length = length
        
    def __iter__(self):
        return iter(range(self.length))
    
    def __len__(self):
        return self.length

sampler = MySampler(1)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)
print(len(loader.dataset))
# 100
print(len(loader.sampler))
# 1
print(len(loader))
# 1

So check how train_sampler is defined.

So I don’t think I am using a custom sampler. This is how it is defined:
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset_train)

How can I change this so that it uses all of the images?

What is the length of the sampler?
DistributedSampler splits the dataset using the world size so that each rank only loads its subset.
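Here is a small sketch of that effect with your numbers, assuming a world size of 4 based on the ~25% ratio you observed; num_replicas and rank are passed explicitly so it runs without an initialized process group, your actual world size may differ:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(85, 1))

# simulate a world size of 4: each rank is assigned ceil(85 / 4) = 22 samples
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)
print(len(sampler))  # 22

loader = DataLoader(dataset, batch_size=16, sampler=sampler)
print(len(loader))   # 2 batches on this rank (16 + 6)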

Also, if I check the length of the loader itself, it returns 2 when it should be 6 (in the case of 85 images with batch size 16). So I don’t think the sampler affects this; might it be an issue with the loader?

No, as already explained and shown in my code snippet: I explicitly defined the sampler to have a length of 1, which then also returns a single batch containing a single sample. You can copy/paste my code snippet and run it, or you could just check the len of the train_sampler, which will most likely be smaller than 85 images.
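For example, with the names from your snippet:

print(len(train_loader.dataset))  # 85, the full dataset
print(len(train_sampler))         # samples assigned to this rank
print(len(train_loader))          # batches per epoch on this rank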

So the length of the train_sampler is 1.