DataLoader shuffle

Hi, sorry for the naive questions, I've just started learning.

I have a custom dataset that loads an image and its label, which I use with a DataLoader with shuffle=True. My questions are:

  1. When the DataLoader shuffles the batches, does it shuffle both the images and the labels, or just the images? I ask because the answer here https://stackoverflow.com/questions/65402802/pytorch-shuffle-dataloader?rq=1 says that only the images are shuffled, not the labels.

  2. I need to accumulate the prediction outputs across the whole epoch to compute some epoch-level scores. My code is below, using accuracy for simplicity. Is it correct to simply accumulate the prediction outputs across all batches of train_loader and compute the scores at the end of the epoch?

Thank you

## Custom dataset class
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, label_csv):
        self.label_df = pd.read_csv(label_csv)  # CSV columns: <img_id>, <label>

    def __len__(self):
        return len(self.label_df)

    def __getitem__(self, idx):
        img_id, label = self.label_df.iloc[idx]

        img = read_and_preprocess_image(img_id)

        return img, label

## Create datasets and dataloaders
training_data = MyDataset(train_label_csv)
train_loader = DataLoader(training_data, batch_size=batch_size, shuffle=True, num_workers=8)
validation_data = MyDataset(val_label_csv)
val_loader = DataLoader(validation_data, batch_size=batch_size, shuffle=True, num_workers=8)

## Loop through all epochs
for epoch in range(num_epoch):
    running_loss = 0.0
    pred_epoch = []
    label_epoch = []

    for inputs, labels in train_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Make prediction
        outputs = model(inputs)

        _, preds = torch.max(outputs, 1)
        
        # Compute loss
        loss = loss_fn(outputs, labels)
        loss.backward()

        optimizer.step()

        # Accumulate running loss, predictions, and labels
        running_loss += loss.detach() * inputs.size(0)
        pred_epoch.extend(preds.tolist())
        label_epoch.extend(labels.tolist())

    epoch_loss = running_loss / train_size
    epoch_correct = sum(pred_epoch[i] == label_epoch[i] for i in range(len(pred_epoch)))
    epoch_acc = epoch_correct / train_size

That’s not completely right: the linked answer correctly pointed out that, in that code, the shuffled predictions (created by passing the shuffled inputs from the DataLoader through the model) were compared against the unshuffled targets (which were not created by the DataLoader):

train_accuracy = sum(train_preds.argmax(axis=1) == y_train)/len(y_train)

y_train is a global tensor that was never shuffled, while train_preds was created from the shuffled inputs, so the two are misaligned and the answer is correct.
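Here is a tiny illustration of that mismatch (hypothetical tensors, not the code from the linked question): even a perfect model appears to score roughly chance-level accuracy if its shuffled predictions are compared against the unshuffled targets.

import torch
import torch.nn.functional as F

y_train = torch.arange(10) % 2  # unshuffled global targets
perm = torch.randperm(10)       # the order a shuffling DataLoader happened to use
# Pretend the model is perfect: its predictions match the shuffled targets exactly
train_preds = F.one_hot(y_train[perm], num_classes=2).float()

# Correct pairing: compare against the targets in the same shuffled order
print((train_preds.argmax(dim=1) == y_train[perm]).float().mean())  # tensor(1.)

# Broken pairing, as in the linked snippet: compare against the unshuffled y_train
print((train_preds.argmax(dim=1) == y_train).float().mean())        # ~tensor(0.5)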

Now to your questions:

  1. The DataLoader will shuffle both, assuming the data and target are created in Dataset.__getitem__ via the passed index (which is the default use case); a short sketch demonstrating this follows the list.

  2. Your code looks correct, assuming train_size is the length of the Dataset (i.e. len(training_data)).
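To make point 1 concrete, here is a minimal self-contained sketch (a toy dataset, not your MyDataset) in which each target equals its sample's value. However the DataLoader shuffles, the pairs stay matched, because both tensors are indexed with the same shuffled idx inside __getitem__:

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, n=16):
        self.data = torch.arange(n).float()
        self.targets = torch.arange(n)  # target i belongs to sample i

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Data and target are fetched with the same (possibly shuffled) index,
        # so each pair is always created together
        return self.data[idx], self.targets[idx]

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)
for x, y in loader:
    assert torch.equal(x.long(), y)  # pairs remain aligned in every batch

In other words, shuffle=True only permutes the indices handed to __getitem__; anything returned together from __getitem__ travels together.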

Oh, I see. Thank you for the quick response.
I missed the part about y_train that you mentioned.
Thank you so much.