Dataloader iterates through entire dataset

I created a dataset that loads a single data sample at a time on demand (1 sample consists of multiple images), and I have a data loader with a small batch size. When I try to show just the first few batches of my dataset, the loader keeps trying to iterate through my entire dataset instead of pulling out just the few data samples:

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset

class FaceDataset(Dataset):
    def __init__(self):
        df = pd.read_csv("data/positions.csv")
        df["filename"] = df["id"].astype("str") + ".jpg"

        self.filenames = df["filename"].tolist()
        self.targets = torch.FloatTensor(list(zip(df["x"], df["y"])))
        self.head_angle = torch.FloatTensor(df["head_angle"].tolist())

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        sample = {
            "targets": self.targets[idx],
            "head_angle": self.head_angle[idx],
        }

        for img_type in ["face", "face_aligned", "l_eye", "r_eye", "head_pos"]:
            img = Image.open("data/{}/{}".format(img_type, self.filenames[idx]))
            img = torch.from_numpy(np.array(img))
            # img /= 255
            sample[img_type] = img

        return sample

ds = FaceDataset()
data = DataLoader(ds, batch_size=2, shuffle=True, num_workers=2)

for i_batch, sample_batched in enumerate(data):
    print(i_batch, sample_batched)

    if i_batch == 1:
        break

The length of my dataset is about 15k samples, and the loader seems to be trying to load everything instead of just a single batch of 2. Jupyter just freezes before anything gets printed. Am I creating my dataset/loader incorrectly?
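For reference, all I am trying to do is peek at the first couple of batches. A self-contained sketch of that pattern on a toy `TensorDataset` (standing in for my real `FaceDataset`), using `itertools.islice` to stop after two batches:

```python
from itertools import islice

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset: 100 samples of (input, target).
ds = TensorDataset(torch.randn(100, 1), torch.randn(100, 1))
loader = DataLoader(ds, batch_size=2, shuffle=True, num_workers=0)

# islice stops after two batches; the loader only calls __getitem__
# for the indices it actually needs, so the rest is never loaded.
for i_batch, batch in enumerate(islice(loader, 2)):
    print(i_batch, batch)
```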

When I try to iterate over the dataset, I seem to get the correct data back for 1 training sample. The problem only seems to come in when iterating over the data loader object:

for batch in ds:
    print(batch)
    break

EDIT: I think I've narrowed this down to the num_workers param. When I set it to 0, the whole thing works as expected, but when I set it to anything > 0, nothing gets printed.
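With `num_workers > 0` the DataLoader spawns worker processes, and on platforms that use the "spawn" start method (Windows, and some notebook setups) each worker re-imports the main module, so loader code at module top level can deadlock. A hedged sketch of the usual workaround, moving everything behind a main guard (whether the guard is required is platform-dependent; on Linux with the default "fork" start method it is usually unnecessary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Toy dataset in place of the real one.
    ds = TensorDataset(torch.randn(100, 1), torch.randn(100, 1))
    loader = DataLoader(ds, batch_size=2, shuffle=True, num_workers=2)
    for i_batch, batch in enumerate(loader):
        print(i_batch, batch)
        if i_batch == 1:
            break


if __name__ == "__main__":
    # Under "spawn", worker processes re-execute this module's top level;
    # the guard keeps them from re-running the loader loop themselves.
    main()
```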

Your small code snippet runs correctly on my machine:

dataset = TensorDataset(torch.randn(100, 1), torch.randn(100, 1))
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2)

for batch_idx, batch in enumerate(loader):
    print(batch_idx, batch)
    if batch_idx == 1:
        break

> 0 [tensor([[ 0.7712],
>            [-0.3303]]), tensor([[0.0505], ...
> 1 [tensor([[-0.5104],
>            [-0.5665]]), tensor([[-0.4806], ...

Is the DataLoader with multiple workers working at all in your current setup and is only this small loop creating issues?
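One way to check whether the workers are coming up at all is a `worker_init_fn` that reports its worker id (a sketch; note that prints from worker processes can be swallowed by some Jupyter frontends, which is itself a known symptom of this class of problem, so running it as a plain script is more reliable):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, get_worker_info


def report_worker(worker_id):
    # Runs once in each worker process, right after it starts.
    info = get_worker_info()
    print("worker {} started, seed={}".format(worker_id, info.seed))


dataset = TensorDataset(torch.randn(100, 1), torch.randn(100, 1))
loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=report_worker)

# Each worker should print a line before the first batch arrives;
# if the call hangs instead, the workers never started properly.
next(iter(loader))
```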