Errors when num_workers is set to a value bigger than 0 in torch.utils.data.DataLoader

I have a dataset whose labels range from 0 to 39, and I wrap it with torch.utils.data.DataLoader. If I set num_workers to 0, everything works fine. However, if it is set to 2, then at some epoch the loaded labels batch (a 1-D byte tensor) contains values bigger than 39, usually 255. What causes this problem? Any help? (P.S. my dataset is an .h5 file.)


Maybe your CPU only has one worker?

Hi, what do you mean by one worker? I used to run another project on the same machine with 2 workers, and it was fine. By the way, this code also works fine with 2 workers until some random epoch of training, when it outputs labels with value 255 and stops my training. I guess the code here may cause the problem. Any idea?

Below is my code.

from __future__ import print_function
import os
import os.path
import errno
import json
import sys

import numpy as np
import h5py
import torch
import torch.utils.data as data

from IPython.core.debugger import Tracer
debug_here = Tracer()


class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size = 224, train=True):
        self.image_size = image_size
        self.data_dir = data_dir
        self.train = train 

        file_path = os.path.join(self.data_dir, 'modelnet40.h5')
        # open the HDF5 file read-only; this single handle is later shared by the workers
        self.modelnet40_data = h5py.File(file_path, 'r')
        
        if self.train: 
            self.train_data = self.modelnet40_data['train']['data']
            self.train_labels = self.modelnet40_data['train']['label']
        else:
            self.test_data = self.modelnet40_data['test']['data']
            self.test_labels = self.modelnet40_data['test']['label']

    def __getitem__(self, index):
        if self.train:
            shape_12v, label = self.train_data[index], self.train_labels[index] 
        else:
            shape_12v, label = self.test_data[index], self.test_labels[index]
        return shape_12v, label 


    def __len__(self):
        if self.train:
            return self.train_data.shape[0]
        else:
            return self.test_data.shape[0]

if __name__ == '__main__':
    print('test')
    train_dataset = Modelnet40_V12_Dataset(data_dir='path/data', train=True)
    print(len(train_dataset))

    test_dataset = Modelnet40_V12_Dataset(data_dir='path/data', train=False)
    print(len(test_dataset))
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8,
                                     shuffle=True, num_workers=2)


    total = 0 
    # debug_here() 
    # check when to cause labels error 
    for epoch in range(200):
        print('epoch', epoch)
        for i, (input_v, labels) in enumerate(train_loader):
            total = total + labels.size(0)

            # labels can be 255, what is the problem??
            if labels.max() > 40: 
                debug_here() 
                print('error')

            if labels.min() < 1:
                debug_here()  
                print('error')
            
            labels.sub_(1) # minus 1 in place 
            
            if labels.max() >= 40: 
                debug_here() 
                print('error')

            if labels.min() < 0:
                debug_here()  
                print('error')
        print(total)

Can someone give some help?

If your data is a numpy array, you can try something like this:

self.train_data = torch.from_numpy(self.modelnet40_data['train']['data'].value)
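For context, a minimal sketch of how that suggestion could fit into the dataset's __init__, loading everything into memory up front (the names self.data and self.labels are just mine; note that .value is deprecated in newer h5py versions, where dset[()] reads the whole dataset instead):

# reuses the imports from the code above (os, h5py, torch, torch.utils.data as data)
class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size=224, train=True):
        self.image_size = image_size
        self.train = train

        file_path = os.path.join(data_dir, 'modelnet40.h5')
        split = 'train' if train else 'test'
        with h5py.File(file_path, 'r') as f:
            # copy everything into plain in-memory tensors once, so no
            # HDF5 handle is shared with the DataLoader workers
            self.data = torch.from_numpy(f[split]['data'][()])
            self.labels = torch.from_numpy(f[split]['label'][()]).long()

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.data.shape[0]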

Yes, my data is indeed a numpy array. I will try it and see if it works.

Thank you very much, it solves my problem.

Hi, this only partly solves my problem, because this method loads all the data into memory. When the dataset is big (an .h5 file), it is impractical to load everything into memory, isn't it? And the problem still exists.

Yes, this method loads all the data into memory. If the data is large, I guess you can do it this way (I haven't tried this):

def __getitem__(self, index):
    if self.train:
        shape_12v = self.modelnet40_data['train']['data'][index]
        label = self.modelnet40_data['train']['label'][index]

I don't know if it works; you can tell me if you try it.

Hi, I tried this one, but it still doesn't work. I suspect this issue is related to multi-worker synchronization issues in the DataLoader class.
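One thing I am considering (just a sketch of an idea, not verified on this dataset; names like self.h5 and self.length are only placeholders) is to avoid sharing one h5py handle across workers by opening the file lazily inside each worker process:

class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size=224, train=True):
        self.image_size = image_size
        self.file_path = os.path.join(data_dir, 'modelnet40.h5')
        self.split = 'train' if train else 'test'
        self.h5 = None  # opened lazily, once per worker process
        # read the length once and close the handle again
        with h5py.File(self.file_path, 'r') as f:
            self.length = f[self.split]['data'].shape[0]

    def __getitem__(self, index):
        if self.h5 is None:
            # each worker opens its own handle the first time it is used
            self.h5 = h5py.File(self.file_path, 'r')
        shape_12v = self.h5[self.split]['data'][index]
        label = int(self.h5[self.split]['label'][index])
        return shape_12v, label

    def __len__(self):
        return self.length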

Sorry, I don't know how to solve this one.

I have been seeing similar problems with DataLoader when num_workers is greater than 1. My per-sample label is a [1, 0, …, 0] array. When loading a batch of samples, most of the labels are OK, but I can get something like [70, 250, …, 90] in one row. This problem does not exist when num_workers=1.

Any solution or suggestions?

I have also met similar problems. Can anyone figure out how to solve it? Thanks a lot!

This is always the case if you are using Windows (on my computer, at least).
Try running it from the command line, not from Jupyter.

Thanks, but I use PyTorch on Linux (Arch Linux), and the version of PyTorch is 0.2-post2.

From the command line or from Jupyter/PyCharm?

Hi @QuantScientist, I run my code from the command line.

Can you share your full source code so that I can try it and see that it works on my system?

I have the same problem! How did you solve it?
Besides, there seem to be few answers about this.

This might be related to those two issues. What version of PyTorch are you using? Perhaps updating to 0.4.1 might help.