Errors when num_workers is set to a value bigger than 0 in torch.utils.data.DataLoader

I have a dataset whose labels range from 0 to 39, and I wrap it with torch.utils.data.DataLoader. If I set num_workers to 0, everything works fine. However, if it is set to 2, then at some epoch the loaded labels batch (a 1-D byte tensor) contains values bigger than 39, usually 255. What causes this problem? Any help? (P.S. my dataset is an .h5 file.)


Maybe your CPU only has one worker?

Hi, what do you mean by one worker? I used to run another project on the same machine with 2 workers, and it was fine. By the way, this code also works fine with 2 workers until some random epoch of training, when it outputs labels with value 255 and stops my training. I guess the code here may cause the problem. Any idea?

Below is my code.

from __future__ import print_function
import os
import os.path
import errno
import json
import sys

import numpy as np
import h5py
import torch
import torch.utils.data as data

from IPython.core.debugger import Tracer
debug_here = Tracer()


class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size = 224, train=True):
        self.image_size = image_size
        self.data_dir = data_dir
        self.train = train 

        file_path = os.path.join(self.data_dir, 'modelnet40.h5')
        # open the HDF5 file read-only; this single handle is later shared by the workers
        self.modelnet40_data = h5py.File(file_path, 'r')
        
        if self.train: 
            self.train_data = self.modelnet40_data['train']['data']
            self.train_labels = self.modelnet40_data['train']['label']
        else:
            self.test_data = self.modelnet40_data['test']['data']
            self.test_labels = self.modelnet40_data['test']['label']

    def __getitem__(self, index):
        if self.train:
            shape_12v, label = self.train_data[index], self.train_labels[index] 
        else:
            shape_12v, label = self.test_data[index], self.test_labels[index]
        return shape_12v, label 


    def __len__(self):
        if self.train:
            return self.train_data.shape[0]
        else:
            return self.test_data.shape[0]

if __name__ == '__main__':
    print('test')
    train_dataset = Modelnet40_V12_Dataset(data_dir='path/data', train=True)
    print(len(train_dataset))

    test_dataset = Modelnet40_V12_Dataset(data_dir='path/data', train=False)
    print(len(test_dataset))
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8,
                                     shuffle=True, num_workers=2)


    total = 0 
    # debug_here() 
    # check when to cause labels error 
    for epoch in range(200):
        print('epoch', epoch)
        for i, (input_v, labels) in enumerate(train_loader):
            total = total + labels.size(0)

            # labels can be 255, what is the problem??
            if labels.max() > 40: 
                debug_here() 
                print('error')

            if labels.min() < 1:
                debug_here()  
                print('error')
            
            labels.sub_(1) # minus 1 in place 
            
            if labels.max() >= 40: 
                debug_here() 
                print('error')

            if labels.min() < 0:
                debug_here()  
                print('error')
        print(total)

Can someone give some help?

If your data is a numpy array, you can try something like this:

self.train_data = torch.from_numpy(self.modelnet40_data['train']['data'].value)
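For context, a minimal sketch of how that suggestion could fit into the dataset's __init__, loading everything into memory up front (the names self.data and self.labels are just mine; note that .value is deprecated in newer h5py versions, where dset[()] reads the whole dataset instead):

# reuses the imports from the code above (os, h5py, torch, torch.utils.data as data)
class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size=224, train=True):
        self.image_size = image_size
        self.train = train

        file_path = os.path.join(data_dir, 'modelnet40.h5')
        split = 'train' if train else 'test'
        with h5py.File(file_path, 'r') as f:
            # copy everything into plain in-memory tensors once, so no
            # HDF5 handle is shared with the DataLoader workers
            self.data = torch.from_numpy(f[split]['data'][()])
            self.labels = torch.from_numpy(f[split]['label'][()]).long()

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.data.shape[0]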

Yes, my data is indeed a numpy array. I will try it and see if it works.

Thank you very much, it solves my problem.

Hi, this only partly solves my problem, because this method loads all the data into memory. When the dataset is big (an .h5 file), it is impractical to load everything into memory, isn't it? And the problem still exists.

Yes, this method loads all the data into memory. If the data is large, I guess you can do it this way (I haven't tried this):

def __getitem__(self, index):
    if self.train:
        shape_12v = self.modelnet40_data['train']['data'][index]
        label = self.modelnet40_data['train']['label'][index]

I don't know if it works; you can tell me if you try it.

Hi, I tried this one, but it still doesn't work. I suspect this issue is related to multi-worker synchronization issues in the DataLoader class.
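One thing I am considering (just a sketch of an idea, not verified on this dataset; names like self.h5 and self.length are only placeholders) is to avoid sharing one h5py handle across workers by opening the file lazily inside each worker process:

class Modelnet40_V12_Dataset(data.Dataset):
    def __init__(self, data_dir, image_size=224, train=True):
        self.image_size = image_size
        self.file_path = os.path.join(data_dir, 'modelnet40.h5')
        self.split = 'train' if train else 'test'
        self.h5 = None  # opened lazily, once per worker process
        # read the length once and close the handle again
        with h5py.File(self.file_path, 'r') as f:
            self.length = f[self.split]['data'].shape[0]

    def __getitem__(self, index):
        if self.h5 is None:
            # each worker opens its own handle the first time it is used
            self.h5 = h5py.File(self.file_path, 'r')
        shape_12v = self.h5[self.split]['data'][index]
        label = int(self.h5[self.split]['label'][index])
        return shape_12v, label

    def __len__(self):
        return self.length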

Sorry, I don't know how to solve this one.

I have been seeing similar problems with DataLoader when num_workers is greater than 1. My per-sample label is a [1, 0, …, 0] array. When loading a batch of samples, most of the labels are OK, but I can get something like [70, 250, …, 90] in one row. This problem does not exist when num_workers=1.

Any solution or suggestions?

I have also met similar problems. Can anyone figure out how to solve it? Thanks a lot!

This is always the case if you are using Windows (on my computer, at least).
Try running it from the command line, not from Jupyter.

Thanks, but I use PyTorch on Linux (Arch Linux), and the version of PyTorch is 0.2-post2.

From the command line or from Jupyter/PyCharm?

Hi @QuantScientist, I run my code from the command line.

Can you share your full source code so that I can try it and see that it works on my system?

I have the same problem! How did you solve it?
Besides, there seem to be few answers about this.

This might be related to those two issues. What version of PyTorch are you using? Perhaps updating to 0.4.1 might help.