Dataloader with num_workers>1 hangs every epoch

I’m using pytorch 0.4 on windows 10 and until now I always used num_workers=0 in my dataloader. in the project I’m currently working on data loading is pretty heavy so I tried setting num_workers>1 (say 4).
it work’s great, except that at the beginning of every training epoch, or when I switch from train to test, the GPU gets idle for some time and then resumes. It usually takes about a minute.
I’m not an expert on multiprocessing with since it only happens at the beginning of a train\test operation I suspect it has something to do with re-creating all the workers.
has anyone ever encountered such behavior? what can I do to solve this?
it doesn’t sound like a long time but my epochs are short (roughly 5 minutes) so a minute of wait for every epoch increases epoch time by ~20% …


Usually the first iteration takes a bit more time, because all workers have to load a batch.
Do you have a lot of pre-processing or very large samples?

Hi, thanks for responding.
actually i have almost no pre-processing and my samples are audio files with duration of aprrox 1 second so it’s a 1X8000 tensor, not large in any way.
for some reason i believe the set-up of all the workers occurs every time i re-access a dataloader (be it train or test). i’m guessing it’s not a normal behavior.
would it help if i post my dataset code?

Sure, your Dataset code and the training loop would be interesting.

class AudioDataset(Dataset):

    """Dataset Class for Loading audio files"""

    def __init__(self, DataDir, timeDepth):

        self.DataDir = DataDir
        self.timeDepth = timeDepth
        self.Audio_Sample_Length = 319
        self.sampling_rate = 8000
        self.GlobalFrameRate = 25.2462
        self.audio_duration = 122
        self.quantization_channels = 256
        self.audio_len_in_frames = math.floor(self.audio_duration*self.sampling_rate/self.Audio_Sample_Length)
        self.normalize = True

        # create short audio samples from long wav files

        raw_audio = [x for x in os.listdir(self.DataDir) if ".wav" in x]

        labels_stack = []

        labels_pkl_file = os.path.join(self.DataDir, "labels.pkl")

        if not os.path.exists(labels_pkl_file):

            piece_num = 0
            for i ,f in enumerate(raw_audio):
                print("processing audio file %s" % f)
                full_path = os.path.join(self.DataDir,f)
                audio, Fs = librosa.load(full_path,sr=self.sampling_rate,duration=self.audio_duration)
                if self.normalize:
                    audio = librosa.util.normalize(audio)

                labels = get_ground_truth("Speaker%d" % (i + 1), self.GlobalFrameRate, self.audio_duration)

                for end_frame in range(self.timeDepth,self.audio_len_in_frames):
                    audio_seq = audio[(end_frame - self.timeDepth) * self.Audio_Sample_Length:end_frame * self.Audio_Sample_Length]
                    if self.normalize:
                        audio_seq = librosa.util.normalize(audio_seq)
                    audio_path = self.DataDir + 'samples/sample_' + str(piece_num) + '.wav'
                    librosa.output.write_wav(audio_path, audio_seq, sr=self.sampling_rate)
                    piece_num += 1

                    label = torch.LongTensor(1)
                    label[0] = int(labels[end_frame])

            labels_dump = open(labels_pkl_file, 'wb')
            pickle.dump(labels_stack, labels_dump)

        labels = open(labels_pkl_file, 'rb')
        labels = pickle.load(labels)

        self.audio_labels = labels

    def __len__(self):

        return len(self.audio_labels)

    def __getitem__(self, idx):

        audio_path = audio_path = self.DataDir + 'samples/sample_' + str(idx) + '.wav'
        audio, Fs = librosa.load(audio_path,sr=self.sampling_rate)
        sample = torch.from_numpy(audio).type(torch.FloatTensor)
        return sample, self.audio_labels[idx].type(torch.FloatTensor)

during init I break some long wav files into short wav files (only happens once) which I later load in the get_item() method. the rest is pretty trivial i think…

and my main training loop:

    for epoch in range(args.num_epochs):

        # train for one epoch
        batch_time = AverageMeter()
        data_time = AverageMeter()
        accuracy = 0

        states = (Variable(torch.zeros(args.lstm_layers, args.batch_size, args.lstm_hidden_size)).cuda(),
                  Variable(torch.zeros(args.lstm_layers, args.batch_size, args.lstm_hidden_size)).cuda())

        for i, data in enumerate(train_loader):

            states = utils.detach(states)

            audio, target = data  # input is of shape torch.Size([batch, channels, frames, width, height])

            audio_var = utils.to_var(audio.unsqueeze(1))
            target_var = utils.to_var(target.squeeze())

            output, states = net(audio_var, states)
            loss = criterion(output.squeeze(), target_var)

            # measure accuracy and record loss
            predicted = torch.ceil(,1)
            accuracy = (predicted.type(torch.LongTensor) == target.cuda().type(torch.LongTensor)).sum()

            prec1 = (100.0 * accuracy / args.batch_size)
            losses.update(loss.item(), args.batch_size)
            top1.update(prec1.item(), args.batch_size)

            # compute gradient and do SGD step
            torchutils.clip_grad_norm_(net.parameters(), 0.5)

            # tensorboard logging
            logger.scalar_summary('loss', loss.item(), step + 1)
            logger.scalar_summary('train accuracy', prec1.item(), step + 1)

            if i % args.print_freq == 0:
                print('Epoch: [{0}][{1}/{2}] , \t'
                      'LR {3} , \t'
                      'Time {batch_time.avg:.3f} , \t'
                      'Data {data_time.avg:.3f} , \t'
                      'Loss {loss.avg:.4f} , \t'
                      'Acc {top1.avg:.3f}'.format(
                    epoch, i, len(train_loader), optimizer.param_groups[0]['lr'], batch_time=batch_time,
                    data_time=data_time, loss=losses, top1=top1))

Skimming through your code I couldn’t find any obvious errors.
You mentioned that the time peaks when changing from training to eval.
Is it constant without this change? Could you check that?

the “idle time” occurs in two scenarios:

  1. at the beginning of every iteration of the training loop posted above.
  2. at the beginning of a test operation (very similar to the training code above, just iterating on a test dataloader)

just to be clear - even if i don’t switch from train to eval at all and just run the training loop, i still wait at the beginning of every training epoch

otherwise, during an epoch the GPU is constantly loaded and working at a constant pace.


here you can see the GPU’s idle time (clocks, load, memory) between two iterations.

Could you time the librosa loading? The DataLoader isn’t re-creating the Dataset, so it would come down to the __getitem__ method.

of course.
I timed it and it usually takes 0.08-0.1 seconds.
the entire __getitem__ method takes roughly the same.

my dataset consists of 10K samples, I’m working with batch_size of 16 and num_workers = 8 (happens for other numbers of workers except 0)

btw, I really appreciate you trying to help me with this. it’s not a show stopper, just a little annoying :blush:

this issue is still occurring but I found a workaround for now and I’m posting it here in case it will be useful to someone else -
apparently, librosa.load() is quite slow, and when I switched to I got more than 10X speedup! since the load operation was 90% of the __getitem__ method, I can now run with num_workers = 0 and the GPU is fully loaded all the time.

however, I still find the behavior mentioned earlier to be strange…

1 Like

I notice exactly the same behaviour and cannot speed up my __getitem__ method anymore. Timewise it almost seems like the dataloader would call its __init__ method again each epoch which clearly wouldn’t make any sense though. Is there any solution to this problem?

The __init__ method of Dataset should not be called.
However, the data stored in the Dataset will be copied to all processes.
Are you pre-loading your data in the __init__? If so, are you seeing increased memory usage using multiple workers?
Would lazy loading not work in your case?

Thanks for the reply! You were right, I loaded the data in the __init__ in order to keep it in memory. Since the dataset is quite large (approximately 12gb) the problem you mentioned occurred. I tried lazy loading and it does work and kinda solves the problem, thank you for that! Still, isn’t copying the data to all processes weird as it is already in the ram in the first place? Shouldn’t the dataloader just give markers to the subprocesses which then get their samples from the data loaded with __init__ ? Because otherwise I would load 8 times the same 12gb dataset into the ram, right?

Good to hear it was this issue!
Well, it might sound a bit weird, but it’s the default mechanism using multiprocessing.
In case you are lazily loading your data it’s not a big deal.

However, if you preload the data, this might cause an issue.
Generally I’m not sure, if you get a big performance improvement using multiple workers, if the data is already loaded.

That being said, you could use shared memory, which would do exactly as you’ve suggested.
I’ve created a small example yesterday for exactly this use case here. Maybe you could use it as well?

Thank you! I The main reason I do this is because the data is not stored as individual images but as several arrays. Additionally I want to globally normalize my data and since the dataset might change in the future I want to do this directly before starting the training. The reason why I need more than one worker is because I need to perform some quite heavy online data augmentation and one worker cannot feed the gpu fast enough. Now (after your hint) I am loading the data, normalize it, save each image as separate file, delete the array and then lazily load each image when iterating. This is kinda stupid but nevertheless super fast. Problem solved! But since I have quite a substantial amount of memory available (256gb) using it seems to be the smartest option. Using your code I wrote the following function to make an existing array a shared array

def make_mp_array(array):
    array_shape = array.shape
    array.shape = 1*array.size
    shared_array_base = mp.Array(ctypes.c_float,array)
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())

    return shared_array.reshape(array_shape)

Unfortunately I get the following error

Traceback (most recent call last):
  File "", line 182, in <module>
  File "", line 73, in main
    filter_size=opt.filter_size, augment=opt.augment, marginfactor=opt.marginfactor)
  File "C:\Users\TestUser\Documents\eliaseulig\DeepDSA\", line 228, in __init__
    self.stacks = make_mp_array(self.stacks)
  File "C:\Users\TestUser\Documents\eliaseulig\DeepDSA\", line 174, in make_mp_array
    shared_array_base = mp.Array(ctypes.c_float,array)
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\", line 140, in Array
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\", line 87, in Array
    obj = RawArray(typecode_or_type, size_or_initializer)
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\", line 64, in RawArray
    type_ = type_ * len(size_or_initializer)
OverflowError: The '_length_' attribute is too large

which might be due to the size of the array I try to process (2130,1024,1024). Is this the problem although my ram is definitely sufficiently large?
Of course I could also setup my ram as a regular volume, save the images there and have approximately the same loading speed as if it was actually on ram, but a solution with a shared array would be much more beautiful. Thank you very much for your great help!

I’m not sure, how Windows handles shared memory, but could it be that it has some limitations?
I couldn’t find anything useful regarding Windows and shared memory.

There was recently a bug fix for negative sizes in cpython (issue), but I think it’s unrelated to your issue.

I also tried to use .share_memory_() on the torch tensors of my data (as suggested by @apaszke and @trypag) and then access themin the getitem method. This works and is very fast when the dataset is small but when increasing the dataset size at one point I get the following error which I -again- can’t understand. I still have plenty of ram free so this shouldn’t be the issue.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\", line 106, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\", line 116, in _main
    self = pickle.load(from_parent)
  File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\multiprocessing\", line 86, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't map view of shared file <torch_12164_1041449738>, error code: <87> at ..\src\TH\THAllocator.c:226

Could it be that nearly all my problems are windows related and I should just switch to linux at one point?

Thank you so much, num_workers=0 works

1 Like