Hi,
I’m using PyTorch 0.4 on Windows 10, and until now I always used num_workers=0 in my DataLoader. In the project I’m currently working on, data loading is pretty heavy, so I tried setting num_workers>1 (say 4).
It works great, except that at the beginning of every training epoch, or when I switch from train to test, the GPU goes idle for some time and then resumes. It usually takes about a minute.
I’m not an expert on multiprocessing, but since it only happens at the beginning of a train/test pass, I suspect it has something to do with re-creating all the workers.
Has anyone ever encountered such behavior? What can I do to solve this?
A minute doesn’t sound like a long time, but my epochs are short (roughly 5 minutes), so a minute of waiting per epoch increases epoch time by ~20% …
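For context, here is a stdlib-only sketch (with a hypothetical measure_worker_startup helper) of the kind of per-epoch cost I suspect: each time a DataLoader iterator is created with num_workers>0, fresh worker processes are spawned, and on Windows multiprocessing has to spawn (not fork) them, which is comparatively slow.

```python
import multiprocessing as mp
import time

def _noop():
    # stand-in for a DataLoader worker's setup work
    pass

def measure_worker_startup(num_workers=4):
    """Time how long it takes just to start and join num_workers processes."""
    start = time.perf_counter()
    procs = [mp.Process(target=_noop) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # a DataLoader with num_workers=4 pays a similar cost every time a new
    # iterator is created, i.e. at the start of every epoch
    print("worker startup took %.3f s" % measure_worker_startup(4))
```

On Windows this startup is typically much slower than on Linux, since every spawned worker re-imports the main module.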
Usually the first iteration takes a bit more time, because every worker first has to load a full batch.
Do you have a lot of pre-processing or very large samples?
Hi, thanks for responding.
Actually, I have almost no pre-processing, and my samples are audio clips of approximately 1 second, i.e. 1×8000 tensors, so they are not large in any way.
For some reason I believe the set-up of all the workers occurs every time I re-access a DataLoader (be it train or test). I’m guessing that’s not normal behavior.
Would it help if I post my dataset code?
class AudioDataset(Dataset):
    """Dataset class for loading audio files."""

    def __init__(self, DataDir, timeDepth):
        self.DataDir = DataDir
        self.timeDepth = timeDepth
        self.Audio_Sample_Length = 319
        self.sampling_rate = 8000
        self.GlobalFrameRate = 25.2462
        self.audio_duration = 122
        self.quantization_channels = 256
        self.audio_len_in_frames = math.floor(self.audio_duration * self.sampling_rate / self.Audio_Sample_Length)
        self.normalize = True

        # create short audio samples from long wav files (only runs once)
        raw_audio = [x for x in os.listdir(self.DataDir) if ".wav" in x]
        labels_stack = []
        labels_pkl_file = os.path.join(self.DataDir, "labels.pkl")
        if not os.path.exists(labels_pkl_file):
            piece_num = 0
            for i, f in enumerate(raw_audio):
                print("processing audio file %s" % f)
                full_path = os.path.join(self.DataDir, f)
                audio, Fs = librosa.load(full_path, sr=self.sampling_rate, duration=self.audio_duration)
                if self.normalize:
                    audio = librosa.util.normalize(audio)
                labels = get_ground_truth("Speaker%d" % (i + 1), self.GlobalFrameRate, self.audio_duration)
                for end_frame in range(self.timeDepth, self.audio_len_in_frames):
                    audio_seq = audio[(end_frame - self.timeDepth) * self.Audio_Sample_Length:end_frame * self.Audio_Sample_Length]
                    if self.normalize:
                        audio_seq = librosa.util.normalize(audio_seq)
                    audio_path = self.DataDir + 'samples/sample_' + str(piece_num) + '.wav'
                    librosa.output.write_wav(audio_path, audio_seq, sr=self.sampling_rate)
                    piece_num += 1
                    label = torch.LongTensor(1)
                    label[0] = int(labels[end_frame])
                    labels_stack.append(label)
            with open(labels_pkl_file, 'wb') as labels_dump:
                pickle.dump(labels_stack, labels_dump)
        with open(labels_pkl_file, 'rb') as f:
            self.audio_labels = pickle.load(f)

    def __len__(self):
        return len(self.audio_labels)

    def __getitem__(self, idx):
        audio_path = self.DataDir + 'samples/sample_' + str(idx) + '.wav'
        audio, Fs = librosa.load(audio_path, sr=self.sampling_rate)
        sample = torch.from_numpy(audio).type(torch.FloatTensor)
        return sample, self.audio_labels[idx].type(torch.FloatTensor)
During __init__ I break some long wav files into short wav files (this only happens once), which I later load in the __getitem__ method. The rest is pretty trivial, I think…
Skimming through your code I couldn’t find any obvious errors.
You mentioned that the delay peaks when changing from training to eval.
Is it constant without this change? Could you check that?
The delay occurs at the beginning of every epoch of the training loop posted above, and at the beginning of a test pass (very similar to the training code above, just iterating over a test DataLoader).
Just to be clear: even if I don’t switch from train to eval at all and just run the training loop, I still wait at the beginning of every training epoch.
Otherwise, during an epoch the GPU is constantly loaded and working at a constant pace.
Hi,
this issue is still occurring, but I found a workaround for now and I’m posting it here in case it’s useful to someone else:
apparently librosa.load() is quite slow, and when I switched to scipy.io.wavfile.read() I got more than a 10x speedup! Since the load operation took up ~90% of the __getitem__ method, I can now run with num_workers=0 and the GPU stays fully loaded the whole time.
However, I still find the behavior mentioned earlier strange…
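For completeness, the swap looks roughly like this (a sketch with a dummy file; note that scipy.io.wavfile.read returns raw int16 PCM and does no resampling, so the data has to be scaled to match the float output of librosa.load, and the files must already be at the target sample rate):

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

sr = 8000
# write a 1-second dummy sample, as the dataset's __init__ would
tone = (np.sin(2 * np.pi * 440 * np.arange(sr) / sr) * 32767).astype(np.int16)
path = os.path.join(tempfile.mkdtemp(), "sample_0.wav")
wavfile.write(path, sr, tone)

# fast replacement for librosa.load(path, sr=8000) in __getitem__:
rate, data = wavfile.read(path)
audio = data.astype(np.float32) / 32768.0  # int16 PCM -> float32 in [-1, 1)
```

Since the short samples are written out at 8000 Hz in __init__, skipping librosa’s resampling path is safe here.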
I notice exactly the same behaviour and cannot speed up my __getitem__ method any further. Time-wise it almost seems as if the DataLoader called its __init__ method again each epoch, which clearly wouldn’t make any sense though. Is there any solution to this problem?
The __init__ method of the Dataset should not be called again.
However, the data stored in the Dataset will be copied to all processes.
Are you pre-loading your data in the __init__? If so, are you seeing increased memory usage using multiple workers?
Would lazy loading not work in your case?
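In case it helps, lazy loading here means roughly the following (a minimal sketch with a hypothetical LazyDataset using .npy files; a real version would subclass torch.utils.data.Dataset): __init__ stores only file paths, so almost nothing is copied when the dataset is handed to the worker processes, and each sample is read from disk on demand.

```python
import os
import tempfile

import numpy as np

class LazyDataset:
    """Holds only file paths; the actual data is read per sample in __getitem__."""

    def __init__(self, data_dir):
        # cheap: just a list of paths, not gigabytes of arrays
        self.paths = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.endswith(".npy")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # the expensive part happens here, inside the worker process
        return np.load(self.paths[idx])

# tiny usage example with dummy data
data_dir = tempfile.mkdtemp()
for i in range(3):
    np.save(os.path.join(data_dir, "sample_%d.npy" % i), np.full(4, i, dtype=np.float32))
ds = LazyDataset(data_dir)
```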
Thanks for the reply! You were right: I loaded the data in __init__ in order to keep it in memory. Since the dataset is quite large (approximately 12 GB), the problem you mentioned occurred. I tried lazy loading, and it works and mostly solves the problem, thank you for that! Still, isn’t copying the data to all processes weird, given that it is already in RAM in the first place? Shouldn’t the DataLoader just hand indices to the subprocesses, which then fetch their samples from the data loaded in __init__? Because otherwise I would load the same 12 GB dataset into RAM 8 times, right?
Good to hear it was this issue!
Well, it might sound a bit weird, but it’s the default mechanism using multiprocessing.
In case you are lazily loading your data it’s not a big deal.
However, if you preload the data, this might cause an issue.
Generally, I’m not sure you’d get a big performance improvement from multiple workers if the data is already loaded.
That being said, you could use shared memory, which would do exactly as you’ve suggested.
I’ve created a small example yesterday for exactly this use case here. Maybe you could use it as well?
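The idea from that example can be sketched minimally as follows (illustrative names, a hypothetical to_shared_array helper): allocate the shared buffer from an element count rather than passing the whole array as the initializer, so multiprocessing never has to iterate over a huge Python sequence, then wrap the buffer with NumPy and copy in bulk.

```python
import ctypes
import multiprocessing as mp

import numpy as np

def to_shared_array(arr):
    """Copy a float32 array into shared memory and return a NumPy view on it."""
    # allocate by element count; lock=False returns a raw (unsynchronized) array
    shared_base = mp.Array(ctypes.c_float, arr.size, lock=False)
    # zero-copy NumPy view onto the shared buffer, restored to the original shape
    shared = np.frombuffer(shared_base, dtype=np.float32).reshape(arr.shape)
    shared[:] = arr  # one bulk copy into shared memory
    return shared

data = np.arange(12, dtype=np.float32).reshape(3, 4)
shared_data = to_shared_array(data)
```

Worker processes that inherit shared_data then read the same underlying buffer instead of receiving a private copy.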
Thank you! The main reason I do this is that the data is not stored as individual images but as several arrays. Additionally, I want to globally normalize my data, and since the dataset might change in the future I want to do this directly before starting the training. The reason I need more than one worker is that I have to perform some quite heavy online data augmentation, and one worker cannot feed the GPU fast enough. Now (after your hint) I load the data, normalize it, save each image as a separate file, delete the array, and then lazily load each image while iterating. This is kinda stupid but nevertheless super fast. Problem solved! But since I have quite a substantial amount of memory available (256 GB), using it seems to be the smartest option. Using your code I wrote the following function to make an existing array a shared array, but running it raises:
Traceback (most recent call last):
File "train.py", line 182, in <module>
main()
File "train.py", line 73, in main
filter_size=opt.filter_size, augment=opt.augment, marginfactor=opt.marginfactor)
File "C:\Users\TestUser\Documents\eliaseulig\DeepDSA\auxiliaries.py", line 228, in __init__
self.stacks = make_mp_array(self.stacks)
File "C:\Users\TestUser\Documents\eliaseulig\DeepDSA\auxiliaries.py", line 174, in make_mp_array
shared_array_base = mp.Array(ctypes.c_float,array)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\context.py", line 140, in Array
ctx=self.get_context())
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\sharedctypes.py", line 87, in Array
obj = RawArray(typecode_or_type, size_or_initializer)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\sharedctypes.py", line 64, in RawArray
type_ = type_ * len(size_or_initializer)
OverflowError: The '_length_' attribute is too large
which might be due to the size of the array I try to share (2130, 1024, 1024). Is this the problem, even though my RAM is definitely large enough?
Of course I could also set up part of my RAM as a regular volume, save the images there, and get approximately the same loading speed as if the data were actually in RAM, but a solution with a shared array would be much more elegant. Thank you very much for your great help!
I’m not sure how Windows handles shared memory, but could it be that it has some limitations?
I couldn’t find anything useful regarding Windows and shared memory.
There was recently a bug fix for negative sizes in cpython (issue), but I think it’s unrelated to your issue.
I also tried to use .share_memory_() on the torch tensors of my data (as suggested by @apaszke and @trypag) and then access them in the __getitem__ method. This works and is very fast when the dataset is small, but when I increase the dataset size, at some point I get the following error, which I again can’t understand. I still have plenty of RAM free, so that shouldn’t be the issue.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't map view of shared file <torch_12164_1041449738>, error code: <87> at ..\src\TH\THAllocator.c:226
Could it be that nearly all my problems are Windows-related and I should just switch to Linux at some point?