Since I am not able to adjust the shared memory size on the remote server, can we disable shared memory usage in PyTorch? The same experiment runs with TensorFlow without any shm size problem, so I just want to find a solution for this problem.
I’m not a specialist on shared memory but from what I remember, it is only used if you explicitly send tensors across processes. So reducing these would solve your problem. I don’t think we support other ways to transfer tensors in multiprocessing. I guess you could save and load from disk?
I am running a distributed job with DistributedDataParallel, and I don't quite understand what you mean. As far as I know, PyTorch uses shared memory in the DataLoader?
@albanD Yes, I am sure the DataLoader uses shared memory by default when it loads data with multiple worker processes.
This is what I found in the PyTorch source code, torch.utils.data.dataloader.py:
    def _worker_loop(dataset, index_queue, data_queue, done_event, collate_fn, seed, init_fn, worker_id):
        # See NOTE [ Data Loader Multiprocessing Shutdown Logic ] for details on the
        # logic of this function.
        try:
            global _use_shared_memory
            _use_shared_memory = True

            # Initialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
            # module's handlers are executed after Python returns from C low-level
            # handlers, likely when the same fatal signal happened again already.
            # https://docs.python.org/3/library/signal.html Sec. 18.8.1.1
            ...

            if init_fn is not None:
                init_fn(worker_id)

            watchdog = ManagerWatchdog()

            while watchdog.is_alive():
                try:
                    r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
                except queue.Empty:
                    continue
                if r is None:
                    # Received the final signal
                    assert done_event.is_set()
                    return
                elif done_event.is_set():
                    # Done event is set. But I haven't received the final signal
                    # (None) yet. I will keep continuing until get it, and skip the
                    # processing steps.
                    continue
                idx, batch_indices = r
                try:
                    samples = collate_fn([dataset[i] for i in batch_indices])
                except Exception:
                    # It is important that we don't store exc_info in a variable,
                    # see NOTE [ Python Traceback Reference Cycle Problem ]
                    data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
                else:
                    data_queue.put((idx, samples))
                    del samples
        except KeyboardInterrupt:
            # Main process will raise KeyboardInterrupt anyways.
            pass
Does setting the number of workers to 0 for the DataLoader fix the error?
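A minimal sketch of that diagnostic, assuming a small synthetic dataset in place of the real video data (the tensor shapes, batch size and the helper name `run_one_epoch` are illustrative, not taken from the original job). With `num_workers=0` the batches are built in the main process, so no worker-to-main transfer takes place:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def run_one_epoch(num_workers):
    """Iterate once over a synthetic dataset and return the number of samples seen."""
    # Hypothetical stand-in data: 64 small "images" with integer labels.
    dataset = TensorDataset(
        torch.zeros(64, 3, 32, 32),
        torch.zeros(64, dtype=torch.long),
    )
    loader = DataLoader(dataset, batch_size=8, num_workers=num_workers)
    seen = 0
    for images, labels in loader:
        seen += images.shape[0]
    return seen


if __name__ == "__main__":
    # num_workers=0 loads everything in the main process.
    print(run_one_epoch(num_workers=0))
```

If the real job runs cleanly with `num_workers=0` but fails once workers are enabled, that points at the worker-to-main transfer rather than at the model or the distributed setup.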
I am not sure, but I cannot set num_workers to 0, because data loading is the bottleneck in our video task. We need at least 8 workers per GPU to make sure data loading does not increase the training time too much.
I meant as a test, to confirm that this is where the error is coming from!
I have tested setting num_workers to 1, and the problem disappears. I can also confirm that this problem is related to at least batch_size, image size and num_workers.
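A rough back-of-the-envelope estimate shows why those three settings interact. The sketch below assumes float32 samples and roughly two batches in flight per worker; the clip shape (16 frames of 3x224x224) and the helper name `estimate_shm_bytes` are made up for illustration, and the real footprint depends on the PyTorch version and the dataset:

```python
import math


def estimate_shm_bytes(batch_size, sample_shape, num_workers,
                       bytes_per_element=4, batches_in_flight_per_worker=2):
    """Approximate bytes of batch data waiting between workers and the main process."""
    elements_per_sample = math.prod(sample_shape)
    bytes_per_batch = batch_size * elements_per_sample * bytes_per_element
    return bytes_per_batch * num_workers * batches_in_flight_per_worker


if __name__ == "__main__":
    # Hypothetical video batch: 8 clips of 16 frames, 3 channels, 224x224 pixels.
    total = estimate_shm_bytes(batch_size=8, sample_shape=(16, 3, 224, 224), num_workers=8)
    print(f"{total / 2**30:.2f} GiB")
```

Under these assumptions, halving the batch size, the resolution or the number of workers shrinks the estimate proportionally, which matches the observation that the error goes away with a single worker.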
I’m not sure there is any way to perform the sharing between the loading processes and the main one that can replace the shared memory.
You might have to reduce the number of workers if you cannot increase the shared memory.
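To decide how far the worker count has to come down, it may help to check how much shared memory the server actually provides. This is a small sketch using only the Python standard library; it assumes a Linux system where shared memory is mounted at `/dev/shm`, and the helper name `shm_usage` is hypothetical:

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) bytes for the shared memory mount, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


if __name__ == "__main__":
    result = shm_usage()
    if result is None:
        print("/dev/shm not found on this system")
    else:
        total, used, free = result
        print(f"total={total / 2**20:.0f} MiB, used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```

Comparing the free space with the amount of batch data in flight gives a rough upper bound on the number of workers that fit.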
Maybe @smth has some other ideas to overcome this?