num_workers causes insufficient shared memory

ran_wu · March 2, 2022, 3:38am

I am trying to train on ImageNet with four 3090 graphics cards and a batch size of 4096. However, when I set num_workers higher than 8, this error shuts down my training program.

ran_wu · March 2, 2022, 3:40am

ptrblck · March 2, 2022, 6:09am

Did you try to increase your shared memory limit, or check which limit is currently set?
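As a quick way to check the current limit, here is a minimal sketch that reads the size of the shared memory filesystem from Python. It assumes a Linux machine where shared memory is mounted at `/dev/shm` (the usual default, but your setup may differ); the helper name `shm_usage_gib` is just an illustrative name, not a PyTorch API.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_usage_gib(path="/dev/shm"):
    """Return total, used and free space of the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return {
        "total": usage.total / GIB,
        "used": usage.used / GIB,
        "free": usage.free / GIB,
    }


if __name__ == "__main__":
    shm_path = "/dev/shm"  # assumption: default Linux shared memory mount point
    if os.path.exists(shm_path):
        for name, value in shm_usage_gib(shm_path).items():
            print(f"{name:>5}: {value:.2f} GiB")
    else:
        print(f"{shm_path} not found; shared memory may be mounted elsewhere")
```

Running this while training is active shows how close the job gets to the limit.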

ran_wu · March 2, 2022, 8:20am

I am working on a leased server. The shared memory of the virtual environment is 20 GB, and I can't change this setting. Is there any way to avoid this issue?

ptrblck · March 2, 2022, 8:21am

Assuming you are indeed using all 20GB of it, I think the only workaround would be to reduce the shared memory usage by reducing the number of workers.
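To get a feel for how the worker count relates to the limit, here is a rough back-of-the-envelope sketch. It assumes each worker keeps up to `prefetch_factor` full batches (the `DataLoader` default is 2) in shared memory at once and ignores all other overhead, so treat the result as a crude estimate rather than a measurement. The function names and the example sample size are hypothetical, chosen only for illustration.

```python
def estimate_prefetch_bytes(batch_size, sample_bytes, num_workers, prefetch_factor=2):
    """Crude estimate of bytes held by prefetched batches across all workers."""
    return batch_size * sample_bytes * num_workers * prefetch_factor


def max_workers_within(budget_bytes, batch_size, sample_bytes, prefetch_factor=2):
    """Largest worker count whose estimated prefetched data fits in the budget."""
    per_worker = batch_size * sample_bytes * prefetch_factor
    if per_worker <= 0:
        raise ValueError("batch_size, sample_bytes and prefetch_factor must be positive")
    return budget_bytes // per_worker


if __name__ == "__main__":
    # Assumed example values: one float32 image of shape 3 x 224 x 224,
    # a per-process batch of 1024, and a 20 GiB shared memory limit.
    sample_bytes = 3 * 224 * 224 * 4
    budget = 20 * 1024 ** 3
    for workers in (4, 8, 16):
        gib = estimate_prefetch_bytes(1024, sample_bytes, workers) / 1024 ** 3
        print(f"num_workers={workers}: about {gib:.1f} GiB prefetched")
    print("estimated max workers:", max_workers_within(budget, 1024, sample_bytes))
```

If several training processes run on the same machine, each with its own workers, they share the same limit, so the budget per process is correspondingly smaller.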