Problem with fork-based multiprocessing DataLoader on Ubuntu

A week ago, after a lot of work on the code base, I noticed a segmentation fault right before the main training loop starts.
I found that training crashes while fetching a batch from the train DataLoader. Interestingly, some timm models like ese_vovnet and mobilenetv3_large_075 never work at all, while others like mobilenet_large_100 sometimes work, especially if I use EMA, but not every time. If the first epoch manages to start, then training runs to the end (many tens of epochs) without any problems, every time. I assume a deadlock occurs while the worker processes are being created.
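
For reference, the crash happens roughly at this point (a minimal sketch; the dataset, batch size, and worker count are stand-ins for my actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for my real map-style dataset
train_dataset = TensorDataset(
    torch.randn(256, 3, 224, 224),
    torch.zeros(256, dtype=torch.long),
)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,  # workers are forked on Linux by default
)

# The segfault happens here, when the first batch is fetched and
# the worker processes get created:
for images, targets in train_loader:
    break
```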
If I call multiprocessing.set_start_method('spawn') in __main__, all problems disappear except one: training time triples, because the time taken to create every worker grows, and profiling shows that _io.BufferedWriter.write() accounts for most of it. In total, the time spent on multiprocessing overhead (_io.BufferedWriter.write(), select.poll.poll(), _thread.lock.acquire()) exceeds the time taken by the actual training work (mainly imread(), torch.conv2d(), run_backward()) by a factor of three to four.
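
Concretely, the workaround looks like this (a minimal sketch of the change; main() stands for my training entry point):

```python
import multiprocessing

def main():
    ...  # build DataLoaders, create the model, run the training loop

if __name__ == '__main__':
    # Start every worker as a fresh interpreter instead of forking
    # the parent process; this removes the crash but slows startup.
    multiprocessing.set_start_method('spawn')
    main()
```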

So I would like to keep using the 'fork' start method, but I don't know where to start. I don't use any homemade multiprocessing in my code; only the framework's DataLoaders and a few libraries like pandarallel use multiprocessing, as in the sketch below.
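
For illustration, the pandarallel usage amounts to something like this (a hypothetical sketch; the DataFrame and column names are made up):

```python
import os
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # pandarallel spins up its own worker pool

# made-up example: applied once during preprocessing, before the
# DataLoader and its workers exist
df = pd.DataFrame({'path': ['a.jpg', 'b.jpg']})
df['exists'] = df['path'].parallel_apply(os.path.exists)
```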
I have some experience with C++ threads, but I've never worked with multiprocessing in Python. How should I tackle this kind of problem?