My training sometimes stucks oddly. It stops printing.
And the GPU utils is 0%.
When I press Crtl + C to terminate it, it prints log like this.
It seems that the error occurs when loading data by multiprocessing.
Anyone knows why?
The verison of some modules in my code:
I use the docker interactively, and use screen to run in background.