I am experiencing strange behavior from the DataLoader when I import some custom modules at the beginning of my training script. After the custom modules are loaded and the training starts (or after a few iterations), I encounter: "RuntimeError: DataLoader worker (pid 7326) is killed by signal: Floating point exception." when the worker tries to fetch the next batch.
If I don’t import them, everything runs fine; if I set num_workers = 0, the network trains, but extremely slowly.
I have spent quite a lot of time trying to import my custom modules differently (at first I was using sys.path.append, but I have since changed everything to “import folder.subfolder.script_name”, which seems more correct). However, this does not solve the problem.
It is very hard to debug for me, could you give me some hints? @ptrblck or @albanD ?
This is surprising indeed.
With 0 workers, does it go through a full epoch without issues (even though it’s slow)? Could it be that one sample is problematic?
I’m afraid I don’t have a silver bullet idea to debug this.
What I would do is something like an ablation study: remove things from your Dataset code until it stops crashing (you can make the training loop a no-op and just load samples and discard them to make this run faster). I would be curious to know what the minimal change is that causes this!
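To make the ablation runs fast, the training loop can be reduced to a no-op that only pulls batches and throws them away. A minimal sketch, with a placeholder Dataset standing in for your own:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):  # stand-in for your actual Dataset
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.randn(3), idx

# num_workers=0 keeps loading in the main process, so any exception
# surfaces with a usable traceback instead of a killed worker.
loader = DataLoader(MyDataset(), batch_size=8, num_workers=0)

# No-op training loop: load each batch and discard it.
for step, (data, idx) in enumerate(loader):
    pass
print("epoch completed:", step + 1, "batches")
```

If the full epoch finishes cleanly here, the data itself is likely fine and the crash comes from something the imports change in the worker processes.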
Thanks so much for your fast response,
I did not try to run a full epoch with 0 workers because it is really slow. I will try to identify exactly where this happens in the code, and I will also run a full epoch with 0 workers.
I think the dataset is fine, because this does not seem to happen when I remove my custom modules.
My best guess is that, since my custom scripts involve loading a module for inference, and the module is loaded in the __init__ method of a class, this somehow interferes with the training script, even though the loaded modules are only used after training is completed.
Thanks again, I will come back to you!
I will also try to
@albanD, I finally found the issue!
I have used tensorboard for logging during training, and I use this repository, TensorFlowLog, to export the logs. If I do a relative import of TensorFlowLog from main.py, everything is fine; however, if I do a relative import of the same module from a file in another directory, it creates problems. By “another directory” I mean a directory inside the project that is not a child of main.py’s directory.
Simply doing an absolute import of TensorFlowLog solved the issue!
EDIT: it actually did not solve the issue; I had accidentally removed some lines during debugging. The issue is still present. My workaround is to import tensorboard.backend.everything at the very end of the code. I really hate this solution, but I found nothing better.
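The deferred-import workaround can be sketched as keeping the problematic import out of module scope and pulling it in only inside the function that runs after training. The function names below (`train`, `export_logs`) and the exact tensorboard submodule are illustrative assumptions, not the actual code:

```python
# Sketch of a deferred import: nothing from tensorboard is imported at
# module load time, so DataLoader workers spawn before it ever runs.

def train():
    # ... training loop runs here; no tensorboard code imported yet ...
    return "trained"

def export_logs(logdir):
    # Import at call time, after all DataLoader workers have exited.
    # The submodule path here is a placeholder, not a confirmed API.
    import tensorboard.backend  # deferred, not at top of file
    ...

result = train()
# export_logs("runs/")  # called only once training is over
```

This keeps whatever the import mutates (threads, C extensions, signal handlers) out of the processes that fork the workers, which may be why the crash disappears.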
This bug, I think, was hard to detect because (1) it fails silently during the import phase, and (2) it generates a “floating point exception” during training, which seems completely unrelated to the actual issue.
After this experience, I think tensorboard is very buggy in my setup. I was thinking it might be better to save logs manually by incrementally updating a DataFrame and saving it as CSV. I just need to save some scalars.
What would be, in your view, a simple and clean way to save training logs (loss, accuracy, validation…)?
Thanks so much,
Nice catch. That is indeed very far from anything I could have guessed, haha.
I don’t have experience with the latest logging solutions.
If you only want to log numerical values though, you might want to check vanilla Python options, which might be more mature.
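For scalars only, a plain stdlib csv logger is enough and has no import side effects. A minimal sketch; the class and file names (`ScalarLogger`, `log.csv`) are illustrative, not from any library:

```python
import csv
import os

class ScalarLogger:
    """Append one CSV row of scalars per call; header written once."""

    def __init__(self, path, fields):
        self.path, self.fields = path, fields
        is_new = not os.path.exists(path)
        self._f = open(path, "a", newline="")
        self._w = csv.DictWriter(self._f, fieldnames=fields)
        if is_new:
            self._w.writeheader()

    def log(self, **row):
        self._w.writerow(row)
        self._f.flush()  # keep logs on disk even if training crashes

    def close(self):
        self._f.close()

logger = ScalarLogger("log.csv", ["epoch", "loss", "val_acc"])
logger.log(epoch=1, loss=0.93, val_acc=0.41)
logger.log(epoch=2, loss=0.71, val_acc=0.55)
logger.close()
```

Appending and flushing on every call trades a little I/O for crash safety, and the resulting CSV loads straight into pandas for plotting later.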