What causes Zombie processes?

After several training steps, the script begins to evaluate the performance of the model, but it gets stuck in the evaluation step. When checking the process, some zombie processes appear.

I’ve set the num_workers=0 and the same problem occurs.

In linux terms, zombie process is when a process dies and its leftovers are not cleaned up by the parent-process/system. They are cleaned up eventually after sometime.

In other words, zombie process should be cleaned up by the system and scheduling programs. Therefore, the training process getting stuck is due to the failure of my system.

Thanks, I need to check my operating system.

It depends on how the processes which are evaluating are handled in terms of their life cycle. Its normal to have zombie process in a normal running system. Important to note is that zombie process means that process is no more active. You can check if the process which has turned into a zombie process died gracefully or because of some error. It will help to find the root cause.

But zombie process does not means system failure

Could you please tell me how to achieve this?