I found that fault tolerance is achieved by restarting:
- The restart count is kept in the memory of each training node, and it is not the same as the number of failures.
- If a GPU card drops off a node, the Pod is restarted repeatedly until the maximum number of restarts is reached.
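
To make the behavior concrete, here is a minimal sketch of what I understand to be happening. It is not the framework's actual code: `MAX_RESTARTS`, `run_with_restarts`, `make_transient`, and the `gpu_is_healthy` callback are hypothetical names used only for illustration, and the in-memory counter is my assumption about how the count is tracked.

```python
MAX_RESTARTS = 3  # hypothetical restart budget, not a real framework default


def run_with_restarts(gpu_is_healthy, max_restarts=MAX_RESTARTS):
    """Simulate restart-based fault tolerance on a single node.

    gpu_is_healthy: callable returning True when the simulated GPU works.
    Returns (succeeded, restarts_used).
    """
    restarts = 0  # assumption: the counter lives only in this process's memory
    while True:
        if gpu_is_healthy():
            return True, restarts
        if restarts >= max_restarts:
            return False, restarts  # budget exhausted, give up
        restarts += 1  # restart in place and try again


def make_transient(failures):
    """Build a health check that fails `failures` times, then recovers."""
    state = {"left": failures}

    def check():
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return check


# Transient fault: one restart is enough.
print(run_with_restarts(make_transient(1)))  # (True, 1)

# Dropped card: the fault never clears, so every restart is wasted.
print(run_with_restarts(lambda: False))  # (False, 3)
```

The second call is the case I am asking about: restarting in place cannot clear a hardware fault, so the whole restart budget is spent without any progress.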
Has anyone else run into the same problem, and does the community have a recommended way to handle it?