Fault tolerance with k8s

Shuaijian_Wang · September 19, 2024, 8:45am

From what I can tell, fault tolerance is achieved by restarting. Two things I noticed:

The restart count is kept in the memory of each training node, and it is not the same as the number of failures.
If a GPU card drops out, the Pod is restarted over and over until the maximum number of restarts is reached.
Has anyone else run into the same problem, and does the community have a recommended way to deal with it?
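
To make the behavior described above concrete, here is a minimal simulation sketch. It is not the code of any real launcher or of Kubernetes: `NodeAgent`, `simulate_lost_gpu`, and `simulate_cluster` are hypothetical names, and the sketch assumes each node counts only the restarts it performs itself and that a lost GPU never comes back on that node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeAgent:
    """Hypothetical per-node agent; illustration only, not a real API."""
    max_restarts: int
    restarts_used: int = 0  # kept only in this agent's memory

    def on_local_failure(self) -> bool:
        """Return True if the agent is still allowed to restart its workers."""
        if self.restarts_used >= self.max_restarts:
            return False
        self.restarts_used += 1
        return True


def simulate_lost_gpu(max_restarts: int) -> int:
    """Count launch attempts on a node whose GPU never comes back."""
    agent = NodeAgent(max_restarts=max_restarts)
    attempts = 1  # the initial launch, which fails because the GPU is gone
    while agent.on_local_failure():
        attempts += 1  # every restart fails again for the same reason
    return attempts


def simulate_cluster(failures_by_node: List[int], max_restarts: int) -> List[int]:
    """Return each node's private restart counter after the given failures."""
    agents = [NodeAgent(max_restarts=max_restarts) for _ in failures_by_node]
    for agent, failures in zip(agents, failures_by_node):
        for _ in range(failures):
            agent.on_local_failure()
    return [agent.restarts_used for agent in agents]


if __name__ == "__main__":
    # A permanently lost GPU burns through the whole restart budget.
    print(simulate_lost_gpu(max_restarts=3))
    # Three failures in total, but no single node's counter says 3.
    print(simulate_cluster([2, 0, 1], max_restarts=3))
```

Under these assumptions, a hardware fault that persists across restarts uses up the entire restart budget without making progress, and no single node's counter reflects the total number of failures in the job.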