Fault-tolerance for large-scale training

Hello community!

I’m hoping to get some recommendations regarding fault-tolerance in large-scale training.

From what I’ve read, one common practice is to restart the job when a failure is detected and resume from the last saved checkpoint. The problem then becomes how frequently checkpoints can be saved and how quickly they can be loaded, and I’ve seen efforts from the PyTorch community on that front, like async checkpointing.
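
To make the checkpointing part concrete, here’s roughly the pattern I had in mind with `torch.distributed.checkpoint.async_save` (just a sketch based on my reading of the docs; `model`, `optimizer`, `step`, and the checkpoint path are placeholders, not my real setup):

```python
import torch.distributed.checkpoint as dcp

CKPT_ROOT = "/tmp/ckpt"  # placeholder path


def save_async(model, optimizer, step):
    # Plain state dicts are enough for DDP; for FSDP/TP I assume the
    # DTensor-aware helpers in torch.distributed.checkpoint.state_dict
    # would be used to build these instead.
    state = {
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
    }
    # async_save stages the tensors and writes the checkpoint in the
    # background, returning a Future so the training loop keeps running.
    return dcp.async_save(state, checkpoint_id=f"{CKPT_ROOT}/step-{step}")
```

My assumption is that I’d hold on to the returned Future and make sure it’s done before kicking off the next save (or before exiting), so writes don’t pile up.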

Besides that, I’m wondering whether there are other methods that can be leveraged. For example, can torchelastic be used for large-scale training when there’s not only data parallelism but also TP and PP? From torchelastic’s documentation it doesn’t seem like it, but I was hoping to get more clarification from the experts here.
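
In case it helps clarify what I mean: the part I can picture is the resume path after the elastic agent restarts the workers, something like the sketch below (again just my guess; the `step-*` directory layout and the `resume_if_possible` helper are things I made up, not a torchelastic API):

```python
import os

import torch.distributed.checkpoint as dcp

CKPT_ROOT = "/tmp/ckpt"  # same placeholder path as above


def resume_if_possible(model, optimizer):
    # After a restart, every rank runs this again: find the newest step-*
    # checkpoint and load it in place. A real version would also need to
    # skip partially written checkpoints from a crash mid-save.
    if not os.path.isdir(CKPT_ROOT):
        return 0
    ckpts = sorted(
        (d for d in os.listdir(CKPT_ROOT) if d.startswith("step-")),
        key=lambda d: int(d.split("-")[1]),
    )
    if not ckpts:
        return 0  # nothing saved yet, start from step 0
    latest = os.path.join(CKPT_ROOT, ckpts[-1])
    state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.load(state, checkpoint_id=latest)  # loads into `state` in place
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return int(ckpts[-1].split("-")[1])
```

I’d then launch with torchrun’s elastic options (e.g. `--max-restarts`) so workers get restarted automatically after a failure. What I can’t tell from the docs is whether that restart-and-rejoin mechanism composes cleanly with the TP/PP process groups, which is really the core of my question.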

Any advice is appreciated!