[Question] Do large-scale training workloads hit storage/runtime GC bottlenecks during data loading?

Hey folks, first time here, and glad to join the community! :waving_hand:

In model training, have you seen data-loading bottlenecks caused by memory allocation, GC, or RSS growth in the storage client or gateway? How did you diagnose it, and would a Rust-based stack solve the problem once and for all?

Please share your thoughts and any possible solutions!
