Sparse DDP w/ gloo broken in 2.1.0?

I’ve made a GitHub issue here, but it’s possible that this forum is the better place to discuss it.

The TL;DR is that we have a model with sparse embeddings training with DDP via PyTorch Lightning and Ray. NCCL doesn’t support allreduce on sparse tensors (yet), so we use the gloo backend. This was working fine until we recently attempted to update to torch 2.1.0; now we get an error:
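For context, here’s a minimal sketch of the kind of setup involved (simplified and hypothetical, not our actual Lightning/Ray code): a model with a sparse `nn.Embedding` wrapped in DDP on the gloo backend.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo backend, since NCCL has no sparse allreduce support
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Sequential(
        nn.Embedding(1000, 16, sparse=True),  # sparse=True -> sparse gradients
        nn.Linear(16, 1),
    )
    ddp_model = DDP(model)

    loss = ddp_model(torch.randint(0, 1000, (8,))).sum()
    # Gradient sync happens here: on earlier torch releases the sparse
    # embedding gradient is allreduced over gloo; on 2.1.0 this is where
    # we see the RuntimeError quoted above.
    loss.backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```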

RuntimeError: Backend gloo does not support allreduce

This comes from this change. I’m not well-versed enough in torch internals or GPU programming to understand the finer points of the PR, but it seems odd that it uniformly disables allreduce_sparse with a TORCH_CHECK(false, ...) condition that raises this runtime error.
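If I’m reading the change right, even a bare all_reduce on a sparse tensor over gloo should hit the same check, independent of DDP/Lightning. A hypothetical minimal repro (single rank, CPU), assuming sparse all_reduce on gloo behaved as documented before 2.1.0:

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# A small sparse COO tensor standing in for a sparse embedding gradient.
t = torch.sparse_coo_tensor(
    indices=torch.tensor([[0, 2]]),
    values=torch.tensor([1.0, 2.0]),
    size=(4,),
)

# Sparse allreduce is documented as gloo-only; on 2.1.0 I'd expect this
# to hit the TORCH_CHECK(false, ...) and raise the error quoted above.
dist.all_reduce(t)

dist.destroy_process_group()
```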

Am I misunderstanding something, or is this a bug?

Just saw this forum post. This is definitely a bug, and thank you for surfacing it! Will follow up through GitHub.
