Sparse=True Error in distributed training

rajamohan_reddy · June 4, 2022, 5:06am

RuntimeError: Cannot access data pointer of Tensor that doesn’t have storage

I am training a model using DistributedDataParallel. When I set sparse=True in nn.Embedding, I get the error above.

Please help me resolve this error.
I am training the model via the SageMaker PyTorch estimator with smddp as the backend.

wanchaol · June 7, 2022, 5:02am

Thanks for posting the question, @rajamohan_reddy. This does not look like a DDP problem but an optimizer problem. As the note in the docs says, sparse gradients are supported only by certain optimizers, not all of them. Did you specify an optimizer that does not support sparse gradients? Embedding — PyTorch 1.11.0 documentation
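
For reference, the note in the nn.Embedding docs lists optim.SGD, optim.SparseAdam and optim.Adagrad (CPU only) as the optimizers that accept sparse gradients. Below is a minimal single-process sketch of one way to pair a sparse embedding with a supported optimizer. It does not use DDP or smddp, so it only illustrates the optimizer side and may not be the cause of the error above. `TinyModel`, the layer sizes and the learning rates are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyModel(nn.Module):
    """Hypothetical model: one sparse embedding followed by one dense layer."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings=10, embedding_dim=4, sparse=True)
        self.fc = nn.Linear(4, 1)

    def forward(self, idx):
        return self.fc(self.emb(idx))


model = TinyModel()

# SparseAdam handles the sparse embedding gradient; the dense layer gets a
# regular Adam, since SparseAdam does not accept dense gradients.
sparse_opt = torch.optim.SparseAdam(model.emb.parameters(), lr=0.01)
dense_opt = torch.optim.Adam(model.fc.parameters(), lr=0.01)

idx = torch.tensor([1, 3, 3, 7])
before = model.emb.weight.detach().clone()

sparse_opt.zero_grad()
dense_opt.zero_grad()
loss = model(idx).sum()
loss.backward()

# With sparse=True the embedding gradient is a sparse tensor.
grad_is_sparse = model.emb.weight.grad.is_sparse

sparse_opt.step()
dense_opt.step()

# Rows of the embedding table whose values changed after the step.
changed_rows = (
    (model.emb.weight.detach() != before).any(dim=1).nonzero().flatten().tolist()
)
print(grad_is_sparse, changed_rows)
```

If the model has only sparse parameters, a single optimizer from the list above would be enough; the split into two optimizers is just one way to handle a mix of sparse and dense parameters.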