When the DTensor dispatcher encounters an op with a mixture of `torch.Tensor` and `DTensor` arguments, the following error is raised:
File "/home/.../torch/distributed/_tensor/_dispatch.py", line 354, in try_get_replicate_spec
raise RuntimeError(
RuntimeError: aten.embedding.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
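For context, here is a minimal sketch of how the error can be triggered; the single-process gloo setup and the shapes are illustrative assumptions, not taken from the report:

```python
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor

# Single-process setup so the sketch runs without torchrun.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

mesh = DeviceMesh("cpu", [0])

# Embedding weight managed as a DTensor, replicated over the mesh.
weight = distribute_tensor(torch.randn(10, 4), mesh, [Replicate()])

# A plain torch.Tensor created at runtime and never converted to DTensor.
indices = torch.tensor([0, 2, 5])

# Mixing the DTensor weight with the plain-Tensor indices raises:
# RuntimeError: aten.embedding.default: got mixed torch.Tensor and DTensor, ...
out = F.embedding(indices, weight)
```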
This is annoying since HuggingFace models use tensors created at runtime that are not set as module attributes, and hence cannot be cast to DTensor before a forward pass. This behaviour can be avoided by setting `self._allow_implicit_replication` to `True` in the `OpDispatcher` at `torch/distributed/_tensor/_dispatch.py`, which currently requires either editing local installation files or forking torch. Could this be changed to something like the following?
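In the meantime, a workaround that avoids editing installed files is to flip the flag from user code. This is a minimal sketch, assuming a torch version where `DTensor` keeps its dispatcher in the private `_op_dispatcher` class attribute:

```python
from torch.distributed._tensor import DTensor

# Private API: assumes DTensor holds its OpDispatcher instance as the
# _op_dispatcher class attribute; this may break across torch versions.
DTensor._op_dispatcher._allow_implicit_replication = True
```

Recent torch versions also ship an experimental `implicit_replication()` context manager under `torch.distributed._tensor.experimental` that toggles the same flag, if it is available in your build.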
```python
self._allow_implicit_replication = (
    # os.environ values are strings and the key may be unset, so use .get with
    # a default and an explicit comparison rather than indexing directly
    os.environ.get("TORCH_DTENSOR_ALLOW_IMPLICIT_REPLICATION", "0") == "1"
)
```
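The flag could then be enabled per run, e.g. `TORCH_DTENSOR_ALLOW_IMPLICIT_REPLICATION=1 torchrun train.py`, without touching the installed files.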