in case of model parallel training, does the computational graph also store which device contains the required weights? (I think it is yes)
This depends on which training paradigm you are using:
Local: PyTorch eager mode builds the autograd graph during forward pass. If the forward pass uses multiple devices on the same machine, there will be copy operators representing the Device-to-Device tensor copy. Those copy operators will be recorded in the autograd graph. So yep, in this case, the graph has the device information.
SPMD: For Single-Program Multi-Data style distributed/parallel training, usually each GPU device is kind-of “independent” to other GPUs in terms of forward/backward computations. Different training paradigms, such as DDP/FSDP will use different communications to synchronize those devices. DDP AllReduces gradients after backward computation, while FSDP AllGather full parameter tensors before computation and ReduceScatter gradients after backward computation. So, in this case, the autograd graph does not records remote device information. Instead, it relies on ProcessGroup and NCCL/Gloo comm libraries to talk to other devices.
MPMD: For Multi-Program Multi-Data style distributed training (e.g., PyTorch RPC), there will be a distributed autograd graph that spans multiple processes and remote GPUs. It attempts to treat a remote GPU as-if it’s a local GPU.
what does pytorch do if data is remotely cached (there are 2 options here: invalidate all cached copies and only update the original data or update both original and cached copies)
If you are referring to input data, this is usually handled by data loaders. As input is read-only, we don’t need to worry about cached copies.
If you are referring to model parameters, there are two different styles:
- Synchronous: in every iteration, all trainers will read the param from previous iteration, generate gradients locally, synchronize gradients globally, and then update parameter. So there won’t be staleness here.
- Asynchronous: each trainer might proceed at its own pace, with some optional global staleness bound. This is an open research area. So, no fixed answer for this one.
Also GPU L1 and L2 caches are really small, they are usually not large enough to even hold an individual tensor. It’s usually up to operator kernel implementation to decide how to utilize shared memory and global memory from GPU. Every kernel will read input from CUDA global memory and write output to CUDA global memory. If we treat kernel design and distributed training design as two different domains, L1/L2 cache optimization might belong to the former.