Hi! I am fairly new to PyTorch and am working with the distributed package.
I am trying to implement some parallelism functionality but am running into a couple of issues:
- I want to partition/shard a tensor across two ranks. I know there are some things hardcoded in the snippet that I still need to figure out, but for some reason dist.scatter is producing [[1], [3]] on rank 0 and [[2], [3]] on rank 1 instead of the expected [[1], [3]] and [[2], [4]]:
```python
if local_rank == 0:
    weight = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32).cuda(local_rank)
else:
    weight = None

world_size = dist.get_world_size(m_view_2)

chunks = None
if local_rank == 0:
    chunks = list(torch.chunk(weight, world_size, dim=1))

part_tensor = torch.empty((2, 1), dtype=torch.float32).cuda(local_rank)  # need to figure out how to dynamically get shape
dist.scatter(part_tensor, scatter_list=chunks, src=0, group=m_view_2, async_op=False)
```
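In case it helps, here is a standalone version of the snippet that should be runnable as-is (lightly simplified; assume an NCCL backend, a single 2-GPU node launched with torchrun, and m_view_2 created as a process group over ranks 0 and 1):

```python
# repro.py -- minimal standalone version of the snippet above
# assumptions: 2 GPUs on one node, launched with `torchrun --nproc_per_node=2 repro.py`,
# NCCL backend, and m_view_2 being a process group that contains ranks 0 and 1
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

m_view_2 = dist.new_group(ranks=[0, 1])

if local_rank == 0:
    weight = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32).cuda(local_rank)
else:
    weight = None

world_size = dist.get_world_size(m_view_2)

chunks = None
if local_rank == 0:
    chunks = list(torch.chunk(weight, world_size, dim=1))

part_tensor = torch.empty((2, 1), dtype=torch.float32).cuda(local_rank)
dist.scatter(part_tensor, scatter_list=chunks, src=0, group=m_view_2, async_op=False)
print(f"rank {local_rank}: {part_tensor.tolist()}")

dist.destroy_process_group()
```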
On a more general note, what I am trying to do is take in a sequence of operations and execute them one at a time. For example (rough code sketch right after this list):
- partition input tensor across devices 0 and 1
- replicate weights across devices 0 and 1
- multiply input tensor and weight
- reduce the results onto device 0
- replicate the result across devices 0 and 1
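To make that concrete, here is roughly what I imagine one pass through such a sequence looking like, written out as explicit per-step calls. This is only a sketch under a few assumptions of mine: two ranks on one node (so local rank equals rank within the group), the same imports/setup as the repro above, and made-up names and numbers (run_sequence is just a placeholder). I am also assuming each rank should use only its matching rows of the replicated weight so the reduce in step 4 sums to the full product, and the .contiguous() call on the chunks is a guess on my part:

```python
def run_sequence(group, local_rank):
    """Sketch of the five steps above as explicit calls (placeholder names/values)."""
    world_size = dist.get_world_size(group)  # expected to be 2 here
    device = torch.device(f"cuda:{local_rank}")

    # 1. partition the input tensor (column-wise) across devices 0 and 1
    x_shard = torch.empty((2, 1), dtype=torch.float32, device=device)
    if local_rank == 0:
        x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], device=device)
        # guess on my part: make each chunk contiguous before sending
        x_chunks = [c.contiguous() for c in torch.chunk(x, world_size, dim=1)]
    else:
        x_chunks = None
    dist.scatter(x_shard, scatter_list=x_chunks, src=0, group=group)

    # 2. replicate the weight across devices 0 and 1
    if local_rank == 0:
        weight = torch.tensor([[5.0, 6.0], [7.0, 8.0]], device=device)
    else:
        weight = torch.empty((2, 2), dtype=torch.float32, device=device)
    dist.broadcast(weight, src=0, group=group)

    # 3. multiply: each rank multiplies its input shard by the matching rows
    #    of the (replicated) weight, producing a partial result
    partial = x_shard @ weight[local_rank : local_rank + 1, :]

    # 4. reduce the partial results onto device 0
    dist.reduce(partial, dst=0, op=dist.ReduceOp.SUM, group=group)

    # 5. replicate the result across devices 0 and 1
    dist.broadcast(partial, src=0, group=group)
    return partial  # should equal x @ weight on every rank
```

Each bullet maps to exactly one call here, which is the level of explicitness I am after.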
I know there are libraries out there that would make what I am outlining relatively simple, but my goal here is to be explicit about each individual step instead of collapsing them into something like a single all_reduce. Does anyone have thoughts on how I should approach this, and what considerations I should keep in mind? (Async vs. sync operations are one thing I suspect I need to be careful about.)
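On the async point specifically, my current understanding is that passing async_op=True to any of these calls returns a work handle that I need to wait() on before touching the tensor again, e.g. for the broadcast in step 2 of the sketch above:

```python
# my understanding of the async variant: the collective returns a Work handle
work = dist.broadcast(weight, src=0, group=group, async_op=True)
# ...potentially overlap other computation here...
work.wait()  # must complete before reading or writing `weight` again
```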