Tensor parallelism: simple Embedding example

Hi,
I was going through the tensor parallel examples. After spending a few hours, I still can't get my head around it.
Let's assume we have the following class and 2 GPUs.
How can we use ColwiseParallel tensor parallelism to store the first half of entity_embeddings and relation_embeddings on the first GPU and the other halves on the second GPU?


import torch
from torch import nn

class DistMult(nn.Module):
    def __init__(self):
        super().__init__()
        self.entity_embeddings = torch.nn.Embedding(135, 32)
        self.relation_embeddings = torch.nn.Embedding(46, 32)

    def forward(self, h, r, t):
        h = self.entity_embeddings(h)
        r = self.relation_embeddings(r)
        t = self.entity_embeddings(t)
        return ((h * r) * t).sum(dim=1)

I constantly get the following RuntimeError:
RuntimeError: Function EmbeddingBackward0 returned an invalid gradient at index 0 - got [135, 32] but expected shape compatible with [46, 32]

Hi. Can you provide the end-to-end code so that we can help? In your code I didn’t see TP being applied.

Btw, you may find the tutorial helpful
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
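At a high level, applying TP means building a device mesh and then wrapping the model with parallelize_module and a plan that maps submodule names to parallel styles. A rough, self-contained sketch of that pattern (toy module just for illustration; launch with torchrun --nproc_per_node=2; exact import paths may differ slightly across PyTorch 2.x releases):

import os
import torch
from torch import nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(8, 16)
        self.w2 = nn.Linear(16, 8)
    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # torchrun sets LOCAL_RANK
mesh = init_device_mesh("cuda", (2,))                  # 1-D mesh over the 2 GPUs

model = Toy().cuda()
# shard w1 column-wise and w2 row-wise, the usual pairing for an MLP
model = parallelize_module(model, mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
out = model(torch.randn(4, 8, device="cuda"))          # output is replicated on every rank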

Hi @tianyu,

Thank you for your response and for the link. I have already gone through it, but I couldn't really make use of it.

Here is an end-to-end code example.

import torch


class DistMult(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.entity_embeddings = torch.nn.Embedding(135, 32)
        self.relation_embeddings = torch.nn.Embedding(46, 32)

    def forward(self, h, r, t):
        h = self.entity_embeddings(h)
        r = self.relation_embeddings(r)
        t = self.entity_embeddings(t)
        # DistMult score: element-wise product of head, relation and tail embeddings
        return ((h * r) * t).sum(dim=1)

model = DistMult()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
# (head, relation, tail) index triples
triples = torch.LongTensor([[0, 0, 0],
                            [1, 1, 1],
                            [2, 2, 2]])
for i in range(100):
    yhat = model(triples[:, 0], triples[:, 1], triples[:, 2])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(yhat, torch.ones_like(yhat))
    loss.backward()
    print(loss)
    optim.step()
    optim.zero_grad()

The goal is to apply column-wise TP so that the first 16 columns of entity_embeddings and relation_embeddings are kept on the first GPU and the last 16 columns on the second GPU.

In your code, I didn't see TP being applied. You need to call parallelize_module explicitly with a parallelization plan (e.g. {"tok_embeddings": RowwiseParallel()}) before using the module.
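For your DistMult example specifically, a minimal sketch could look like the one below. I haven't run this exact script, and the import path for Replicate has moved between releases (torch.distributed._tensor in older 2.x versions, torch.distributed.tensor in newer ones), so treat it as a starting point rather than a drop-in solution. It assumes 2 GPUs and that you launch with torchrun --nproc_per_node=2, and that the DistMult class is defined as in your post.

# Sketch: column-wise sharding of both embedding tables across 2 GPUs.
# Launch with: torchrun --nproc_per_node=2 distmult_tp.py
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
from torch.distributed.tensor import Replicate  # older releases: torch.distributed._tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # torchrun sets LOCAL_RANK
mesh = init_device_mesh("cuda", (2,))                  # 1-D mesh over the 2 GPUs

model = DistMult().cuda()

# ColwiseParallel shards nn.Embedding weights along the embedding dimension (dim 1),
# so each rank keeps a [num_embeddings, 16] slice: columns 0-15 on rank 0, 16-31 on rank 1.
# output_layouts=Replicate() gathers the halves after the lookup, so forward() still sees
# full 32-dim vectors and the element-wise product / sum stay unchanged.
tp_plan = {
    "entity_embeddings": ColwiseParallel(output_layouts=Replicate()),
    "relation_embeddings": ColwiseParallel(output_layouts=Replicate()),
}
model = parallelize_module(model, mesh, tp_plan)

optim = torch.optim.Adam(model.parameters(), lr=1e-3)
triples = torch.LongTensor([[0, 0, 0], [1, 1, 1], [2, 2, 2]]).cuda()
for i in range(100):
    yhat = model(triples[:, 0], triples[:, 1], triples[:, 2])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(yhat, torch.ones_like(yhat))
    loss.backward()
    optim.step()
    optim.zero_grad()

To check the placement, you can inspect model.entity_embeddings.weight after parallelize_module: it should be a DTensor, and its local shard (.to_local()) should have shape [135, 16] on each rank.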