I have been beating my head against the wall and have come to this group to see if anything here looks like the culprit for the very poor performance of a simple two-tower model. When I look at a hold-out (future time period) set of customer-product interactions, the hit rate is far worse than just offering random products. I am evaluating offline on the top 5 product suggestions per customer.
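For context, the offline metric is computed roughly like the minimal sketch below, run against the unwrapped TwoTowerMF (holdout_users and holdout_items are placeholders for the actual hold-out tensors of user IDs and the item each user interacted with):

import torch

@torch.no_grad()
def hit_rate_at_k(model, holdout_users, holdout_items, n_items, k=5, device="cpu"):
    # Score every product for each hold-out customer, take the top-k,
    # and count how often the item they actually interacted with shows up.
    # (For a large hold-out set this would be done in chunks.)
    model.eval()
    item_vecs = model.item_emb(torch.arange(n_items, device=device))   # [n_items, emb_dim]
    user_vecs = model.user_emb(holdout_users.to(device))               # [n_eval, emb_dim]
    scores = user_vecs @ item_vecs.t()                                 # [n_eval, n_items]
    topk = scores.topk(k, dim=1).indices                               # [n_eval, k]
    hits = (topk == holdout_items.to(device).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()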
The model is very simple: just a dot product of the customer and product embeddings. There are around 5,000 products and 5.6 million customers:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMF(nn.Module):
    def __init__(self, n_items, n_users, emb_dim):
        super(TwoTowerMF, self).__init__()
        self.emb_dim = emb_dim
        self.item_emb = nn.Embedding(num_embeddings=n_items, embedding_dim=self.emb_dim, padding_idx=0)
        self.user_emb = nn.Embedding(num_embeddings=n_users, embedding_dim=self.emb_dim, padding_idx=0)
        self.dot = torch.matmul

    def forward(self, items, users):
        item_emb = self.item_emb(items)
        user_emb = self.user_emb(users)
        # [batch, batch] matrix of user-item dot products (in-batch scores)
        score = self.dot(user_emb, item_emb.t())
        return score
The dataset holds the interactions in memory and yields the item ID, customer ID, and CSP (the candidate sampling probability, i.e. the probability that the item appears in a batch):
from torch.utils.data import Dataset, DataLoader

class interactionDataset(Dataset):
    def __init__(self, item_user_inters):
        self.inters = torch.LongTensor(item_user_inters[:, 0:2])  # item ID and user ID columns
        self.csp = torch.Tensor(item_user_inters[:, 2])           # csp
    def __len__(self):
        return len(self.inters)  # number of interactions
    def __getitem__(self, idx):
        inter = self.inters[idx]
        csp = self.csp[idx]
        return inter[0], inter[1], csp  # item ID, user ID, csp
I'm using DDP on a 4-GPU instance:
train_ds = interactionDataset(pdf_ints_train)

if args.is_distributed:
    # distributed training sampler
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_ds, num_replicas=args.world_size, rank=rank
    )

train_loader = DataLoader(train_ds,
                          shuffle=not args.is_distributed,
                          sampler=train_sampler if args.is_distributed else None,
                          batch_size=args.batch_size_train,
                          drop_last=False,
                          num_workers=args.num_workers)
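For reference, the standard way to drive a DistributedSampler across epochs is sketched below; set_epoch() is what reseeds its shuffle so each epoch sees a different ordering (args.epochs is a placeholder for the actual epoch-count argument):

for epoch in range(args.epochs):
    if args.is_distributed:
        # reseed the sampler's shuffle for this epoch
        train_sampler.set_epoch(epoch)
    train(args, model, device, train_loader, optimizer, epoch, loss_fn, rank)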
The training loop is straightforward (I have tried with and without the adjustment to the loss based on the CSP):
def train(args, model, device, train_loader, optimizer, epoch, loss_fn, rank):
    model.train()
    train_loss = 0.0
    b_sz = len(next(iter(train_loader))[0])
    print(f"[GPU{args.rank}] Epoch {epoch} | Batchsize: {b_sz}")
    for batch_idx, (items, custs, csp) in enumerate(train_loader):
        custs = custs.to(device)
        items = items.to(device)
        csp = csp.to(device)
        optimizer.zero_grad()
        # Forward pass: [batch, batch] in-batch score matrix
        scores = model(items, custs)
        # Compute loss: subtract the log sampling probability (logQ correction)
        # and use the identity matrix as the target, so the diagonal is the positive pair
        loss = loss_fn(scores - torch.log(csp), torch.eye(custs.shape[0]).to(device))
        train_loss += loss.item()
        # Backward pass
        loss.backward()
        optimizer.step()
The loss is an in-batch softmax, which I think is set up just like TensorFlow Recommenders (a sketch of how I read it follows the setup code below).
import math
from torch.nn.parallel import DistributedDataParallel as DDP

model = TwoTowerMF(n_items=n_items+1, n_users=n_users+1, emb_dim=args.emb_dim)
# must come before call to DDP
model.to(device)
if args.is_distributed:
    model = DDP(model, device_ids=[local_rank])
    # Pin each GPU to a single library process
    torch.cuda.set_device(local_rank)

loss_fn = nn.CrossEntropyLoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr * math.sqrt(args.num_gpus))
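For what it's worth, here is a minimal sketch of how I read that loss: with an identity matrix as the soft target, CrossEntropyLoss should be equivalent to using the row index as an integer label (assuming a PyTorch version that accepts class-probability targets, 1.10+), so per batch it boils down to:

# Equivalent formulation of the in-batch softmax with the logQ correction,
# using integer labels instead of an identity-matrix target.
scores = model(items, custs)                   # [B, B]: rows = users, cols = in-batch items
logits = scores - torch.log(csp)               # subtract log sampling probability per item column
labels = torch.arange(custs.shape[0], device=logits.device)  # diagonal entries are the positives
loss = F.cross_entropy(logits, labels, reduction='sum')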
I have tried various learning rates, batch sizes, etc., and the model is always terrible.
Looking at the loss over the first epoch, it decreases with each batch but soon hits a point where it mostly stops improving.