Which implementation performs mini-batch gradient descent correctly?

Hello, I want to implement the mini-batch gradient descent step for the margin-based ranking loss below:

$$L = \sum_{(h,l,t)\in S}\;\sum_{(h',l',t')\in S'} \big[\gamma + d(h+l,\,t) - d(h'+l',\,t')\big]_+$$

where (h, l, t) ranges over the positive triples and (h', l', t') over the negative (corrupted) ones, d(h+l, t) is the distance between the two vectors h+l and t, γ is the margin, and [·]_+ is the maximum of 0 and the value inside it.
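
For reference, written out directly in PyTorch, the per-sample loss is something like this (just a sketch with made-up names, assuming the embedding rows are already looked up):

import torch
import torch.nn.functional as F

def per_sample_loss(pos_hl, pos_t, neg_hl, neg_t, gamma):
    # pos_hl = h + l and neg_hl = h' + l' (illustrative names)
    d_pos = (pos_hl - pos_t).norm(p=2, dim=1)  # d(h + l, t)
    d_neg = (neg_hl - neg_t).norm(p=2, dim=1)  # d(h' + l', t')
    return F.relu(gamma + d_pos - d_neg)       # [gamma + d_pos - d_neg]_+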

I have two implementations of this, and they behave differently, but I don't know which one is correct for the algorithm above.

  • Version 1: compute a per-sample loss vector for the mini-batch and call .backward() on loss.mean()
import torch
import torch.nn as nn

class Model(nn.Module):
    ...
    def forward(self, p_h, p_l, p_t, n_h, n_l, n_t):
        # d(h + l, t) for the positive triples, d(h' + l', t') for the negative ones
        dis1 = (self.entity_emb(p_h) + self.relation_emb(p_l) - self.entity_emb(p_t)).norm(p=2, dim=1)
        dis2 = (self.entity_emb(n_h) + self.relation_emb(n_l) - self.entity_emb(n_t)).norm(p=2, dim=1)
        return self.loss(dis1, dis2)

    def loss(self, dis1, dis2):
        # target = -1 asks for dis1 < dis2, so each element is [gamma + dis1 - dis2]_+
        target = -torch.ones_like(dis1)
        criterion = nn.MarginRankingLoss(margin=self.gamma, reduction='none')
        return criterion(dis1, dis2, target)
    ...

model = Model()
for batch_index in ...:  # mini-batch loop
    optimizer.zero_grad()
    loss = model(p_h, p_l, p_t, n_h, n_l, n_t)  # per-sample loss vector
    total_loss += loss.sum().item()
    loss.mean().backward()  # backprop the mini-batch average
    optimizer.step()
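
As a sanity check on the loss itself: nn.MarginRankingLoss with target −1 reduces to [γ + dis1 − dis2]_+ per sample, which matches the formula above (a standalone check with made-up numbers):

import torch
import torch.nn as nn
import torch.nn.functional as F

gamma = 1.0
dis1, dis2 = torch.rand(8), torch.rand(8)
criterion = nn.MarginRankingLoss(margin=gamma, reduction='none')
hinge = criterion(dis1, dis2, -torch.ones_like(dis1))
print(torch.allclose(hinge, F.relu(gamma + dis1 - dis2)))  # True
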
  • Version 2: sum the per-sample losses of the mini-batch and call .backward() on that sum
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    ...
    def forward(self, p_h, p_l, p_t, n_h, n_l, n_t):
        dis1 = (self.entity_emb(p_h) + self.relation_emb(p_l) - self.entity_emb(p_t)).norm(p=2, dim=1)
        dis2 = (self.entity_emb(n_h) + self.relation_emb(n_l) - self.entity_emb(n_t)).norm(p=2, dim=1)
        dis_diff = self.gamma + dis1 - dis2
        return torch.sum(F.relu(dis_diff))  # sum of [gamma + dis1 - dis2]_+ over the batch
    ...

model = Model()
for batch_index in ...:  # mini-batch loop
    optimizer.zero_grad()
    loss = model(p_h, p_l, p_t, n_h, n_l, n_t)  # scalar: summed loss
    total_loss += loss.item()
    loss.backward()  # backprop the mini-batch sum
    optimizer.step()
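
As far as I can tell, the two versions compute identical per-sample losses and differ only in reducing by mean vs. sum before .backward(), which rescales every gradient by 1/batch_size. A toy check of that scaling (made-up tensors, not my model):

import torch

x = torch.randn(4, 3, requires_grad=True)  # pretend parameters, batch of 4
x.norm(p=2, dim=1).mean().backward()
grad_mean = x.grad.clone()

x.grad.zero_()
x.norm(p=2, dim=1).sum().backward()
grad_sum = x.grad.clone()

print(torch.allclose(grad_mean * 4, grad_sum))  # True: mean-grad = sum-grad / N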

So which one should I use and why?