The location of `zero_grad` in the training loop

The PyTorch quickstart guide contains the following training loop:

```python
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
```

I was wondering: what difference would it make if the `zero_grad()` call were placed at the beginning of the loop instead?

```python
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        optimizer.zero_grad()
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
```

Should it produce the same result?

As far as I know, they are equivalent.
Do you observe different results?
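A quick way to convince yourself is to train two identically initialized models, one zeroing the gradients before the forward pass and one after, and compare the resulting parameters. A minimal sketch (the toy linear model and random data are made up for illustration):

```python
import torch
import torch.nn as nn

def train_step(model, opt, X, y, zero_first):
    # Both orderings call zero_grad() before loss.backward(),
    # so the gradients accumulated by backward() are identical.
    if zero_first:
        opt.zero_grad()
    pred = model(X)
    loss = nn.functional.mse_loss(pred, y)
    if not zero_first:
        opt.zero_grad()
    loss.backward()
    opt.step()

X, y = torch.randn(32, 10), torch.randn(32, 1)

models = []
for zero_first in (True, False):
    torch.manual_seed(0)  # identical initialization for both runs
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(5):
        train_step(model, opt, X, y, zero_first)
    models.append(model)

# The trained parameters match exactly under either ordering
for p1, p2 in zip(models[0].parameters(), models[1].parameters()):
    assert torch.equal(p1, p2)
print("both orderings give identical parameters")
```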

@InnovArul is right: both approaches work, since you are zeroing out the gradients before computing the new ones in the current iteration.
However, if you want to optimize the code a bit further, you could use `optimizer.zero_grad(set_to_none=True)`, which does not fill the `.grad` attributes with zeros but deletes them (sets them to `None`). This can save some memory and avoids an unnecessary accumulation kernel, since the first `backward()` call can then assign the new gradients instead of adding them to zero tensors.
In that case it is most beneficial to delete the gradients at the very end of the training loop or right at its beginning.
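To illustrate the difference (a small sketch, not from the original post): after `zero_grad(set_to_none=True)` the `.grad` attributes are `None` rather than zero-filled tensors, and the next `backward()` call recreates them. Note that in recent PyTorch versions `set_to_none=True` is already the default.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(8, 4)).sum()
out.backward()
opt.step()

# set_to_none=True deletes the gradient tensors entirely
opt.zero_grad(set_to_none=True)
print(all(p.grad is None for p in model.parameters()))  # True

# set_to_none=False keeps zero-filled tensors instead
out = model(torch.randn(8, 4)).sum()
out.backward()
opt.zero_grad(set_to_none=False)
print(all(torch.all(p.grad == 0) for p in model.parameters()))  # True
```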