What is aten:to in the profiler

Alexander_Soare · April 27, 2021, 6:20pm

This is a profile of

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_train"):
        out = model.forward(data[0].to(DEVICE))
        loss = compute_loss(data[1:], out)
        loss.backward()

What’s up with the long aten:to block in the middle? At first I thought it was the operation of copying tensors from cpu to gpu. This still makes sense to me, but I don’t understand why the second block is so long.

As part of computing my loss I do a to(DEVICE) on my targets, but that’s supposed to be relatively very small.

criterion = nn.CrossEntropyLoss(ignore_index=tkn.stoi['<pad>'])

def compute_loss(targets, outputs):
        # targets shape [(B, padded_seq_length), (B, 1)]
        # outputs shape (B, S, N+1) S is sequence length of output, N is n_classes
        caption_lengths, sort_ind = targets[1].squeeze(1).sort(dim=0, descending=True)
        decode_lengths = (caption_lengths - 1).tolist()
        predictions = pack_padded_sequence(outputs[sort_ind], decode_lengths, batch_first=True).data
        targets = pack_padded_sequence(targets[0][sort_ind][:, 1:].to(DEVICE), decode_lengths, batch_first=True).data
        loss = criterion(predictions, targets)
        return loss

Then, I tried returning loss=0 without actually computing a loss like this:

criterion = nn.CrossEntropyLoss(ignore_index=tkn.stoi['<pad>'])

def compute_loss(targets, outputs):
        # targets shape [(B, padded_seq_length), (B, 1)]
        # outputs shape (B, S, N+1) S is sequence length of output, N is n_classes
        caption_lengths, sort_ind = targets[1].squeeze(1).sort(dim=0, descending=True)
        decode_lengths = (caption_lengths - 1).tolist()
        predictions = pack_padded_sequence(outputs[sort_ind], decode_lengths, batch_first=True).data
        targets = pack_padded_sequence(targets[0][sort_ind][:, 1:].to(DEVICE), decode_lengths, batch_first=True).data
        return nn.Parameter(torch.tensor([0]).float())

This actually gets rid of the second aten:to block

So why is the loss taking so long to compute? And was I right in saying that aten:to is the cpu > gpu operation? If so, what does this have to do with calculating the loss?

On a side note, I noticed that as I scale my model size, that second aten:to block takes longer to do (proportionally to the total time of the loop iteration). Why should it, if the loss is calculated independently of model size?

grudloff · February 10, 2022, 6:58pm

Hello @Alexander_Soare! Did do manage to catch what is the source of that long aten::to block? I am running into a similar issue.

Alexander_Soare · February 12, 2022, 11:36am

Nope sorry. This problem has been long forgotten.

lellowood · February 17, 2025, 2:54pm

I encountered the same issue, which was ultimately traced back to an operation involving copying a variable across different devices, specifically: tensor.item().

ptrblck · February 17, 2025, 2:57pm

In case other users struggle to isolate the place where an operation or kernel is called from: you can enable Python backtraces in Nsight Systems allowing you to hover over each kernel and showing the backtraces through PyTorch’s C++ backend as well as your Python scripts.