I have the output of a BERT model, H, with shape [batch_size, sequence_length, hidden_size]. I have another tensor, L, with shape [batch_size, 1], indicating the sequence length of each sample in the batch.

How would I compute the max of H over the sequence length (dim=1) without including the padding tokens (i.e. only including the positions less than that specified by L)?

I have a solution for taking the mean,

```
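import torch

# Position indices for every sample: [batch_size, sequence_length]
mask = torch.arange(H.shape[1]).repeat(H.shape[0], 1)
# 1.0 for real tokens (position < length), 0.0 for padding
mask = (mask < L).float()
# Also zero out position 0 of every sequence
mask[:, 0] = 0
masked_h = H * mask.unsqueeze(2)
mean_emb = masked_h.sum(dim=1) / L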
```

Your approach looks good, but it would calculate the mean, wouldn’t it?

Ya sorry, maybe that part of the post is a bit irrelevant. I meant to say I know how to calculate the mean, but calculating the max seems trickier and I have yet to figure it out.

You could use a similar approach using a mask.

I don’t know what values your padding tokens have, but you could set the mask values for all padding values to a large negative value (`-Inf` should work), multiply `H` with this mask, and apply `max()` on it.

Let me know if that works.

This is also the solution I thought of, but the problem is that a hidden vector is generally float32 and can contain negative values. Multiplying a negative value by -inf gives +inf. The padding tokens can be assumed to be 0 for simplicity.
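
One way around the sign problem might be to overwrite the padded positions with `-inf` instead of multiplying by it, so the sign of the real hidden values never matters. Below is a minimal sketch of that idea, assuming `H` and `L` have the shapes described above and that every sample has at least one valid position. The helper name `masked_max` and the toy tensors are made up for illustration.

```python
import torch

def masked_max(H, L):
    # H: [batch_size, sequence_length, hidden_size]
    # L: [batch_size, 1], number of valid (non-padding) positions per sample
    positions = torch.arange(H.shape[1], device=H.device).unsqueeze(0)  # [1, sequence_length]
    valid = positions < L.to(H.device)                                  # [batch_size, sequence_length]
    # Replace (not multiply) padded positions with -inf so they can never win the max
    filled = H.masked_fill(~valid.unsqueeze(-1), float("-inf"))
    return filled.max(dim=1).values                                     # [batch_size, hidden_size]

# Toy example (hypothetical values): all real entries are <= 2, padding is 0
H = torch.tensor([[[-1.0, 2.0], [-3.0, -4.0], [0.0, 0.0]],
                  [[-5.0, -6.0], [0.0, 0.0], [0.0, 0.0]]])
L = torch.tensor([[2], [1]])
out = masked_max(H, L)
print(out)
```

For the second sample the result stays negative, because the zero-valued padding is ignored rather than compared against. If a sample could have length 0, its row would come out as all `-inf` and would need separate handling.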