What is the correct way to implement a transformer-based sequence classifier with nn.TransformerEncoder?

I am trying to implement a transformer-based temporal sequence classifier.
At each time step we have an input tensor of shape 1x32, so if the classification is based on 25 consecutive time steps, the input shape is 1x32x25.

So I implemented a transformer encoder like this:

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=25, nhead=5)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
embed = transformer_encoder(torch.rand(1, 32, 25))  # output shape: (1, 32, 25)

Since the generated embedding also spans 25 time steps, how should I consume it? Should I simply grab the last time step of the embedding (i.e. embed[:, :, -1]) and pass it to a simple linear layer to generate the class scores? My understanding is that the vector at any single time step of embed should already contain the information from the whole sequence of 25 time steps, but I am not sure whether that is correct.
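
To be concrete, I mean something like this rough sketch, continuing from the snippet above (num_classes is just a placeholder value I made up):

num_classes = 4                       # hypothetical number of classes, for illustration only
classifier = nn.Linear(32, num_classes)

last_step = embed[:, :, -1]           # take the final time step, shape (1, 32)
logits = classifier(last_step)        # class scores, shape (1, num_classes)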

Or do I need a way to aggregate the embedding across all 25 time steps (e.g. by pooling) before classifying?
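
For example, mean pooling over the time dimension (again only a sketch; averaging is just one aggregation I am guessing at):

pooled = embed.mean(dim=-1)           # average over the 25 time steps, shape (1, 32)
logits = classifier(pooled)           # class scores, shape (1, num_classes)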