Adding a temporal dimension to nn.Transformer and nn.Conv2d

Why do nn.Transformer and nn.Conv2d not use a temporal dimension?
For example, suppose we want to encode the string

I ate an apple

Then, after applying nn.TransformerEncoder, we directly get an encoded representation of every word in a single pass.
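
To make this concrete, here is a minimal sketch of what I mean. The random tensor is only a stand-in for real word embeddings, and the sizes (d_model=16, nhead=4, and so on) are made-up values for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 4 tokens ("I", "ate", "an", "apple"), embedding size 16.
d_model = 16
tokens = torch.randn(1, 4, d_model)  # (batch, sequence, features), random stand-in embeddings

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=32, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout so the output is deterministic

with torch.no_grad():
    encoded = encoder(tokens)  # one call encodes all four positions together

print(encoded.shape)  # torch.Size([1, 4, 16])
```

All four positions come out of a single forward call, and without a mask each one has attended to the whole sentence.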
What if we want step-by-step encoding instead? For example, after the first time step only 'I' is encoded; after the second time step 'I', 'ate', and 'I ate' are encoded; after the third time step 'I', 'ate', 'I ate', 'an', and 'I ate an' are encoded; and so on.

The same question applies to nn.Conv2d. In the case of images, what if we want to encode one part of the image at the first time step, then a larger part that includes the first part at the second time step, and so on?
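
The closest thing I can think of for the text case is a causal attention mask, so here is a rough sketch of that idea. I am not sure it is the intended approach: it only gives the cumulative encodings ('I', 'I ate', 'I ate an', ...), not the separate encodings of each word. The random embeddings and layer sizes are again made-up values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, seq_len = 16, 4  # hypothetical sizes: 4 tokens, embedding size 16
tokens = torch.randn(1, seq_len, d_model)  # random stand-in for word embeddings

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=32, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout so the comparison below is deterministic

# Causal mask: -inf above the diagonal, so position t cannot attend to positions > t.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

with torch.no_grad():
    full = encoder(tokens, mask=mask)  # all "time steps" computed in one pass

    # Check: encoding only the first t tokens gives the same result
    # as the first t rows of the masked full pass.
    matches = []
    for t in range(1, seq_len + 1):
        prefix = encoder(tokens[:, :t], mask=mask[:t, :t])
        matches.append(torch.allclose(prefix, full[:, :t], atol=1e-5))

print(matches)
```

If this is right, row t of the masked output already plays the role of "the encoding after time step t", so no explicit loop over time steps is needed for the text case. I do not know what the equivalent would be for nn.Conv2d.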