Self-attention and positional encoding in DETR

The way I understand it, in a language model, positional encodings are added to the input token embeddings, and self-attention is then computed over those tokens.
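
For concreteness, here is the LM setup as I understand it: a minimal PyTorch sketch of the standard sinusoidal encoding from "Attention Is All You Need" being added to token embeddings before any attention happens. The shapes and names are just for illustration, not from any particular implementation:

```python
import torch

def sinusoidal_encoding(num_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# 5 tokens, model width 8 (toy sizes)
tokens = torch.randn(5, 8)
x = tokens + sinusoidal_encoding(5, 8)  # position info injected before self-attention
```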

But what about object detection with DETR? What are the inputs to the transformer, and what do their positions correspond to? I couldn't find these details in the paper. It seems like the inputs are the pixels of the image, but I can't quite verify that; my best guess is sketched below.
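
My current guess is that the CNN backbone's feature map (not raw pixels) gets flattened into a sequence, with one "token" per spatial cell, and a 2D positional encoding (half the channels varying along y, half along x) is added before the encoder. The shapes, variable names, and the exact encoding scheme in this sketch are my assumptions, not something I pulled from the paper:

```python
import torch

# Assumed shapes: a backbone (e.g. a ResNet) reduces the image to a
# C x H x W feature map, e.g. 256 x 25 x 34.
C, H, W = 256, 25, 34
feature_map = torch.randn(C, H, W)

# Flatten spatial dims: each of the H*W feature-map cells becomes one "token"
tokens = feature_map.flatten(1).permute(1, 0)  # (H*W, C)

def pe_1d(n, d):
    # Same sinusoidal scheme as above, applied along one axis
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    ang = pos / torch.pow(10000.0, i / d)
    out = torch.zeros(n, d)
    out[:, 0::2] = torch.sin(ang)
    out[:, 1::2] = torch.cos(ang)
    return out

# Guess: half the channels encode the row index, half the column index
pe_y = pe_1d(H, C // 2).unsqueeze(1).expand(H, W, C // 2)  # varies along y
pe_x = pe_1d(W, C // 2).unsqueeze(0).expand(H, W, C // 2)  # varies along x
pos = torch.cat([pe_y, pe_x], dim=-1).flatten(0, 1)        # (H*W, C)

encoder_input = tokens + pos  # what (I think) the encoder self-attends over
```

Is this roughly what DETR does, or does it attend over something else entirely?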