Question about DETR gradient backpropagation to object queries.

smartbarbarian · January 28, 2023, 4:19pm

Assume we have only 4 object queries and a batch size of one. Only the last decoder layer's output contributes to the loss, and negative labels are excluded from the loss. Suppose we set tgt_mask to
[[False, False, True, True],
[False, False, True, True],
[True, True, False, False],
[True, True, False, False],]
If only the first object query is matched to a ground-truth (positive) label in a given training step, the last two object queries should not be updated during backpropagation. But I found that all object queries were updated. Can anyone clarify this? Thank you so much.
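The setup described above can be checked in isolation with a small sketch. It does not use the DETR code base: it assumes a plain `nn.TransformerDecoder` with the object queries fed directly as `tgt`, and the sizes, the random `memory`, and the random regression `target` are all made up for illustration. It only inspects the gradient, before any optimizer step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, num_queries = 16, 4  # illustrative sizes, not DETR defaults

# Learnable object queries (assumption: used directly as the decoder input).
queries = nn.Embedding(num_queries, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=2, dim_feedforward=32, dropout=0.0)
decoder = nn.TransformerDecoder(layer, num_layers=2)

# Boolean mask from the post: True means "not allowed to attend".
tgt_mask = torch.tensor([
    [False, False, True, True],
    [False, False, True, True],
    [True, True, False, False],
    [True, True, False, False],
])

memory = torch.randn(10, 1, d_model)   # fake encoder output, shape (S, N, E)
tgt = queries.weight.unsqueeze(1)      # shape (num_queries, N=1, E)
out = decoder(tgt, memory, tgt_mask=tgt_mask)

# Loss on the first query's output only (the single "positive" query).
target = torch.randn(1, d_model)       # made-up regression target
loss = (out[0] - target).pow(2).mean()
loss.backward()

# Per-query gradient norms: rows 2 and 3 are expected to be zero,
# row 1 is not, because query 0 is allowed to attend to query 1.
grad_norms = queries.weight.grad.norm(dim=1)
print(grad_norms)
```

If the last two rows are zero here but the queries still change after `optimizer.step()`, the cause is in the optimizer (momentum, weight decay) rather than in the mask.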

Peter_Vennerstrom · January 29, 2023, 1:07am

DETR assigns all queries as either positive or negative samples and they are supervised with a classification loss. Only the queries assigned as positive samples are supervised with regression losses related to the bounding box prediction.

Each query will be updated by classification loss and the subset of queries assigned as positive samples (foreground) will also be updated by regression losses.
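A simplified sketch of this split, for illustration only: plain cross-entropy over all queries plus an L1 box loss over the positive ones. The shapes, the labels, and the matching result are made up, and this is not DETR's actual criterion.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_queries, num_classes = 4, 2  # class index 2 plays the role of "no object"
logits = torch.randn(num_queries, num_classes + 1, requires_grad=True)
boxes = torch.rand(num_queries, 4, requires_grad=True)

# Hypothetical matching result: query 0 is matched to an object of class 1,
# every other query is a negative ("no object") sample.
labels = torch.full((num_queries,), num_classes)
labels[0] = 1
positive = labels != num_classes
gt_boxes = torch.rand(int(positive.sum()), 4)

cls_loss = F.cross_entropy(logits, labels)         # all queries
box_loss = F.l1_loss(boxes[positive], gt_boxes)    # positive queries only
(cls_loss + box_loss).backward()

print(logits.grad.abs().sum(dim=1))  # non-zero for every query
print(boxes.grad.abs().sum(dim=1))   # non-zero for query 0 only
```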

smartbarbarian · January 29, 2023, 5:40am

Thank you so much for replying to the post! I have just updated it, because I couldn’t modify it while it was pending. In my experiment, I ignore negative samples and compute the loss only for positive samples, which I select with torch.gather. I also compute the loss only on the output of the last decoder layer. So in theory the other object queries should not be updated, but the result confuses me.

smartbarbarian · January 29, 2023, 7:22am

I found that it may be caused by the momentum in the optimizer. So is it correct that such tgt masks can prevent interactions among object queries?
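The momentum effect can be reproduced with a toy sketch. It assumes plain `torch.optim.SGD` with momentum and a stand-in parameter named `queries`; nothing here comes from the DETR training script. A row that received a gradient in an earlier step keeps moving in a later step where its gradient is exactly zero, because the momentum buffer is still non-zero.

```python
import torch

# Stand-in for 4 object queries of dimension 3.
queries = torch.nn.Parameter(torch.ones(4, 3))
opt = torch.optim.SGD([queries], lr=0.1, momentum=0.9)

# Step 1: every query receives a gradient (e.g. a different matching).
opt.zero_grad()
queries.sum().backward()
opt.step()

# Step 2: only query 0 receives a gradient; rows 1-3 have zero gradient.
before = queries.detach().clone()
opt.zero_grad()
queries[0].sum().backward()
opt.step()

# Rows 1-3 still change, driven purely by the momentum buffer.
delta = (queries.detach() - before).abs().sum(dim=1)
print(delta)
```

Adam-style optimizers behave similarly through their running averages, and weight decay also moves parameters whose gradient is zero, so checking `.grad` directly is a more reliable test than comparing weights before and after a step.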

Peter_Vennerstrom · January 31, 2023, 3:22pm

tgt_mask will prevent interactions between the masked positions of the tgt sequence, but all queries will still be supervised by the DETR loss functions by default. To prevent specific queries from being supervised, the loss functions need the same per-query masks.
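One way to apply such a per-query mask in the loss, as a minimal sketch: compute an unreduced loss and zero out the unsupervised queries before averaging. The `supervised` mask and the labels are hypothetical, and this is not how DETR's own criterion is written.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_queries, num_classes = 4, 3
logits = torch.randn(num_queries, num_classes, requires_grad=True)
targets = torch.tensor([1, 2, 2, 2])                      # made-up labels
supervised = torch.tensor([True, False, False, False])    # per-query loss mask

# Unreduced loss, one value per query, then mask and average.
per_query = F.cross_entropy(logits, targets, reduction="none")
loss = (per_query * supervised).sum() / supervised.sum().clamp(min=1)
loss.backward()

print(logits.grad.abs().sum(dim=1))  # zero for the unsupervised queries
```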

Does your system work in an autoregressive manner?

smartbarbarian · February 10, 2023, 9:37am

It is used for trajectory prediction. Many papers use multiple decoders and ensemble their results to achieve better performance. So I was wondering whether we could use tgt_mask to make a single large decoder run like multiple small decoders.
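A block-diagonal mask of this kind can be built for any number of groups. The helper below is a sketch; `make_group_mask` is a hypothetical name, and it assumes equally sized groups and the PyTorch boolean convention that True means "not allowed to attend".

```python
import torch

def make_group_mask(num_queries: int, num_groups: int) -> torch.Tensor:
    """Boolean tgt_mask that only allows attention within each query group."""
    if num_queries % num_groups != 0:
        raise ValueError("num_queries must be divisible by num_groups")
    group_size = num_queries // num_groups
    group_id = torch.arange(num_queries) // group_size
    # True where the two queries belong to different groups (blocked).
    return group_id[:, None] != group_id[None, :]

mask = make_group_mask(4, 2)
print(mask)
```

Note that this only isolates self-attention between groups. The groups still share the same decoder weights and cross-attend to the same memory, so the result is closer to one decoder run on several independent query sets than to separately parameterized decoders.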