Object Detection with Transformers - Pix2Seq implementation in PyTorch

Moein_Shariatnia · August 30, 2022, 12:18pm

I’ve implemented the “Pix2seq: A Language Modeling Framework for Object Detection” paper in PyTorch and written an in-depth tutorial on it.

Here’s the link to the blog on Towards AI.

You can find the whole project on my GitHub

The code and tutorial are also available as a Colab Notebook and a Kaggle Notebook.

I hope you like it!

Etienne_Perot · January 9, 2023, 1:57am

Hi! I read the code, and I think the noisy inputs are missing. I also suspect that evaluation should be done autoregressively; otherwise you are always feeding the ground truth of previous timesteps instead of your own predictions.

Moein_Shariatnia · January 9, 2023, 9:39am

Hey,

Yes, to keep my implementation and tutorial simple, I skipped the augmentation technique they used, which is one reason my results fall short of the performance reported in the paper.

Regarding the evaluation, I am using “teacher forcing” in my implementation, and the mask in the decoder self-attention layers ensures the model does not peek at future tokens. You can refer to the issues of the repo, where I explain this in more depth.
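For readers following along, here is a minimal, framework-free sketch of what teacher forcing with a causal mask amounts to. It is not taken from the repo: the token ids, the `BOS`/`EOS` values, and the helper names `causal_mask` and `teacher_forcing_pairs` are all assumptions made up for illustration.

```python
# Hypothetical special token ids, chosen only for this illustration.
BOS, EOS = 0, 1


def causal_mask(n):
    """Return an n x n mask where True means "attention is blocked".

    Position i may attend to positions 0..i but not to any j > i.
    """
    return [[j > i for j in range(n)] for i in range(n)]


def teacher_forcing_pairs(tokens):
    """Split a ground-truth sequence into decoder inputs and shifted targets.

    At step t the decoder receives the ground-truth token t and is trained
    to predict the ground-truth token t + 1.
    """
    return tokens[:-1], tokens[1:]


# A made-up target sequence: BOS, five arbitrary tokens, EOS.
sequence = [BOS, 57, 12, 88, 40, 7, EOS]
decoder_input, targets = teacher_forcing_pairs(sequence)
mask = causal_mask(len(decoder_input))
```

The mask prevents each position from seeing later tokens, but every decoder input here is still a ground-truth token, so this setup requires the full target sequence to be known in advance.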

Etienne_Perot · January 10, 2023, 7:42pm

I understand the ‘teacher forcing’ technique. But imagine a true evaluation setting where you do not have the ground truth: what do you feed the network? For a true evaluation we need the autoregressive setup; otherwise you are using the ground truth of previous timesteps as the input sequence instead of your own predictions (whether obtained by argmax, nucleus sampling, or any other decoding technique).
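To make the autoregressive setup concrete, here is a minimal sketch of greedy (argmax) decoding in plain Python. It is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained decoder, the `BOS`/`EOS` ids and the 5-token vocabulary are invented, and none of this comes from the repo or the paper's code.

```python
# Hypothetical special token ids, chosen only for this illustration.
BOS, EOS = 0, 1


def greedy_decode(next_token_scores, max_len=10):
    """Generate tokens one at a time, feeding back the model's own outputs.

    next_token_scores: callable that maps the prefix generated so far to a
    list of scores, one per vocabulary entry.
    """
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        # Argmax over the vocabulary; sampling could be swapped in here.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


def toy_model(prefix):
    """Hypothetical stand-in for a decoder with a 5-token vocabulary.

    It favours EOS once the prefix holds 4 tokens, and otherwise a token
    determined by the prefix length.
    """
    vocab_size = 5
    target = EOS if len(prefix) >= 4 else 2 + (len(prefix) % 3)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_decode(toy_model)
```

Unlike teacher forcing, no ground-truth token is ever passed in: each step sees only `BOS` plus whatever the model produced earlier, so early mistakes propagate into later predictions.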