Paper Reading (ECCV 2020)—DETR: End-to-End Object Detection with Transformers

Mengliu Zhao
3 min read · Sep 5, 2023


The last paper I read about object detection was Faster R-CNN, which was published at NeurIPS 2015. The world has changed so much since the Transformer model came out! So today I’m explaining an ECCV 2020 paper about the well-known DETR model: End-to-End Object Detection with Transformers.

Image source: https://pxhere.com/en/photo/752901

First, let’s recap what object detection and Faster R-CNN are.

Object detection is the computer vision task of simultaneously labelling and localizing objects of interest. A typical output is a vector of values containing the class label and the coordinates of the bounding box around the corresponding object. In real applications, the model is expected to predict multiple objects per image. Classical machine learning-based algorithms use a combination of feature extractors such as HOG, SIFT, or LBP, together with a classifier such as an SVM, to determine whether each sliding window contains an object.
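To make the classical pipeline concrete, here is a minimal sliding-window sketch in Python, assuming scikit-image for HOG features and a scikit-learn LinearSVC that has already been trained on positive/negative crops; the window size, stride, and score threshold are illustrative, not from any particular paper.

```python
from skimage.feature import hog
from sklearn.svm import LinearSVC

def sliding_window_detect(image, clf: LinearSVC, window=(64, 128), stride=16):
    """Classic detection loop: slide a fixed-size window over a grayscale
    image, extract HOG features from each crop, and score the crop with a
    trained linear SVM. Returns (x, y, w, h, score) candidates."""
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - window[1] + 1, stride):
        for x in range(0, W - window[0] + 1, stride):
            crop = image[y:y + window[1], x:x + window[0]]
            feat = hog(crop, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            score = clf.decision_function(feat.reshape(1, -1))[0]
            if score > 0:  # positive SVM margin => candidate object
                detections.append((x, y, window[0], window[1], score))
    return detections
```

In practice this loop is repeated over an image pyramid to handle objects at multiple scales, which is part of why the classical pipeline is slow.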

Faster R-CNN extends this idea further. It uses a deep neural network as the feature extractor (VGG-16 in the original paper; ResNet backbones such as ResNet-101 became common in later implementations) to generate a feature map. Instead of a naive sliding window, it introduces a region proposal network (RPN) that proposes multiple anchor boxes around each feature-map location. Since these anchor boxes/regions of interest (RoIs) have different sizes, an RoI pooling layer is needed to align the feature maps with the RoIs. The aligned RoI features are then fed into dense layers for class prediction and bounding box regression. To suppress duplicate bounding boxes, a greedy post-processing algorithm called non-maximum suppression (NMS) is applied, as sketched below.
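Since NMS comes up again later in the discussion, here is a minimal NumPy sketch of the greedy algorithm; the corner-coordinate box format and the 0.5 IoU threshold are the usual conventions, not anything specific to Faster R-CNN.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and discard remaining boxes whose IoU with it exceeds iou_thresh.
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop suppressed boxes
    return keep
```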

What is wrong with Faster R-CNN? Its many hand-designed steps, including anchor box generation and NMS, make the pipeline difficult to tune and inefficient.

Why is DETR different, then? The similar part is that DETR also uses a CNN backbone as the feature extractor. But instead of sliding windows or anchor boxes over pixels, DETR derives the detections directly by set prediction. Specifically, the Transformer encoder takes in a sequence of image features (e.g., an image of size 512×512×3 is compressed by the backbone, at stride 32, into a sequence of (512/32)×(512/32) = 256 feature vectors), and the Transformer decoder converts a fixed set of N learned object queries (N = 100 in the paper) into N output embeddings, each of which a feed-forward (FFN) head then decodes into a class label and bounding box coordinates. This output length N is fixed and doesn’t need to equal the input sequence length.

DETR pipeline. Image from the original paper.
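The paper’s appendix includes a famously short PyTorch implementation of this pipeline; the sketch below follows the same structure, assuming torchvision for the ResNet-50 backbone. The hidden size (256) and query count (100) match the paper, while the learned 2D positional encodings are a simplification of the paper’s sine embeddings.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Simplified DETR sketch: CNN backbone -> Transformer -> per-query heads."""
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # Backbone: keep everything up to the final stride-32 conv feature map.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # project channels to hidden_dim
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # One learned embedding per object query (N = num_queries output slots).
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # Learned 2D positional encodings (up to a 50x50 feature map).
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # Prediction heads: class logits (+1 for "no object") and box coordinates.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H/32, W/32)
        B, C, H, W = h.shape
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)      # (H*W, 1, hidden_dim)
        src = pos + h.flatten(2).permute(2, 0, 1)  # encoder input: H*W tokens
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # (N, B, hidden_dim)
        out = self.transformer(src, tgt)           # (N, B, hidden_dim)
        # Boxes are predicted as normalized (cx, cy, w, h) via a sigmoid.
        return self.class_head(out), self.bbox_head(out).sigmoid()
```

Note how the decoder output length is set by the number of object queries, not by the image size, which is exactly the set-prediction view described above.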

However, unlike in NLP, where the predicted token sequence is order-sensitive, the predicted set of bounding boxes is expected to be permutation-invariant with respect to the ground truth, and nothing in the Transformer encoder/decoder architecture guarantees this.

So DETR further proposes the Hungarian loss at training time. The Hungarian algorithm is a bipartite matching algorithm that minimizes the total cost of assigning pairs of nodes across the two sides of a bipartite graph; here, it matches each prediction to at most one ground-truth object. The pairwise cost combines a class term based on the predicted class probabilities with a box loss (a combination of L1 loss and generalized IoU loss) on the bounding box coordinates.
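Here is a minimal sketch of the matching step, assuming SciPy’s linear_sum_assignment as the Hungarian solver; the cost weights are illustrative, and the generalized IoU term is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """Toy bipartite matching between N predictions and M ground-truth objects.
    pred_probs: (N, num_classes + 1) softmax outputs;
    pred_boxes: (N, 4) and gt_boxes: (M, 4) in normalized (cx, cy, w, h);
    gt_labels: (M,) integer class indices.
    Returns (prediction index, ground-truth index) pairs."""
    # Classification cost: negative predicted probability of each GT class.
    c_class = -pred_probs[:, gt_labels]                            # (N, M)
    # Box cost: L1 distance between each predicted and GT box.
    c_bbox = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # (N, M)
    cost = cost_class * c_class + cost_bbox * c_bbox
    rows, cols = linear_sum_assignment(cost)  # minimize total assignment cost
    return list(zip(rows, cols))
```

Predictions left unmatched after this step are supervised toward the special “no object” class, which is how DETR avoids needing NMS at inference time.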

For comparison, the SOTA leaderboards probably make more sense than the tables in the original paper. We can see that variants of DETR rank #1 and #2 on the COCO test-dev and CrowdHuman leaderboards.

For implementation, the source code of DETR is on GitHub. HuggingFace also provides an API for the DETR model.
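A minimal inference sketch using the HuggingFace transformers API, assuming the facebook/detr-resnet-50 checkpoint and a local example.jpg; the 0.9 confidence threshold is illustrative.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Load the pretrained DETR checkpoint released by Facebook AI.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded (label, score, box) triples.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes)[0]
for label, score, box in zip(results["labels"], results["scores"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```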
