[paper review] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

date

Aug 29, 2022

Object detection concepts

Object detection approaches: comparing 2-stage and 1-stage detectors

2-Stage Detector

Solves two problems in sequence: (1) localization, i.e. finding where the object is, and then (2) classification.

1-Stage Detector

Solves localization and classification in a single pass.

Object detection approaches: 2-stage examples

e.g. R-CNN, Fast R-CNN, Faster R-CNN

R-CNN

Uses a CNN to classify each region.

The framework as a whole cannot be trained end-to-end, so it is hard to reach a globally optimal solution.

Fast R-CNN

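Feature extraction, RoI pooling, region classification, and bounding box regression can all be trained together end-to-end.

The first step, selective search, runs on the CPU and is therefore slow.

Selective search: a procedure that measures the similarity between adjacent regions and progressively merges them into larger regions.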
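As a rough illustration of that merging loop, here is a toy sketch on a 1-D strip of regions. It is not the real selective search, which works on 2-D image segmentations and combines several similarity measures; the `Region` tuple, the mean-intensity similarity, and the name `greedy_merge` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical toy region: (start, end, mean_intensity) on a 1-D strip.
Region = Tuple[int, int, float]


def merge(a: Region, b: Region) -> Region:
    """Merge two adjacent regions, averaging intensity weighted by length."""
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    mean = (a[2] * len_a + b[2] * len_b) / (len_a + len_b)
    return (a[0], b[1], mean)


def greedy_merge(regions: List[Region]) -> List[Region]:
    """Repeatedly merge the most similar adjacent pair.

    Every region that ever exists during merging is kept as a proposal.
    """
    current = sorted(regions)
    proposals = list(current)
    while len(current) > 1:
        # Similarity is just closeness of mean intensity (an assumption).
        i = min(range(len(current) - 1),
                key=lambda j: abs(current[j][2] - current[j + 1][2]))
        merged = merge(current[i], current[i + 1])
        current[i:i + 2] = [merged]
        proposals.append(merged)
    return proposals


proposals = greedy_merge([(0, 2, 0.1), (2, 4, 0.15), (4, 8, 0.9)])
```

The loop is inherently sequential, which is one way to see why this step is hard to speed up on a GPU.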

Faster R-CNN

Proposes the RPN, which allows the whole framework to be trained end-to-end.

In the region classification stage, each feature vector is forwarded through the FC layers individually.

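Evaluation metrics

Average precision

Precision and recall generally trade off against each other: raising one tends to lower the other.
Performance is therefore evaluated with average precision, which summarizes the precision-recall curve in a single number.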
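A minimal sketch of how average precision can be computed from a ranked list of detections, assuming each detection has already been marked as a true or false positive. The function name `average_precision` and the all-point interpolation are choices made for this example; PASCAL VOC 2007 used an 11-point variant instead.

```python
from typing import List


def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    is_tp: detections sorted by descending confidence, True = true positive.
    num_gt: number of ground-truth objects for this class (must be > 0).
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)

    # Make precision monotonically non-increasing, scanning right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ap = average_precision([True, False, True], num_gt=2)
```

mAP is then the mean of this value over all object classes.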

Intersection over Union (IoU)

IoU: the area of the intersection of two bounding boxes divided by the area of their union.
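A small sketch of the computation, assuming boxes are given in corner format `(x1, y1, x2, y2)`; the `Box` alias and the function name `iou` are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```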

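Evaluation example: mAP@0.5 means a prediction counts as correct when its IoU with the ground truth is at least 50%.
NMS example: among boxes of the same class, when the IoU is at least 50%, the box with the lower confidence is removed.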

NMS (Non-Maximum Suppression)

In object detection, each instance should end up with exactly one bounding box. When several bounding boxes overlap on the same instance, a method is needed to reduce them to a single box.
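A minimal greedy NMS sketch for the boxes of one class, assuming corner-format boxes and a 0.5 IoU threshold. The names `nms` and `_iou` are illustrative and not taken from any particular library.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def _iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.5) -> List[int]:
    """Greedy NMS for one class; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(_iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.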

Faster R-CNN (NIPS 2015)

https://arxiv.org/abs/1506.01497

Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

The region proposal step, previously the bottleneck, now runs on the GPU.

The whole architecture can be trained end-to-end.

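Region Proposal Networks (RPN)

Given a feature map, the RPN predicts the locations that are likely to contain an object.

Uses k anchor boxes at each location.
Slides a window over the feature map and performs regression and classification at each position.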
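To make the anchor idea concrete, here is a sketch that places k anchors at every feature-map position. The default scales (128, 256, 512), aspect ratios (1:2, 1:1, 2:1), and stride of 16 are the paper's VGG-16 defaults rather than values stated in these notes, and the name `generate_anchors` is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def generate_anchors(feat_h: int, feat_w: int, stride: int = 16,
                     scales: Sequence[float] = (128, 256, 512),
                     ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Place k = len(scales) * len(ratios) anchors at every feature-map cell."""
    anchors: List[Box] = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this cell in image coordinates.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale**2 while changing the shape.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)  # 2 * 3 positions, 9 anchors each
```

For each of these anchors the RPN outputs an objectness score and four box offsets.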

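Network architecture

Removes selective search from the original Fast R-CNN architecture and obtains RoIs from the RPN instead.

The RPN computes roughly 800 RoIs, versus roughly 2000 for selective search, and still reaches higher accuracy; the abstract quotes results with only 300 proposals per image.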

1. Feed the image into a pre-trained CNN and extract a feature map.

2. Pass the feature map to the RPN and obtain region proposals.

3. Apply RoI pooling to the region proposals and the feature map from step 1 to get fixed-size feature maps.

4. Feed the fixed-size feature maps into Fast R-CNN to perform classification and bounding box regression.
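The RoI pooling step is what turns proposals of arbitrary size into fixed-size inputs. Below is a simplified sketch on a single-channel feature map stored as nested lists, assuming the RoI is already expressed in integer feature-map cells and has at least one cell per side. The name `roi_max_pool` and the 2x2 output are choices for this example (Fast R-CNN with VGG-16 pools to 7x7).

```python
from typing import List, Tuple


def roi_max_pool(feature: List[List[float]], roi: Tuple[int, int, int, int],
                 out_size: int = 2) -> List[List[float]]:
    """Max-pool one RoI into an out_size x out_size grid.

    roi is (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Integer bin edges; every bin covers at least one cell.
        ys = y1 + (i * roi_h) // out_size
        ye = max(ys + 1, y1 + ((i + 1) * roi_h) // out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * roi_w) // out_size
            xe = max(xs + 1, x1 + ((j + 1) * roi_w) // out_size)
            row.append(max(feature[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
whole = roi_max_pool(feature, (0, 0, 4, 4))  # 4x4 RoI -> 2x2
part = roi_max_pool(feature, (0, 0, 3, 3))   # 3x3 RoI -> still 2x2
```

Whatever the RoI size, the output grid has the same shape, which is what the FC layers downstream require.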

Feature extraction by pre-trained VGG-16

Generate Anchors by Anchor generation layer

Class scores and Bounding box regressor by RPN

Region proposal by Proposal layer

Select anchors for training RPN by Anchor target layer

Select anchors for training Fast R-CNN by Proposal Target layer

Max pooling by RoI pooling

Train Fast R-CNN by Multi-task loss
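As an example of one of these layers, the anchor target step can be sketched as a labeling rule. The paper marks an anchor positive if its IoU with some ground-truth box exceeds 0.7, or if it is the best-matching anchor for a ground-truth box, and negative if its IoU is below 0.3 for every ground-truth box; the rest are ignored. The function name `label_anchors` and the 1 / 0 / -1 encoding are assumptions for this sketch.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def _iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors: List[Box], gt_boxes: List[Box],
                  pos_thresh: float = 0.7, neg_thresh: float = 0.3) -> List[int]:
    """Return 1 (object), 0 (background) or -1 (ignored) for each anchor."""
    if not anchors:
        return []
    ious = [[_iou(a, g) for g in gt_boxes] for a in anchors]

    labels = []
    for row in ious:
        best = max(row, default=0.0)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)

    # The best-matching anchor of each ground-truth box is also positive,
    # so every object gets at least one positive anchor.
    for g in range(len(gt_boxes)):
        best_anchor = max(range(len(anchors)), key=lambda a: ious[a][g])
        if ious[best_anchor][g] > 0:
            labels[best_anchor] = 1
    return labels


gt = [(0, 0, 10, 10)]
labels = label_anchors([(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)], gt)
```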

Loss function

The RPN performs both classification and bounding box regression, so the loss function combines the losses from the two tasks.
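For reference, the combined RPN loss in the paper's notation is reproduced below. Here $p_i^*$ is the ground-truth label of anchor $i$ (1 for an object, 0 otherwise), $t_i^*$ is the ground-truth box target, $L_{cls}$ is a log loss, $L_{reg}$ is a smooth L1 loss, and $\lambda$ balances the two terms.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

The factor $p_i^*$ in the second term means the regression loss only counts for positive anchors.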

Here, i is the index of a single anchor.

p_i = the probability, obtained from classification, that anchor i is an object

t_i = the vector of box adjustment values obtained from bounding box regression
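A rough sketch of how these quantities fit together, using the paper's box parameterization (center offsets scaled by the anchor size, log-scaled width and height) and a smooth L1 regression term. The names `encode`, `smooth_l1`, and `rpn_loss` are illustrative; normalizing both terms by the number of anchors and defaulting `lam` to 1 are simplifications for this example (the paper uses separate normalizers N_cls and N_reg with λ = 10).

```python
import math
from typing import Sequence, Tuple

CBox = Tuple[float, float, float, float]  # (cx, cy, w, h), center format


def encode(box: CBox, anchor: CBox) -> Tuple[float, float, float, float]:
    """Regression targets t = (tx, ty, tw, th) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def rpn_loss(p: Sequence[float], t: Sequence[Sequence[float]],
             p_star: Sequence[int], t_star: Sequence[Sequence[float]],
             lam: float = 1.0) -> float:
    """Classification log loss plus regression loss on positive anchors.

    p[i]: predicted object probability of anchor i; p_star[i]: 1 or 0.
    t[i], t_star[i]: predicted and target 4-vectors of anchor i.
    Simplification: both terms are divided by the number of anchors.
    """
    eps = 1e-12  # avoids log(0)
    n = len(p)
    cls = sum(-(ps * math.log(pi + eps) + (1 - ps) * math.log(1 - pi + eps))
              for pi, ps in zip(p, p_star)) / n
    # p_star gates the regression term, so negatives contribute nothing.
    reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
              for ti, ts, ps in zip(t, t_star, p_star)) / n
    return cls + lam * reg


target = encode(box=(12, 10, 20, 20), anchor=(10, 10, 20, 20))
loss = rpn_loss(p=[0.5], t=[(2, 0.5, 0, 0)], p_star=[1], t_star=[(0, 0, 0, 0)])
```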

Conclusion

Brings the region proposal step, previously computed with selective search, inside the neural network (presenting an end-to-end object detection model).

Runs at 5 fps with all steps included.

Reaches 78.8% mAP on PASCAL VOC.