Abstract

Visual Document Understanding that drops OCR and uses only a Transformer architecture, to overcome the three limitations of OCR listed below, such as its high computational cost and inflexibility.

Pre-training uses a cross-entropy loss.

In particular, the model targets invoices (not "probably" but almost certainly for Naver receipt reviews...)

Introduction

(b) Text detection: locating the text regions

(c)~(d) Text recognition: recognizing the characters in the detected regions

Unsurprisingly, the claim is that the proposed method performs much better.

In pre-training: Donut learns how to read the texts

predicting the next words by conditioning jointly on the image and previous text contexts
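
To make the pre-training objective concrete, here is a minimal sketch of a next-token cross-entropy loss in plain Python. It is not the paper's implementation: the function name `next_token_cross_entropy` and the toy logits are assumptions for illustration, and the logits stand in for what a decoder would produce when conditioned on the image features and the previous tokens.

```python
import math

def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the ground-truth next tokens.

    logits:  list of T score vectors (each of vocabulary size V). Row t is
             assumed to come from a decoder that has seen the image features
             and the ground-truth tokens before position t (teacher forcing).
    targets: list of T ground-truth token ids.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # -log softmax(scores)[target]
        total += log_z - scores[target]
    return total / len(targets)

# Toy example (hypothetical values): uniform scores over a 4-token vocabulary,
# so every target has probability 1/4 and the loss equals log(4).
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_cross_entropy(uniform_logits, [1, 0, 3])

# A confident, correct prediction yields a loss close to zero.
confident_loss = next_token_cross_entropy([[10.0, 0.0, 0.0, 0.0]], [0])
```

Minimizing this quantity over document images and their text is what "learning to read" amounts to in this sketch: the model is rewarded for assigning high probability to each actual next token.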

In fine-tuning: Donut learns how to understand the whole document