Unifying Vision, Text, and Layout for Universal Document Processing

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to mo...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) s. 19254 - 19264
Hlavní autoři: Tang, Zineng, Yang, Ziyi, Wang, Guoxin, Fang, Yuwei, Liu, Yang, Zhu, Chenguang, Zeng, Michael, Zhang, Cha, Bansal, Mohit
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 01.06.2023
Témata:
ISSN:1063-6919
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and web-sites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark. 1 1 Code and models: https://github.com/microsoft/i-Code/tree/main/i-Code-Doc
ISSN:1063-6919
DOI:10.1109/CVPR52729.2023.01845