| _ | textvqa | _ |
|---|---|---|
| Beyond OCR + VQA: Towards end-to-end reading and reasoning for robust and accurate | textvqa | |
| Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for | textvqa | |
| A multimodal attention fusion network with a dynamic vocabulary for | textvqa | |
| Separate, Locate, and Align: Determine Context Relation of Scene Text From Multiple Perspectives in | textvqa | |
| Spatially Aware Multimodal Transformers for | textvqa | |
| Structured Multimodal Attentions for | textvqa | |