PAPERS

ViLBERT

February 21, 2020

Problem

Currently the dominant strategy for vision-and-language tasks is to pretrain models for the two domains separately and then attempt learn visual grounding by training them together
However, this approach does not accurately capture the relations between visual and linguistic features and how they effect one another.
Need a model that is pre-trained for visual grounding.

Solution:

Paper presents a joint model for learning task-agnostic visual grounding from paired visiolinguistic data which is called Vision & Language BERT.
Uses separate streams for vision and language processing, but these streams communicate using co-attentional transformer layers. Provides interaction between the different streams at various depths.

Extending BERT to jointly represent Images and Text:

ViLBERT consists of two parallel BERT-style models operating over image regions and text segments.
Each stream is a series of transformer blocks (TRM) and co-attentional transformer layers (Co-TRM) which is introduced to enable information exchange between the two modalities.
Given a text input ($w_0,...,w_T$) and region features ($v_1,...,vT$) the model outputs final representions $(h{v0},..,h_{vT})$ and $(h_{w0},..,h_{wT})$. The interaction between the streams is limited to only certain layers.

Co-Attentional Transformer Layers:

Given intermediate visual and linguistic representations ${H_V}^{(i)}$ and ${H_W}^{(j)}$, the model computes the query, key and value matrices as in a standard transformer block.
However, the keys and values from each modality is passed as input to the other modality's multi-headed attention block. This causes the attention block to produce features for each modality conditioned on the other

Training:

Two pre-training tasks are considered: masked multi-modal modelling and multi-modal alignment prediction.
The masked multi-modal task is similar to what is used in standard BERT. Masked image regions have image features zeroed out 90% of the time.
In multi--modal alignment task, the model is present an image-text pair and must predict whether the image and text are aligned correctly.

Vision-and-Language Transfer Tasks:

Results

ViLBERT does better than single stream models on downstream tasks.
Pre-training ViLBERT using proxy tasks improves performance
Finetuning from ViLBERT, transfer task performance exceeds state-of-the-art task specific models for the 4 tasks