# Metadata
Source URL:: https://arxiv.org/abs/2112.05253
Topics:: #ai
---
# MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Large-scale pretraining is fast becoming the norm in Vision-Language (VL)
modeling. However, prevailing VL approaches are limited by the requirement for
labeled data and the use of complex multi-step pretraining objectives. We
present MAGMA - a simple method for augmenting generative language models with
additional modalities using adapter-based finetuning. Building on Frozen, we
train a series of VL models that autoregressively generate text from arbitrary
combinations of visual and textual input. The pretraining is entirely
end-to-end using a single language modeling objective, simplifying optimization
compared to previous approaches. Importantly, the language model weights remain
unchanged during training, allowing for transfer of encyclopedic knowledge and
in-context learning abilities from language pretraining. MAGMA outperforms
Frozen on open-ended generative tasks, achieving state-of-the-art results on
the OKVQA benchmark and competitive results on a range of other popular VL
benchmarks, while pretraining on only 0.2% of the samples used to train
SimVLM.
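
As a rough illustration of the setup the abstract describes (not the authors' code), the sketch below shows the core idea under some assumed simplifications: a trainable visual prefix maps image features to pseudo-token embeddings that are prepended to the text embeddings, small bottleneck adapters are inserted after each block of a frozen decoder-only LM, and everything is trained end-to-end with a single next-token prediction loss. `TinyCausalLM`-style components, `VisualPrefix`, `Adapter`, and `MultimodalLM` are all illustrative names, and the tiny transformer stands in for the large pretrained language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, ReLU, up-project, residual connection."""
    def __init__(self, d_model, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))


class VisualPrefix(nn.Module):
    """Projects image features into a short sequence of pseudo-token embeddings."""
    def __init__(self, feat_dim, d_model, n_prefix=4):
        super().__init__()
        self.n_prefix = n_prefix
        self.proj = nn.Linear(feat_dim, n_prefix * d_model)

    def forward(self, img_feats):                      # (B, feat_dim)
        return self.proj(img_feats).view(img_feats.size(0), self.n_prefix, -1)


class MultimodalLM(nn.Module):
    def __init__(self, vocab=1000, d_model=128, n_layers=2, img_feat_dim=512):
        super().__init__()
        # Stand-in for the frozen pretrained LM: embedding, blocks, output head.
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab)
        for p in [*self.embed.parameters(), *self.blocks.parameters(),
                  *self.head.parameters()]:
            p.requires_grad = False                    # LM weights stay unchanged

        # Trainable additions: visual prefix + one adapter per block.
        self.prefix = VisualPrefix(img_feat_dim, d_model)
        self.adapters = nn.ModuleList(Adapter(d_model) for _ in self.blocks)

    def forward(self, img_feats, text_ids):
        vis = self.prefix(img_feats)                   # (B, P, D) image pseudo-tokens
        txt = self.embed(text_ids)                     # (B, T, D) text embeddings
        h = torch.cat([vis, txt], dim=1)               # image tokens come first
        L = h.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=h.device), 1)
        for block, adapter in zip(self.blocks, self.adapters):
            h = adapter(block(h, src_mask=causal))
        logits = self.head(h)
        # Single language-modeling objective, computed only on text positions.
        P = vis.size(1)
        pred = logits[:, P:-1]                         # predicts text_ids[:, 1:]
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               text_ids[:, 1:].reshape(-1))


# Toy usage with random image features and token ids.
model = MultimodalLM()
loss = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 8)))
loss.backward()  # gradients reach only the visual prefix and the adapters
```

Because the frozen blocks are still differentiable, the single cross-entropy loss trains the prefix and adapters end-to-end while leaving the language model's weights, and hence its pretrained knowledge and in-context learning behaviour, untouched.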