# Metadata

Source URL:: https://arxiv.org/abs/2208.06366

---

# BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most methods still operate on low-level...

## Highlights

> [!quote]+ Updated on 240822_195914
>
> most methods still operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this study, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we introduce vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes.
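
The highlight describes discretizing a continuous semantic space into compact codes. Below is a minimal sketch of that vector-quantization step: continuous patch features are mapped to discrete codebook indices by nearest-neighbor lookup on normalized embeddings, which then serve as the reconstruction targets for masked prediction. All names, shapes, and sizes here are hypothetical illustrations, not the authors' implementation; the distillation of the tokenizer toward a semantic teacher described in the paper is omitted.

```python
# Illustrative sketch only: map continuous patch features to discrete
# "visual tokens" via nearest-neighbor codebook lookup (cosine similarity).
# Names and sizes are hypothetical, not taken from the BEiT v2 codebase.
import torch
import torch.nn.functional as F


def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Assign each patch feature to its nearest codebook entry."""
    f = F.normalize(features, dim=-1)   # (batch, num_patches, dim)
    c = F.normalize(codebook, dim=-1)   # (codebook_size, dim)
    sim = f @ c.t()                     # cosine similarity to every code
    indices = sim.argmax(dim=-1)        # discrete visual tokens
    quantized = codebook[indices]       # code embeddings for the decoder
    return indices, quantized


# Example: 64 patches per image, 32-dim features, 8192 codes (illustrative sizes).
features = torch.randn(2, 64, 32)
codebook = torch.randn(8192, 32)
tokens, quantized = quantize(features, codebook)
print(tokens.shape, quantized.shape)  # torch.Size([2, 64]) torch.Size([2, 64, 32])
```

Per the abstract, what makes these codes "semantic-rich" is how the codebook is trained: the quantized embeddings are decoded and distilled toward a teacher's features rather than raw pixels, so the discrete tokens capture high-level semantics instead of low-level image statistics.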