antoyang.github.io - Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

# Zero-Shot Video Question Answering via Frozen Bidirectional Language Models - antoyang.github.io

![rw-book-cover|200x400](https://readwise-assets.s3.amazonaws.com/static/images/article3.5c705a01b476.png)

## Metadata
- Author: **antoyang.github.io**
- Full Title: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
- Category: #articles
- Tags: #ai
- URL: https://antoyang.github.io/frozenbilm.html

## Highlights
- In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models will be made publicly available.
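
Step (iii) in the highlight above, answering a question by filling in a masked token, can be illustrated with a small sketch. This is not the authors' code: the prompt template, the function names (`build_prompt`, `answer_probabilities`, `predict_answer`), and the hand-written logits are all assumptions made for illustration. In the real system the logits at the mask position would come from the frozen BiLM conditioned on the video features; here a plain dictionary stands in for that output.

```python
import math

MASK = "[MASK]"  # placeholder mask token; the real token depends on the tokenizer


def build_prompt(question, subtitles=""):
    """Turn a question into a cloze-style prompt whose masked slot is the answer.

    The template is an illustrative assumption, not taken from the highlight.
    """
    prompt = f"Question: {question.strip()} Answer: {MASK}."
    if subtitles:
        prompt += f" Subtitles: {subtitles.strip()}"
    return prompt


def answer_probabilities(mask_logits, answer_vocab):
    """Softmax over candidate answers only.

    mask_logits: dict mapping token -> logit at the mask position
                 (a stand-in for the language model's output).
    answer_vocab: list of candidate answer tokens.
    """
    # Candidates the model never scored get a very low logit.
    scores = [mask_logits.get(answer, -1e9) for answer in answer_vocab]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return {answer: e / total for answer, e in zip(answer_vocab, exps)}


def predict_answer(mask_logits, answer_vocab):
    """Return the candidate answer with the highest probability at the mask."""
    probs = answer_probabilities(mask_logits, answer_vocab)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Made-up logits for illustration; not real model outputs.
    toy_logits = {"guitar": 2.0, "dog": 0.5, "car": -1.0}
    print(build_prompt("What is the man holding?"))
    print(predict_answer(toy_logits, ["guitar", "dog", "car"]))
```

Restricting the softmax to a candidate answer vocabulary keeps the prediction within plausible answers; whether and how FrozenBiLM restricts its vocabulary is not stated in the highlight.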