Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

# Metadata
Source URL:: https://antoyang.github.io/frozenbilm.html
Topics:: #ai

---

# Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

## Highlights

> [!quote]+ Updated on 061022_105714
>
> In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules,
> (ii) we train such modules using Web-scraped multi-modal data, and finally
> (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question.
> Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA.
> It also demonstrates competitive performance in the few-shot and fully-supervised setting.
> Our code and models will be made publicly available.
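
Step (iii) of the quote above, answering a question by filling in a masked token, can be illustrated with a small sketch. This is not the FrozenBiLM code. The prompt template, the function names (`build_prompt`, `softmax`, `pick_answer`), and the example logits are all assumptions made for illustration. In a real system the logits at the masked position would come from the frozen bidirectional language model conditioned on the video features.

```python
import math

MASK = "[MASK]"  # placeholder mask token; the real token depends on the tokenizer


def build_prompt(question: str, subtitles: str = "") -> str:
    """Build a cloze-style prompt whose masked slot stands for the answer.

    The template here is hypothetical and only shows the general idea.
    """
    parts = [f"Question: {question}", f"Answer: {MASK}."]
    if subtitles:
        parts.append(f"Subtitles: {subtitles}")
    return " ".join(parts)


def softmax(logits: dict) -> dict:
    """Numerically stable softmax over a {word: logit} mapping."""
    peak = max(logits.values())
    exps = {word: math.exp(value - peak) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def pick_answer(mask_logits: dict, answer_vocab: list):
    """Return the most probable answer among answer_vocab at the masked slot.

    mask_logits stands in for the model output at the [MASK] position.
    """
    restricted = {w: mask_logits[w] for w in answer_vocab if w in mask_logits}
    if not restricted:
        raise ValueError("no candidate answer has a logit")
    probs = softmax(restricted)
    best = max(probs, key=probs.get)
    return best, probs


if __name__ == "__main__":
    prompt = build_prompt("What is the man holding?")
    # Made-up logits; a real model would produce these from the prompt and video.
    fake_logits = {"guitar": 2.0, "dog": 0.5, "car": -1.0}
    best, probs = pick_answer(fake_logits, ["guitar", "dog"])
    print(prompt)
    print(best, probs)
```

Restricting the softmax to a fixed answer vocabulary is one simple way to turn masked-token prediction into open-ended question answering. Whether the paper does exactly this is not stated in the quote above.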