Exploring a Unified Vision-Centric Contrastive Alternative on Multi-Modal Web Documents

Yiqi Lin1     Alex Jinpeng Wang2     Linjie Li3     Zhengyuan Yang3     Mike Zheng Shou1    
1Show Lab, National University of Singapore     2Central South University     3Microsoft    

Abstract

Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategies. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning.
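
The snippet-level objective can be pictured as a symmetric InfoNCE loss in which consecutive segments of the same document form the positive pairs and all other segments in the batch act as negatives. The sketch below is illustrative only; the function name snippet_contrastive_loss, the temperature value, and the batch layout are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def snippet_contrastive_loss(z_curr, z_next, temperature=0.07):
    # z_curr[i] and z_next[i] embed consecutive snippets from the same document;
    # every other snippet in the batch serves as a negative.
    z_curr = F.normalize(z_curr, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_curr @ z_next.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_curr.size(0), device=z_curr.device)
    # Symmetric InfoNCE: each snippet should retrieve its successor and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))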

Methodology

Methodology overview of VC2L

VC2L explores an alternative vision-centric paradigm for unified vision-language modeling on interleaved web data: a single vision transformer processes any combination of image and text modalities directly from pixels, natively learning a unified representation.
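
As a concrete, deliberately simplified illustration of this paradigm, the sketch below renders a snippet to pixels and encodes it with a single vision transformer. The helper render_snippet_to_image, the timm ViT backbone, and the canvas layout are assumptions for illustration, not the paper's implementation.

import torch
import timm
from PIL import Image, ImageDraw
from torchvision import transforms

def render_snippet_to_image(text=None, image=None, size=224):
    # Render a text snippet (and optionally paste an image) onto a white canvas,
    # so that text, images, and their combinations all become pixel inputs.
    canvas = Image.new("RGB", (size, size), "white")
    if image is not None:
        canvas.paste(image.resize((size, size // 2)), (0, 0))
    if text is not None:
        draw = ImageDraw.Draw(canvas)
        draw.text((4, size // 2 if image is not None else 4), text, fill="black")
    return canvas

# One vision transformer encodes every modality from pixels.
# pretrained=False keeps the sketch self-contained; input normalization is omitted for brevity.
encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
to_tensor = transforms.ToTensor()

snippet = render_snippet_to_image(text="CLIP-style models learn from image-text pairs.")
with torch.no_grad():
    embedding = encoder(to_tensor(snippet).unsqueeze(0))  # shape: (1, hidden_dim)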

Zero-Shot Information Retrieval Results

Figures: zero-shot information retrieval results (two panels)

Visualization

Visualization of VC2L embeddings and retrievals

