Parrot Captions Teach CLIP to Spot Text

CLIP Score is Innately Flawed!!!

Yiqi Lin1*     Conghui He1*†     Alex Jinpeng Wang2*     Bin Wang1*     Weijia Li3    
Mike Zheng Shou2    
1Shanghai AI Laboratory     2National University of Singapore     3Sun Yat-Sen University     *Equal Contribution     †Corresponding Author    

TL;DR

  • Captions in LAION-2B have a significant bias towards describing visual text content embedded in the images.
  • Released CLIP models have a strong text-spotting bias in almost every style of web image, so CLIP-score-filtered datasets are inherently biased towards data dominated by visual text.
  • CLIP models easily learn text-spotting capability from parrot captions while failing to connect vision-language semantics, just like a text-spotting parrot.
  • We provide an alternative solution by releasing a less biased 100M filtered subset of LAION-2B and the corresponding pre-trained CLIP models.

Overview

In LAION-2B, the image-text pairs with the top-5% highest similarity scores are dominated by visual text! These samples have dense concurrent text that appears in both the captions and the images (i.e., text rendered in pixels). We refer to their captions as Parrot Captions, as they raise a question: Does CLIP Simply Parrot Text in Images for Vision-Language Alignment? The concurrent text is spotted by an OCR model and highlighted in color in the image-text pairs.
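
Concretely, a caption "parrots" an image when the words spotted by the OCR model re-appear verbatim in the caption. Below is a minimal sketch of that idea (not the paper's exact metric); the `concurrent_text_ratio` helper and the simple word-level matching are illustrative assumptions, and `ocr_words` is assumed to come from any off-the-shelf OCR model.

```python
import re


def concurrent_text_ratio(caption: str, ocr_words: list[str]) -> float:
    """Fraction of OCR-spotted words that also appear verbatim in the caption."""
    caption_tokens = set(re.findall(r"[a-z0-9]+", caption.lower()))
    ocr_tokens = [re.sub(r"[^a-z0-9]", "", w.lower()) for w in ocr_words]
    ocr_tokens = [t for t in ocr_tokens if t]
    if not ocr_tokens:
        return 0.0
    hits = sum(1 for t in ocr_tokens if t in caption_tokens)
    return hits / len(ocr_tokens)


# Example: a poster image whose caption simply repeats the text in the pixels.
caption = "Keep Calm and Carry On vintage poster"
ocr_words = ["KEEP", "CALM", "AND", "CARRY", "ON"]
print(concurrent_text_ratio(caption, ocr_words))  # 1.0 -> likely a parrot caption
```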

Profiling LAION-2B Data

We first run K-Means clustering on the LAION-2B dataset and then scan the whole dataset with an OCR model. Surprisingly, we find that around 50% of the images contain embedded text content. In the clusters with a high text-image ratio, the top CLIP-score samples span various text sources, such as posters, book covers, advertisements, TV show screenshots, and even PowerPoint slides.
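
As a rough sketch of this profiling step, the snippet below clusters precomputed CLIP image embeddings with K-Means and reports the fraction of images containing embedded text per cluster. The file names, the cluster count of 100, and the use of scikit-learn's MiniBatchKMeans are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assumed precomputed inputs: CLIP image embeddings and a 0/1 flag per image
# indicating whether the OCR model found any embedded text.
embeddings = np.load("clip_image_embeddings.npy")   # shape: (N, D)
has_text = np.load("ocr_has_text.npy")              # shape: (N,)

kmeans = MiniBatchKMeans(n_clusters=100, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Per-cluster ratio of images that contain embedded text.
for c in range(kmeans.n_clusters):
    mask = labels == c
    ratio = has_text[mask].mean() if mask.any() else 0.0
    print(f"cluster {c:3d}: {mask.sum():8d} images, text-image ratio = {ratio:.2%}")
```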

Inspecting Pre-Trained CLIP Models

To better understand why LAION data contains such a high proportion of parrot captions, we inspect the released CLIP models by ablating the embedded text using text inpainting. The CLIP scores drop significantly once we remove the text from the images, compared to a random-inpainting baseline, indicating that parrot captions correlate strongly with the CLIP score measurement.
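
This ablation can be reproduced in spirit by scoring the same caption against the original image, a text-inpainted version, and a randomly inpainted version. The sketch below uses open_clip to compute the CLIP scores; the inpainted image files are assumed to be produced beforehand by an off-the-shelf inpainting model, and the file paths and caption are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load one of the released LAION-trained CLIP models.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()


caption = "Keep Calm and Carry On vintage poster"  # placeholder caption
print("original:        ", clip_score("original.jpg", caption))
print("text inpainted:  ", clip_score("text_inpainted.jpg", caption))
print("random inpainted:", clip_score("random_inpainted.jpg", caption))
```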

Training CLIP on Emb. Text Curated Data

We dive deeper into parrot captions by training CLIP models on LAION-2B subsets selected by different embedded-text-oriented criteria under the same training setting. The results show that a CLIP model trained on data biased towards parrot captions easily acquires a strong text-spotting bias; a schematic example of such a curation criterion is sketched below.
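
The sketch below assigns each sample to a curation group based on its OCR result and caption. The group names, the `sample` record layout, and the simple word-level test are assumptions for illustration; the paper's exact subsets and thresholds differ.

```python
def curation_group(sample: dict) -> str:
    """Assign a LAION sample to an embedded-text-oriented curation group."""
    ocr_words = sample["ocr_words"]                      # words spotted in the image
    caption_tokens = set(sample["caption"].lower().split())
    if not ocr_words:
        return "no_embedded_text"
    parroted = [w for w in ocr_words if w.lower() in caption_tokens]
    # Captions that repeat the spotted text form the "parrot caption" group.
    return "parrot_caption" if parroted else "embedded_text_not_parroted"


sample = {
    "caption": "Keep Calm and Carry On vintage poster",
    "ocr_words": ["KEEP", "CALM", "AND", "CARRY", "ON"],
}
print(curation_group(sample))  # -> "parrot_caption"
```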

Profiling More Datasets

  • MMC4: Image distribution is similar to LAION-2B.
  • CC12M: Image distribution is less biased towards text than LAION-2B.
  • More details are presented in our Paper.

A Simple Fix

  • We construct a less biased 100M subset from LAION-2B by keeping samples with empty OCR results, CLIP score > 0.3, and aesthetics score > 4.5 (see the sketch after this list).
  • We train a 100M-scale CLIP model as an alternative to the existing CLIP-score filtering pipeline.
  • The dataset and models are available on our GitHub.
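
A minimal sketch of this filtering rule, applied to a per-sample metadata table. The file paths and column names (`ocr_text`, `clip_score`, `aesthetic_score`) are assumptions about how the annotations might be stored, not the released format.

```python
import pandas as pd

meta = pd.read_parquet("laion2b_metadata_with_ocr.parquet")  # hypothetical path

keep = (
    (meta["ocr_text"].fillna("").str.len() == 0)  # empty OCR result: no embedded text
    & (meta["clip_score"] > 0.3)                  # CLIP similarity threshold
    & (meta["aesthetic_score"] > 4.5)             # aesthetics predictor threshold
)
subset = meta[keep]
print(f"kept {len(subset)} / {len(meta)} samples")
subset.to_parquet("laion2b_less_biased_subset.parquet")
```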

BibTeX

@article{lin2023parrot,
  title={Parrot Captions Teach CLIP to Spot Text}, 
  author={Yiqi Lin and Conghui He and Alex Jinpeng Wang and Bin Wang and Weijia Li and Mike Zheng Shou},
  journal={arXiv preprint arXiv:2312.14232},
  year={2023}
}
@misc{conghui2022opendatalab,
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Wang, Bin and Xu, Chao and Lin, Dahua},
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  howpublished = {\url{https://opendatalab.com}},
  year={2022}
}
