Definition
CLIP (Contrastive Language-Image Pre-training) is a multimodal model released by OpenAI in 2021 that learns to associate images with natural-language descriptions. It trains an image encoder and a text encoder jointly on roughly 400 million image-text pairs scraped from the web, pulling matching pairs together in a shared vector space while pushing mismatched pairs apart. The result is a model that can compare any image to any piece of text by measuring the cosine similarity of their embeddings.
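The training objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not OpenAI's implementation: the function names are hypothetical, the embeddings are toy vectors rather than encoder outputs, and the temperature of 0.07 is an assumed illustrative value (in practice it is a learned parameter). For a batch of matched pairs, each image should score highest against its own caption and each caption against its own image, so the loss averages a cross-entropy term in both directions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair k matches pair k.

    temperature is an assumed illustrative value, not taken from the text.
    """
    n = len(image_embs)
    # logits[i][j]: scaled similarity between image i and text j
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Each image should pick out its own text (rows) ...
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each text should pick out its own image (columns).
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: two images and their captions as 2-d vectors (made up).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(images, matched_texts)
misaligned = contrastive_loss(images, swapped_texts)
```

With these toy vectors the loss is lower when each caption sits next to its own image than when the captions are swapped, which is the signal that pulls matching pairs together and pushes mismatched pairs apart.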
Why CLIP matters
Because CLIP aligns vision and language in one space, it can classify images into categories it was never explicitly trained on, simply by scoring candidate text labels against an image and picking the closest match. This zero-shot capability reduces the need to train or fine-tune a separate classifier for each new task. CLIP's image encoder is a ResNet or a Vision Transformer, and its text encoder is a Transformer; the two are trained jointly with a contrastive loss.
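Zero-shot classification as described here amounts to a nearest-label lookup in the shared space. The sketch below assumes the embeddings have already been produced by the two encoders; the vectors, the `zero_shot_classify` name, the "a photo of a ..." label phrasing, and the scale factor of 100 are all hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs, scale=100.0):
    """Score every candidate label against one image embedding.

    label_embs maps label text to its (hypothetical) text embedding.
    scale sharpens the softmax; 100.0 is an assumed illustrative value.
    Returns the best label and a dict of per-label probabilities.
    """
    labels = list(label_embs)
    logits = [scale * cosine_similarity(image_emb, label_embs[name]) for name in labels]
    # Numerically stable softmax over the candidate labels
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Made-up 3-d embeddings standing in for real encoder outputs.
label_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

best, probs = zero_shot_classify(image_emb, label_embs)
print(best)
```

Adding a new category requires only a new text label and its embedding; no classifier weights are retrained, which is what makes the approach zero-shot.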
Where CLIP shows up
CLIP's encoders became foundational plumbing across the open-source AI ecosystem. The text encoder steers text-to-image diffusion systems, and the image encoder feeds vision-language models that answer questions about pictures. Open-weight reimplementations such as OpenCLIP let sovereign builders run these capabilities locally without depending on a hosted API.
For the architectural building blocks behind CLIP, see our entries on the vision encoder and the transformer architecture that power both halves of the model.
