CLIP (OpenAI)#

CLIP was trained on a large collection of image-text pairs, using a contrastive loss to learn which images go with which texts. Here we use CLIP in a zero-shot setting: the model has not seen any examples of the task at hand. Given an image and a set of text prompts, it predicts which prompt best matches the image, and that prompt (hopefully) describes the image's class.

We are doing this because models trained on the ImageNet dataset failed to predict the class of our image, so we are using CLIP to predict the class instead. CLIP was trained on 400 million image-text pairs.
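
To give a rough idea of what the contrastive training objective looks like, here is a minimal sketch in PyTorch, loosely following the pseudocode in the CLIP paper. The embedding tensors and batch size are made up for illustration; this is not the actual training code.

import torch
import torch.nn.functional as F

# made-up batch of paired embeddings (in real training these come from the
# image encoder and the text encoder): 8 pairs, 512-dimensional embeddings
image_features = F.normalize(torch.randn(8, 512), dim=-1)
text_features = F.normalize(torch.randn(8, 512), dim=-1)

# cosine-similarity matrix scaled by a (normally learned) temperature
logit_scale = torch.tensor(100.0)
logits = logit_scale * image_features @ text_features.t()

# each image should match the text at the same index (and vice versa),
# so the loss is a symmetric cross-entropy over the similarity matrix
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2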

We can install clip using the following command:

pip install git+https://github.com/openai/CLIP.git
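
Once installed, you can list the model checkpoints the package knows about as a quick sanity check (the exact list depends on the installed version):

import clip
clip.available_models()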

A key difference from traditional CNNs#

CLIP matches images to texts: you provide an image and a text prompt, and it tells you how well they match. This is different from a traditional CNN, which predicts one of a fixed set of class labels for an image. CLIP can still be used to predict class labels, but it does so by scoring text prompts, and those prompts can be anything. This is called zero-shot learning.

import clip
import torch
from PIL import Image
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
'cuda'
model, preprocess = clip.load("ViT-B/32", device=device) # 300MB model
# load the image
image = preprocess(Image.open("car.jpg")).unsqueeze(0).to(device)
prompts = ["a vintage car", "a pickup truck", "a convertible"]
text = clip.tokenize(prompts).to(device)
# forward pass
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
# note: iterating over `text` yields the raw token-ID tensors, not the prompt strings
for i, label in enumerate(text):
    print(f"{i+1}. {probs[0][i].item():.4f} - {label}")
1. 0.8706 - tensor([49406,   320,  3266,  1615, 49407,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0], device='cuda:0',
       dtype=torch.int32)
2. 0.0621 - tensor([49406,   320, 15382,  4629, 49407,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0], device='cuda:0',
       dtype=torch.int32)
3. 0.0671 - tensor([49406,   320, 19608, 49407,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0], device='cuda:0',
       dtype=torch.int32)

Because we looped over the tokenized `text`, each probability is printed next to the raw token-ID tensor rather than the prompt itself. The number before each tensor is the probability that the image matches that prompt. We can print this in a more human-readable form by looping over the original prompts instead:

for i, prob in enumerate(probs[0]):
    print(f"{i+1}. {prompts[i]} ({prob:.4f})")
1. a vintage car (0.8706)
2. a pickup truck (0.0621)
3. a convertible (0.0671)
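
For intuition about what model(image, text) is doing, here is a minimal sketch that computes the same probabilities by hand, assuming the model, image, and text from the example above are still in scope: it encodes both modalities, normalizes the embeddings, and takes a scaled cosine similarity.

with torch.no_grad():
    # encode each modality into the shared embedding space
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # normalize so the dot product becomes a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # scale by the learned temperature, then softmax over the prompts
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    print(logits.softmax(dim=-1))

This should reproduce the probabilities printed above, since it is the same computation CLIP performs in its forward pass.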