---
language:
- en
base_model:
- openai/clip-vit-large-patch14
tags:
- emotion_prediction
- VEA
- computer_vision
- perceptual_tasks
- CLIP
- EmoSet
---

**PerceptCLIP-Emotions** is a model designed to predict the **emotions** that an image evokes in users. This is the official model from the paper:  
📄 **["Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks"](https://arxiv.org/abs/2503.13260)**.
We apply **LoRA adaptation** to the **CLIP visual encoder** and add an **MLP head** for emotion classification. Our model achieves **state-of-the-art results** on this task.

## Training Details

- *Dataset*: [EmoSet](https://vcc.tech/EmoSet)
- *Architecture*: CLIP Vision Encoder (ViT-L/14) with *LoRA adaptation* and an MLP classification head (a sketch of this setup follows the list)
- *Loss Function*: Cross Entropy Loss
- *Optimizer*: AdamW
- *Learning Rate*: 0.0001
- *Batch Size*: 32
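
A minimal sketch of how such a model could be assembled with `transformers` and `peft` (this is not the authors' released training code; the LoRA rank/alpha, target modules, and MLP width are illustrative assumptions, while the loss, optimizer, and learning rate follow the values listed above):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

class CLIPLoRAEmotionClassifier(nn.Module):
    def __init__(self, num_classes=8, lora_rank=16):
        super().__init__()
        backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        lora_cfg = LoraConfig(
            r=lora_rank,                 # illustrative rank, not taken from the paper
            lora_alpha=32,               # illustrative scaling factor
            target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # attention projections
            lora_dropout=0.1,
        )
        self.backbone = get_peft_model(backbone, lora_cfg)
        hidden = backbone.config.hidden_size  # 1024 for ViT-L/14
        # MLP head for 8-way emotion classification (width 512 is an assumption)
        self.head = nn.Sequential(
            nn.Linear(hidden, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, pixel_values):
        # Pooled representation from the vision encoder feeds the classification head
        features = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(features)

model = CLIPLoRAEmotionClassifier()
# Only the LoRA adapters and the MLP head are trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
criterion = nn.CrossEntropyLoss()
```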

## Installation & Requirements
You can set up the environment using `environment.yml` or install the dependencies manually:
- python=3.9.15
- cudatoolkit=11.7
- torchvision=0.14.0
- transformers=4.45.2
- peft=0.14.0

## Usage

To use the model for inference:

```python
from torchvision import transforms
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
import importlib.util

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model class definition dynamically
class_path = hf_hub_download(repo_id="PerceptCLIP/PerceptCLIP_Emotions", filename="modeling.py")
spec = importlib.util.spec_from_file_location("modeling", class_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)

# Initialize the model
ModelClass = modeling.clip_lora_model
model = ModelClass().to(device)

# Load pretrained weights
model_path = hf_hub_download(repo_id="PerceptCLIP/PerceptCLIP_Emotions", filename="perceptCLIP_Emotions.pth")
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()

# Emotion label mapping
idx2label = {
    0: "amusement",
    1: "awe",
    2: "contentment",
    3: "excitement",
    4: "anger",
    5: "disgust",
    6: "fear",
    7: "sadness"
}

# Preprocessing function
def emo_preprocess():
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(size=(224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
    ])
    return transform

# Load an image
image = Image.open("image_path.jpg").convert("RGB")
image = emo_preprocess()(image).unsqueeze(0).to(device)

# Run inference
with torch.no_grad():
    outputs = model(image)
    _, predicted = outputs.max(1)

# Get emotion label
predicted_emotion = idx2label[predicted.item()]
print(f"Predicted Emotion: {predicted_emotion}")
```
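
To also get the model's confidence for each emotion, the logits can be converted to probabilities with a standard softmax (generic PyTorch, not specific to this checkpoint):

```python
# Per-emotion probabilities from the logits (reuses `model`, `image`, and `idx2label` from above)
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1).squeeze(0)

for idx, p in enumerate(probs.tolist()):
    print(f"{idx2label[idx]}: {p:.3f}")
```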

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zalcher2025don,
  title={Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks},
  author={Zalcher, Amit and Wasserman, Navve and Beliy, Roman and Heinimann, Oliver and Irani, Michal},
  journal={arXiv preprint arXiv:2503.13260},
  year={2025}
}
```