Create README.md
---
language:
- en
base_model:
- openai/clip-vit-large-patch14
tags:
- emotion_prediction
- VEA
- computer_vision
- perceptual_tasks
- CLIP
- EmoSet
---

# Don’t Judge Before You CLIP: Visual Emotion Analysis Model

This model is part of our paper:
*"Don’t Judge Before You CLIP: A Unified Approach for Perceptual Tasks"*
It was trained on the *EmoSet dataset* to predict an emotion class.

## Model Overview

Visual perceptual tasks, such as visual emotion analysis, aim to estimate how humans perceive and interpret images. Unlike objective tasks (e.g., object recognition), these tasks rely on subjective human judgment, making labeled data scarce.

Our approach leverages *CLIP* as a prior for perceptual tasks, inspired by cognitive research showing that CLIP correlates well with human judgment. This suggests that CLIP implicitly captures human biases, emotions, and preferences. We fine-tune CLIP minimally using LoRA and incorporate an MLP head to adapt it to each specific task.

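For intuition, the sketch below shows one way such a model can be assembled: the `openai/clip-vit-large-patch14` vision encoder with LoRA adapters (via the `peft` library) and a small MLP classification head. The LoRA rank/alpha of 16/8 appear in the released checkpoint's filename, but the target modules, head sizes, and the eight-class output are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model


class ClipEmotionClassifier(nn.Module):
    """Illustrative sketch: CLIP ViT-L/14 vision encoder + LoRA adapters + MLP head."""

    def __init__(self, num_classes: int = 8, lora_r: int = 16, lora_alpha: int = 8):
        super().__init__()
        encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        # Inject LoRA adapters into the attention projections; the base CLIP weights are not updated.
        lora_cfg = LoraConfig(r=lora_r, lora_alpha=lora_alpha, target_modules=["q_proj", "v_proj"])
        self.encoder = get_peft_model(encoder, lora_cfg)
        hidden = encoder.config.hidden_size  # 1024 for ViT-L/14
        # MLP head mapping the pooled image embedding to per-emotion logits (sizes are assumptions).
        self.head = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(), nn.Linear(512, num_classes))
        # CLIP normalization constants, applied inside forward (see the note in the Usage section).
        self.register_buffer("mean", torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1))

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        pixel_values = (pixel_values - self.mean) / self.std
        pooled = self.encoder(pixel_values=pixel_values).pooler_output
        return self.head(pooled)
```

In a setup like this, only the LoRA parameters and the MLP head would be trainable, which is what keeps the fine-tuning of CLIP minimal.
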
## Training Details

- *Dataset*: [EmoSet](https://vcc.tech/EmoSet)
- *Architecture*: CLIP Vision Encoder (ViT-L/14) with *LoRA adaptation*
- *Loss Function*: Cross Entropy Loss
- *Optimizer*: AdamW
- *Learning Rate*: 0.0001
- *Batch Size*: 32

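Putting these settings together, a fine-tuning loop could look like the sketch below. This is a hypothetical illustration, not the authors' training script: `model` is assumed to be a CLIP+LoRA+MLP classifier like the one sketched above, and `train_set` is assumed to yield `(image_tensor, label)` pairs from EmoSet.

```python
import torch
from torch.utils.data import DataLoader, Dataset


def finetune(model: torch.nn.Module, train_set: Dataset, device: torch.device, epochs: int = 1):
    """Hypothetical fine-tuning loop using the hyperparameters listed above."""
    loader = DataLoader(train_set, batch_size=32, shuffle=True)   # Batch Size: 32
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)    # AdamW, Learning Rate: 0.0001
    criterion = torch.nn.CrossEntropyLoss()                       # Cross Entropy Loss
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```
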
## Performance

The model was trained on the *EmoSet* dataset using the standard train/val/test splits and achieves *state-of-the-art* performance compared to previous methods.

## Usage

To use the model for inference:

```python
import torch
from torchvision import transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the full serialized model (PyTorch >= 2.6 may require passing weights_only=False)
model = torch.load("EmoSet_clip_Lora_16.0R_8.0alphaLora_32_batch_0.0001_headmlp.pth").to(device).eval()

# Load an image
image = Image.open("image_path.jpg").convert("RGB")

# Preprocessing: resize and center-crop to 224x224
def Emo_preprocess():
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(size=(224, 224)),
        transforms.ToTensor(),
        # Note: The model normalizes the image inside the forward pass
        # using mean = (0.48145466, 0.4578275, 0.40821073) and
        # std = (0.26862954, 0.26130258, 0.27577711)
    ])
    return transform

image = Emo_preprocess()(image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    output = model(image)

# The model predicts an emotion class: if it returns per-class logits, take the argmax;
# otherwise it already gives the class index directly.
emo_label = output.argmax(dim=-1).item() if output.numel() > 1 else int(output.item())
print(f"Predicted Emotion: {emo_label}")
```
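Continuing from the snippet above, the printed label is an integer class index. EmoSet covers eight emotion categories (amusement, anger, awe, contentment, disgust, excitement, fear, sadness); the mapping below assumes an alphabetical label encoding, so check it against the encoding used when the model was trained.

```python
# Assumed (alphabetical) index-to-name mapping for the eight EmoSet emotion categories;
# verify against the label encoding used during training before relying on it.
EMOSET_CLASSES = [
    "amusement", "anger", "awe", "contentment",
    "disgust", "excitement", "fear", "sadness",
]
print(f"Predicted Emotion: {EMOSET_CLASSES[emo_label]}")
```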