---
language:
- en
base_model:
- openai/clip-vit-large-patch14
tags:
- emotion_prediction
- VEA
- computer_vision
- perceptual_tasks
- CLIP
- EmoSet
---
**PerceptCLIP-Emotions** is a model designed to predict the **emotions** that an image evokes in users. This is the official model from the paper:
📄 **["Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks"](https://arxiv.org/abs/2503.13260)**.
We apply **LoRA adaptation** to the **CLIP visual encoder** and add an **MLP head** for emotion classification. Our model achieves **state-of-the-art results** on the image emotion recognition task.
## Training Details
- *Dataset*: [EmoSet](https://vcc.tech/EmoSet)
- *Architecture*: CLIP Vision Encoder (ViT-L/14) with *LoRA adaptation*
- *Loss Function*: Cross Entropy Loss
- *Optimizer*: AdamW
- *Learning Rate*: 0.0001
- *Batch Size*: 32
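For orientation, the sketch below shows one plausible way to assemble the setup listed above with `transformers` and `peft`. It is an illustrative approximation, not the repository's `modeling.py`: the class name, LoRA rank, target modules, and MLP head sizes are assumptions.

```python
# Illustrative sketch only - NOT the repository's actual modeling.py.
# CLIP ViT-L/14 vision encoder with LoRA adapters and an MLP head for 8 emotion classes.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

class CLIPLoRAEmotionClassifier(nn.Module):  # hypothetical name
    def __init__(self, num_classes=8, lora_rank=16):
        super().__init__()
        backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        lora_cfg = LoraConfig(r=lora_rank, lora_alpha=32,
                              target_modules=["q_proj", "v_proj"])  # assumed LoRA placement
        self.backbone = get_peft_model(backbone, lora_cfg)  # base weights are frozen
        hidden = backbone.config.hidden_size  # 1024 for ViT-L/14
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden // 2),
            nn.GELU(),
            nn.Linear(hidden // 2, num_classes),
        )

    def forward(self, pixel_values):
        # Pooled representation from the vision encoder -> emotion logits
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats)

# Training configuration as listed above: cross-entropy loss, AdamW, lr=1e-4
model = CLIPLoRAEmotionClassifier()
criterion = nn.CrossEntropyLoss()
trainable = [p for p in model.parameters() if p.requires_grad]  # LoRA adapters + MLP head
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```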
## Installation & Requirements
You can set up the environment using `environment.yml` or install the dependencies manually:
- python=3.9.15
- cudatoolkit=11.7
- torchvision=0.14.0
- transformers=4.45.2
- peft=0.14.0
## Usage
To use the model for inference:
```python
from torchvision import transforms
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
import importlib.util
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model class definition dynamically
class_path = hf_hub_download(repo_id="PerceptCLIP/PerceptCLIP_Emotions", filename="modeling.py")
spec = importlib.util.spec_from_file_location("modeling", class_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)
# initialize a model
ModelClass = modeling.clip_lora_model
model = ModelClass().to(device)
# Load pretrained model
model_path = hf_hub_download(repo_id="PerceptCLIP/PerceptCLIP_Emotions", filename="perceptCLIP_Emotions.pth")
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()
# Emotion label mapping
idx2label = {
    0: "amusement",
    1: "awe",
    2: "contentment",
    3: "excitement",
    4: "anger",
    5: "disgust",
    6: "fear",
    7: "sadness"
}
# Preprocessing function
def emo_preprocess():
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(size=(224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
    ])
    return transform
# Load an image
image = Image.open("image_path.jpg").convert("RGB")
image = emo_preprocess()(image).unsqueeze(0).to(device)
# Run inference
with torch.no_grad():
    outputs = model(image)
    _, predicted = outputs.max(1)
# Get emotion label
predicted_emotion = idx2label[predicted.item()]
print(f"Predicted Emotion: {predicted_emotion}")
```
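Continuing directly from the snippet above (it reuses `outputs` and `idx2label`), and assuming the model returns raw logits over the eight classes, you can also inspect a probability per emotion. This is an optional extension, not part of the official example:

```python
# Optional: per-emotion probabilities via softmax over the logits
probs = torch.softmax(outputs, dim=1).squeeze(0)
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{idx2label[idx]}: {p:.3f}")
```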
## Citation
If you use this model in your research, please cite:
```bibtex
@article{zalcher2025don,
  title={Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks},
  author={Zalcher, Amit and Wasserman, Navve and Beliy, Roman and Heinimann, Oliver and Irani, Michal},
  journal={arXiv preprint arXiv:2503.13260},
  year={2025}
}
```