jiangchengchengNLP committed · Commit 210a895 · verified · 1 Parent(s): b11ecdd

Update README.md

Files changed (1): README.md (+142 −3)
README.md CHANGED
---
license: apache-2.0
metrics:
- accuracy
base_model:
- openai/clip-vit-base-patch32
---
# EmotionCLIP-V2

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/662f655a02d973f5970ccbd3/x3UxOY4IP1mRX2rdQways.jpeg)

## Project Overview

EmotionCLIP is an open-domain multimodal emotion perception model built on CLIP. This model aims to perform broad emotion recognition through multimodal inputs such as faces, scenes, and photos, supporting the analysis of emotional attributes in images, scene layouts, and even artworks.

## Datasets

The model is trained using the following datasets:

1. **EmoSet**:
   - Citation:
     ```
     @inproceedings{yang2023emoset,
       title={EmoSet: A Large-Scale Visual Emotion Dataset with Rich Attributes},
       author={Yang, Jingyuan and Huang, Qirui and Ding, Tingting and Lischinski, Dani and Cohen-Or, Danny and Huang, Hui},
       booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
       pages={20383--20394},
       year={2023}
     }
     ```
   - This dataset contains rich emotional labels and visual features, providing a foundation for emotion perception. In this model, we use the EmoSet-118K subset.

2. **Open Human Facial Emotion Recognition Dataset**:
   - Contains nearly 10,000 images with emotion labels gathered from in-the-wild scenes to enhance the model's capability in facial emotion recognition.

3. **SFEW**:
   - The Static Facial Expressions in the Wild (SFEW) dataset is a facial expression recognition dataset. It was created by selecting static frames from the AFEW database, computing keyframes based on facial point clustering.

4. **Neutral add**:
   - Contains 50K images without obvious emotional fluctuations, used as a supplementary neutral category (the resulting nine-way label space is sketched below).

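Taken together, these sources form a nine-way label space: the eight EmoSet emotion classes plus the supplementary neutral class. The following is a minimal sketch of collecting such data, assuming a simple folder-per-label layout; the layout and helper function are illustrative assumptions, not the repository's actual preprocessing pipeline.

```python
# Illustrative sketch only: the folder-per-label layout is an assumption,
# not the actual data pipeline used for EmotionCLIP-V2.
import os

EMOTION_LABELS = {
    'amusement': 0, 'anger': 1, 'awe': 2, 'contentment': 3,
    'disgust': 4, 'excitement': 5, 'fear': 6, 'sadness': 7,
    'neutral': 8,  # supplementary class from the "Neutral add" images
}

def collect_samples(root):
    """Walk a folder-per-label directory tree and return (path, label_id) pairs."""
    samples = []
    for label, label_id in EMOTION_LABELS.items():
        label_dir = os.path.join(root, label)
        if not os.path.isdir(label_dir):
            continue
        for name in os.listdir(label_dir):
            if name.lower().endswith('.jpg'):
                samples.append((os.path.join(label_dir, name), label_id))
    return samples
```
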
## Training Method

EmotionCLIP-V2 combines three parameter-efficient fine-tuning methods: layer-norm tuning, prefix tuning, and prompt tuning. In practice, this mixture matches or even exceeds the performance of full fine-tuning on generalized visual emotion recognition while introducing only a small number of trainable parameters. In addition, thanks to the layer-norm adjustment, training converges faster than with prefix tuning or prompt tuning alone, and the resulting model achieves higher performance than EmotionCLIP-V1.

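As a rough sketch of how such a hybrid setup can be wired together (not the exact training code used for this repository; the wrapper class, token counts, and initialization are illustrative assumptions), one can freeze the pretrained CLIP weights, re-enable only the LayerNorm parameters, and add a small set of learnable prompt/prefix embeddings:

```python
# Illustrative sketch only: the class name, token counts, and initialization are
# assumptions, not the exact recipe used to train EmotionCLIP-V2.
import torch
import torch.nn as nn
from transformers import CLIPModel

class HybridTunedCLIP(nn.Module):
    def __init__(self, model_name="openai/clip-vit-base-patch32",
                 num_prompt_tokens=8, num_prefix_tokens=8):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(model_name)

        # Freeze every pretrained weight.
        for p in self.clip.parameters():
            p.requires_grad = False

        # Layer-norm tuning: re-enable gradients only for LayerNorm affine parameters.
        for module in self.clip.modules():
            if isinstance(module, nn.LayerNorm):
                for p in module.parameters():
                    p.requires_grad = True

        # Prompt tuning: learnable tokens intended to be prepended to the text embeddings.
        text_dim = self.clip.config.text_config.hidden_size
        self.prompt_tokens = nn.Parameter(torch.randn(num_prompt_tokens, text_dim) * 0.02)

        # Prefix tuning: learnable tokens intended to be prepended to the vision patch embeddings.
        vision_dim = self.clip.config.vision_config.hidden_size
        self.prefix_tokens = nn.Parameter(torch.randn(num_prefix_tokens, vision_dim) * 0.02)
        # (Injecting these tokens into the forward pass is omitted here for brevity.)

    def trainable_parameters(self):
        # Only LayerNorm weights plus the new prompt/prefix embeddings are optimized.
        return [p for p in self.parameters() if p.requires_grad]

model = HybridTunedCLIP()
n_trainable = sum(p.numel() for p in model.trainable_parameters())
n_total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {n_trainable:,} / {n_total:,} ({100 * n_trainable / n_total:.2f}%)")
```
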
## Fine-tuning Weights

This repository provides one set of fine-tuned weights:

1. **EmotionCLIP-V2 Weights**
   - Fine-tuned on the EmoSet-118K dataset, without additional training specifically for facial emotion recognition.
   - Final evaluation results:
     - Loss: 1.5465
     - Accuracy: 0.8256
     - Macro_Recall: 0.7803
     - F1: 0.8235

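For reference, the following is a minimal sketch of how metrics of this kind are typically computed, assuming scikit-learn; it is not necessarily the evaluation script used to produce the numbers above, and the averaging modes are assumptions.

```python
# Minimal sketch, assuming scikit-learn; the averaging choices are assumptions,
# not a statement about how the numbers above were produced.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 3, 7, 8, 2, 5]  # ground-truth emotion ids (see the label mapping below)
y_pred = [0, 3, 8, 7, 2, 5]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.4f}, macro_recall={macro_recall:.4f}, f1={f1:.4f}")
```
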
## Usage Instructions

```bash
git clone https://huggingface.co/jiangchengchengNLP/EmotionCLIP-V2

cd EmotionCLIP-V2
# Create your own test folder to store images ending in .jpg, or organize images from the repository for testing.
# By default, the MixCLIP weights are used. Run the following Python script from the current folder.
```

```python
from EmotionCLIP import model, preprocess, tokenizer
from PIL import Image
import torch
import matplotlib.pyplot as plt
import os
from torch.nn import functional as F

# Image folder path
image_folder = r'./test'  # test images are available in the EmotionCLIP repo: jiangchengchengNLP/EmotionCLIP
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]

# Emotion label mapping
consist_json = {
    'amusement': 0,
    'anger': 1,
    'awe': 2,
    'contentment': 3,
    'disgust': 4,
    'excitement': 5,
    'fear': 6,
    'sadness': 7,
    'neutral': 8
}
reversal_json = {v: k for k, v in consist_json.items()}

# Build one text prompt per emotion and tokenize them
text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]
text_input = tokenizer(text_list)

# Create a grid of subplots
num_images = len(image_files)
rows = 3  # 3 rows
cols = 3  # 3 columns
fig, axes = plt.subplots(rows, cols, figsize=(15, 10))  # adjust the canvas size
axes = axes.flatten()  # flatten the subplots to a 1D array
title_fontsize = 20

# Iterate through each image
for idx, img_path in enumerate(image_files):
    # Load and preprocess the image
    img = Image.open(img_path)
    img_input = preprocess(img)

    # Predict the emotion: softmax over image-text similarities, then take the top-1 label
    with torch.no_grad():
        logits_per_image, _ = model(
            img_input.unsqueeze(0).to(device=model.device, dtype=model.dtype),
            text_input.to(device=model.device)
        )
        softmax_logits_per_image = F.softmax(logits_per_image, dim=-1)
        top_k_values, top_k_indexes = torch.topk(softmax_logits_per_image, k=1, dim=-1)
        predicted_emotion = reversal_json[top_k_indexes.item()]

    # Display the image with its predicted emotion
    ax = axes[idx]
    ax.imshow(img)
    ax.set_title(f"Predicted: {predicted_emotion}", fontsize=title_fontsize)
    ax.axis('off')

# Hide any unused subplots
for idx in range(num_images, rows * cols):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()
```

## Existing Issues

After the introduction of the neutral category, the hybrid fine-tuning method improved the model by about 2% on the prediction task, but the added category still introduces noise that can interfere with emotion recognition in other scenes. Prompt tuning is the key to surpassing full fine-tuning, and layer-norm tuning makes training converge faster. However, mixing this many fine-tuning methods also has drawbacks: the model's generalization ability declines noticeably, and recognition of the difficult categories disgust and anger has not improved. Although I deliberately added some images of human disgust, the results still fall short of expectations.

A high-quality, large-scale visual emotion dataset is therefore still needed; the model's performance is clearly limited by training data that is far smaller than the pre-training corpus. At the same time, breakthroughs in model architecture would also help with this problem.

### Summary

I propose a hybrid layer_norm + prefix_tuning + prompt_tuning method for efficiently fine-tuning CLIP, which converges faster and achieves performance comparable to full fine-tuning. However, the loss of generalization ability remains a serious problem. I release EmotionCLIP-V2 trained with this method; compared with EmotionCLIP-V1 it adds a neutral category and performs slightly better. Future work aims to expand the training data for difficult categories and to optimize the model architecture.

---