---
license: mit
datasets:
- CodeHima/TOS_DatasetV3
language:
- en
metrics:
- accuracy
- precision
base_model: FacebookAI/roberta-base
pipeline_tag: text-classification
---
# TOSRoberta-base

## Model Overview

**Model Name:** TOSRoberta-base  
**Model Type:** Sequence Classification  
**Base Model:** [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base)  
**Language:** English  
**Task:** Classification of unfairness levels in Terms of Service (ToS) documents

**Model Card Version:** 1.0  
**Author:** CodeHima

## Model Description

The `TOSRoberta-base` model is a fine-tuned version of `RoBERTa-base` for classifying clauses in Terms of Service (ToS) documents into three categories:
- **Clearly Fair**
- **Potentially Unfair**
- **Clearly Unfair**

This model has been fine-tuned on a custom dataset labeled with the above categories to help identify unfair practices in ToS documents.

## Intended Use

### Primary Use Case
The primary use case of this model is to classify text from Terms of Service documents into different levels of fairness. It can be particularly useful for legal analysts, researchers, and consumer protection agencies to quickly identify potentially unfair clauses in ToS documents.

### Limitations
- **Dataset Bias:** The model has been trained on a specific dataset, which may introduce biases. It may not generalize well to all types of ToS documents.
- **Context Understanding:** The model may struggle with clauses that require deep contextual or legal understanding.

## Performance

### Training Configuration
- **Batch Size:** 32 (training), 16 (evaluation)
- **Learning Rate:** 1e-5
- **Epochs:** 10
- **Optimizer:** AdamW
- **Scheduler:** Linear with warmup
- **Training Framework:** PyTorch using HF中国镜像站's `transformers` library
- **Mixed Precision Training:** Enabled (fp16)
- **Resource:** Trained on a single NVIDIA T4 GPU (15 GB VRAM)
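
For reference, here is a minimal `TrainingArguments` sketch of the configuration above. The output directory and warmup step count are placeholders (only "linear with warmup" is reported), and AdamW with a linear schedule is the `Trainer` default:

```python
from transformers import TrainingArguments

# Sketch of the reported training configuration; output_dir is a
# hypothetical path and warmup_steps is an assumption (the exact
# warmup length is not stated in this card).
training_args = TrainingArguments(
    output_dir="tos-roberta-base",    # placeholder output path
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=500,                 # assumption: exact value not reported
    fp16=True,                        # mixed precision, as noted above
)
```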

### Training Metrics

| Epoch | Training Loss | Validation Loss | Accuracy | F1   | Precision | Recall |
|-------|---------------|-----------------|----------|------|-----------|--------|
| 1     | 0.668100      | 0.620207        | 0.740000 | 0.727| 0.728     | 0.740  |
| 2     | 0.439800      | 0.463925        | 0.824762 | 0.821| 0.826     | 0.825  |
| 3     | 0.373500      | 0.432604        | 0.831429 | 0.832| 0.834     | 0.831  |
| 4     | 0.342800      | 0.402661        | 0.854286 | 0.854| 0.853     | 0.854  |
| 5     | 0.283800      | 0.434868        | 0.829524 | 0.832| 0.840     | 0.830  |
| 6     | 0.218000      | 0.437268        | 0.859048 | 0.859| 0.859     | 0.859  |
| 7     | 0.266800      | 0.508120        | 0.820952 | 0.824| 0.834     | 0.821  |
| 8     | 0.139600      | 0.486364        | 0.855238 | 0.856| 0.856     | 0.855  |
| 9     | 0.085000      | 0.530111        | 0.844762 | 0.846| 0.850     | 0.845  |
| 10    | 0.103600      | 0.528026        | 0.842857 | 0.844| 0.847     | 0.843  |

**Best Validation Accuracy:** 85.90% (epoch 6)  
**Test Accuracy:** 85.65%

### Evaluation Metrics
- **Accuracy:** 85.65%
- **F1 Score:** 85.60%
- **Precision:** 85.61%
- **Recall:** 85.65%
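
These are standard classification metrics; a sketch of how they could be computed as a `Trainer` metrics callback with scikit-learn follows. The `average="weighted"` setting is an assumption, since the card does not state how per-class scores were aggregated:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Metrics callback for the HF中国镜像站 Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"  # assumption: aggregation not reported
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```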

## Dataset

The model was trained on [`CodeHima/TOS_DatasetV3`](https://huggingface.co/datasets/CodeHima/TOS_DatasetV3), which contains labeled clauses from ToS documents. The dataset is split into training, validation, and test sets for reliable performance evaluation.

**Dataset Labels:**
- `clearly_fair`
- `potentially_unfair`
- `clearly_unfair`
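
The dataset can be loaded directly from the Hub with the `datasets` library. Note that the split name referenced below is an assumption based on common conventions, not confirmed by this card:

```python
from datasets import load_dataset

# Load the dataset from the HF中国镜像站 Hub
dataset = load_dataset("CodeHima/TOS_DatasetV3")
print(dataset)              # inspect available splits and columns
print(dataset["train"][0])  # assumption: a "train" split exists
```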

## How to Use

Here’s how you can use the model with the HF中国镜像站 `transformers` library:

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the fine-tuned model and tokenizer from the Hub
model = RobertaForSequenceClassification.from_pretrained('CodeHima/TOSRoberta-base')
tokenizer = RobertaTokenizer.from_pretrained('CodeHima/TOSRoberta-base')
model.eval()  # switch to inference mode

# Predict the unfairness level of a clause
text = "Insert clause text here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Map the predicted class index to the corresponding label
label_mapping = {0: 'clearly_fair', 1: 'potentially_unfair', 2: 'clearly_unfair'}
predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```
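
Alternatively, the high-level `pipeline` API wraps the same tokenize-predict-decode steps in one call. The returned label names depend on the `id2label` mapping stored in the model's config:

```python
from transformers import pipeline

# One-line inference with the text-classification pipeline
classifier = pipeline("text-classification", model="CodeHima/TOSRoberta-base")
result = classifier("Insert clause text here.")
print(result)  # e.g. [{'label': ..., 'score': ...}]
```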

## Ethical Considerations

- **Bias:** The model's predictions may reflect biases present in the training data.
- **Fair Use:** Ensure the model is used responsibly, especially in legal contexts where human oversight is critical.

## Conclusion

The `TOSRoberta-base` model is a practical tool for flagging potentially unfair clauses in Terms of Service documents. While it performs well on its evaluation set, it should be used in conjunction with expert analysis, particularly in legally sensitive contexts.

**Model Repository:** [CodeHima/TOSRoberta-base](https://huggingface.co/CodeHima/TOSRoberta-base)