---
license: cc-by-nc-sa-4.0
language:
- multilingual
- fa
- en
library_name: transformers
tags:
- text-generation-inference
inference: false
metrics:
- bleu
- comet
- accuracy
- perplexity
- spearmanr
pipeline_tag: text-generation
co2_eq_emissions:
  emissions: 232380
  source: "PersianMind: A Cross-Lingual Persian-English Large Language Model. https://arxiv.org/abs/2401.06466"
  training_type: "fine-tuning"
  hardware_used: "4 RTX3090 24GB GPUs"
  geographical_location: "Tehran, Iran"
---

[](https://hf.co/QuantFactory)

# QuantFactory/PersianMind-v1.0-GGUF
This is a quantized version of [universitytehran/PersianMind-v1.0](https://huggingface.co/universitytehran/PersianMind-v1.0) created using llama.cpp.

# Original Model Card
### How to Use the Model
You need the `sentencepiece` and `accelerate` libraries along with PyTorch and 🤗Transformers to run this code.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Run on GPU if available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model weights in bfloat16 on the selected device.
model = AutoModelForCausalLM.from_pretrained(
    "universitytehran/PersianMind-v1.0",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map={"": device},
)
tokenizer = AutoTokenizer.from_pretrained(
    "universitytehran/PersianMind-v1.0",
)
# Conversation prompt template and system context used by PersianMind.
TEMPLATE = "{context}\nYou: {prompt}\nPersianMind: "
CONTEXT = "This is a conversation with PersianMind. It is an artificial intelligence model designed by a team of " \
    "NLP experts at the University of Tehran to help you with various tasks such as answering questions, " \
    "providing recommendations, and helping with decision making. You can ask it anything you want and " \
    "it will do its best to give you accurate and relevant information."
PROMPT = "در مورد هوش مصنوعی توضیح بده."  # "Explain about artificial intelligence."

model_input = TEMPLATE.format(context=CONTEXT, prompt=PROMPT)
input_tokens = tokenizer(model_input, return_tensors="pt")
input_tokens = input_tokens.to(device)

# Greedy decoding with a mild repetition penalty.
generate_ids = model.generate(**input_tokens, max_new_tokens=512, do_sample=False, repetition_penalty=1.1)
model_output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# Print only the newly generated text, stripping the echoed prompt.
print(model_output[len(model_input):])
```
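The template above covers a single user turn. One way to keep a conversation going is to fold each completed exchange back into the context before formatting the next prompt. The sketch below illustrates this idea; the `chat_turn` helper and its turn format are illustrative assumptions, not part of the original model card.
```python
# A minimal multi-turn sketch (assumption: follow-up turns reuse the same
# "You: ... / PersianMind: ..." format by appending earlier exchanges to the context).
def chat_turn(history, prompt):
    model_input = TEMPLATE.format(context=history, prompt=prompt)
    input_tokens = tokenizer(model_input, return_tensors="pt").to(device)
    generate_ids = model.generate(**input_tokens, max_new_tokens=512,
                                  do_sample=False, repetition_penalty=1.1)
    output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    answer = output[len(model_input):]
    # Fold the completed exchange back into the running context.
    new_history = f"{history}\nYou: {prompt}\nPersianMind: {answer}"
    return answer, new_history

history = CONTEXT
answer, history = chat_turn(history, PROMPT)
print(answer)
```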
### How to Quantize the Model
Quantized models can be run on resource-constrained devices. To quantize the model, you should install the `bitsandbytes` library. Use the code below to quantize the model in 8-bit (`INT8`).
```python
model = AutoModelForCausalLM.from_pretrained(
    "universitytehran/PersianMind-v1.0",
    device_map="auto",
    low_cpu_mem_usage=True,
    load_in_8bit=True,
)
```
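Note that recent 🤗Transformers releases deprecate the bare `load_in_8bit` argument in favour of passing a `BitsAndBytesConfig`. If your installed version warns about this, the following sketch loads the same checkpoint in 8-bit through the config object instead.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 8-bit loading via BitsAndBytesConfig (newer Transformers releases).
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "universitytehran/PersianMind-v1.0",
    quantization_config=quantization_config,
    device_map="auto",
    low_cpu_mem_usage=True,
)
```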
Alternatively, you can quantize the model in 4-bit (`NormalFloat4`) with the following code.
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "universitytehran/PersianMind-v1.0",
    quantization_config=quantization_config,
    device_map="auto",
)
```
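To see how much memory a loaded checkpoint actually occupies, you can query the model's memory footprint. `get_memory_footprint()` is a standard 🤗Transformers method; the exact figure will vary with the chosen quantization and your library versions, so treat the snippet below as a quick sanity check rather than a benchmark.
```python
# Report the loaded model's approximate memory footprint in gigabytes.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.1f} GB")
```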
### Evaluating Quantized Models
| Model | Belebele (Persian) | Fa→En Translation