Uploaded model

  • Developed by: suriya7
  • License: apache-2.0
  • Finetuned from model: AquilaX-AI/QnA-1.5B

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Requirements

pip install gguf
pip install transformers
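
If you only need the raw GGUF file itself (for example, to run it with llama.cpp or another GGUF runtime instead of transformers), you can download it directly with huggingface_hub. This is an optional sketch and assumes huggingface_hub is installed (pip install huggingface_hub):

# Optional: fetch the GGUF file directly (assumes `pip install huggingface_hub`)
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="suriya7/qwen-1.5b-quantized",
    filename="unsloth.Q5_K_M.gguf",
)
print(gguf_path)  # local path to the downloaded GGUF file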

Inference

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "suriya7/qwen-1.5b-quantized"
filename = "unsloth.Q5_K_M.gguf"

# Pass token="hf_..." to both calls if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)


# Define the input messages
messages = [
    [
        {
            "role": "system",
            "content": "You are Securitron, a helpful AI assistant specialized in providing accurate and professional responses. Always prioritize clarity and precision in your answers."
        },
        {
            "role": "user",
            "content": "what is ai?"
        },
    ],
]

# Tokenize the input messages
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

# Initialize the TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Measure the generation time
start_time = time.time()

model.to("cuda")
# Generate text with streaming
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256, streamer=streamer, do_sample=False)

# Calculate total generation time
end_time = time.time()
total_time = end_time - start_time

print(f"\nTotal Generation Time: {total_time:.2f} seconds")