Uploaded model

  • Developed by: suriya7
  • License: apache-2.0
  • Finetuned from model: AquilaX-AI/QnA-1.5B

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Requirements

pip install gguf
pip install transformers
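
If you only need the raw GGUF file itself (for example, to run it with llama.cpp or another GGUF runtime instead of transformers), you can download it directly with huggingface_hub. This is an optional sketch and assumes huggingface_hub is installed (pip install huggingface_hub):

# Optional: fetch the GGUF file directly (assumes `pip install huggingface_hub`)
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="suriya7/qwen-1.5b-quantized",
    filename="unsloth.Q5_K_M.gguf",
)
print(gguf_path)  # local path to the downloaded GGUF file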

Inference

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "suriya7/qwen-1.5b-quantized"
filename = "unsloth.Q5_K_M.gguf"

# Pass token="hf_..." to both calls if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)


# Define the input messages
messages = [
    [
        {
            "role": "system",
            "content": "You are Securitron, a helpful AI assistant specialized in providing accurate and professional responses. Always prioritize clarity and precision in your answers."
        },
        {
            "role": "user",
            "content": "what is ai?"
        },
    ],
]

# Tokenize the input messages
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

# Initialize the TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Measure the generation time
start_time = time.time()

model.to("cuda")
# Generate text with streaming
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256, streamer=streamer, do_sample=False)

# Calculate total generation time
end_time = time.time()
total_time = end_time - start_time

print(f"\nTotal Generation Time: {total_time:.2f} seconds")