Load base model
import torch
from transformers import AutoConfig, AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
config = AutoConfig.from_pretrained(model_name)
# Building the model from the config takes a few minutes and initializes the weights in float32, even though the config specifies bfloat16
model = Qwen2_5_VLForConditionalGeneration(config)
model.to(torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_name)
# Apply AWQ
quantization_config_path = "./weights/AWQ_config.json"
quantization_weight_path = "./weights/AWQ_weights.pth"
res = apply_AWQ(model, quantization_config_path, quantization_weight_path=quantization_weight_path)
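If the slow float32 initialization noted in the comment above is a concern, one possible workaround (a sketch, not part of this repository) is to change torch's default floating-point dtype before constructing the model from its config, so parameters are created directly in bfloat16:

# Sketch: create parameters directly in bfloat16 to skip the float32 initialization.
import torch
from transformers import AutoConfig, Qwen2_5_VLForConditionalGeneration

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
torch.set_default_dtype(torch.bfloat16)   # newly created float tensors default to bfloat16
model = Qwen2_5_VLForConditionalGeneration(config)
torch.set_default_dtype(torch.float32)    # restore the global default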
Inference
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What does this photo show?"},
        ],
    }
]
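Besides an HTTP(S) URL, the "image" entry can also reference a local file; the sketch below uses a placeholder path and relies on qwen_vl_utils resolving file:// URIs:

# Variant (sketch): point "image" at a local file instead of a URL.
local_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},  # placeholder path
            {"type": "text", "text": "What does this photo show?"},
        ],
    }
]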
# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
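# process_vision_info collects and loads the image/video entries referenced in the messages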
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
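# For an image prompt, inputs typically holds input_ids, attention_mask, pixel_values, and image_grid_thw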
# Inference: Generation of the output
device = "cuda"
inputs = inputs.to(device)
model.to(device)
model.eval()
generated_ids = model.generate(**inputs, max_new_tokens=128)
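# Strip the prompt tokens so that only the newly generated tokens are decoded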
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Repository Structure
AWQ_config.json: Configuration file for AWQ quantization.
quantization.py: Python script for converting the original model to the AWQ quantized version.
model.safetensors: Model state dict file.
Notes
The quantized model is optimized for memory efficiency, but its inference is currently much slower (about 3x) than the original model. I will try to identify the cause and provide a faster version in a future update; a quick latency and memory check is sketched below.
Ensure you have sufficient GPU memory (at least 6GB) to load and run the quantized model.
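To check the slowdown and memory footprint on your own hardware, here is a minimal sketch (not part of the repository) that reuses the model and inputs from the inference example above:

# Sketch: measure generation latency and peak GPU memory.
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
generated_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = generated_ids.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s, "
      f"peak GPU memory {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")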