Q4KS

#4
by Alastar-Smith - opened

Hello!
Can we get Q4_K_S quants? The Q4_K_M is too slow for me; it's a 1GB size difference.
Or can I just download Bartowski's variant from https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF ?
Is there a difference between the Bartowski and Unsloth quants?

I too am wondering what exactly is different about the dynamic quants, and whether it is relevant to GGUF for llama.cpp or just to the bnb-4bit for vLLM.

Let's take a closer look at the available information:

  1. The recent DeepSeek-R1 Unsloth Dynamic quants labeled UD (e.g. UD-Q2_K_XL) do use a custom unslothai fork of llama.cpp's llama-quant.cpp, though the modifications seem specific to the DeepSeek-V3 MoE architecture. (That's a great model btw, if you have 96GB+ RAM and a single 16GB+ VRAM CUDA GPU for ktransformers.)
  2. The blog post methodology used to create the above UD quants uses the same Bartowski importance matrix for the smaller quants.
  3. unsloth/QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf = 19.9GB
  4. bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-Q4_K_M.gguf = 19.9GB, which loads with the following tokenizer metadata (you can also inspect this without loading the model; see the sketch after this list):
  • llm_load_print_meta: EOS token = 151645 '<|im_end|>'
  • llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
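
If you want to double-check a GGUF's tokenizer metadata yourself without loading the whole model, the gguf pip package ships a gguf-dump CLI. A minimal sketch (the filenames are placeholders for whichever quants you downloaded, and flag spellings may drift between gguf versions):

# sketch: dump GGUF metadata only (no tensor data) and filter for tokenizer fields
pip install gguf
gguf-dump --no-tensors QwQ-32B-Q4_K_M.gguf | grep -i token
gguf-dump --no-tensors Qwen_QwQ-32B-Q4_K_M.gguf | grep -i token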

So the unsloth blog post on QwQ-32B mentions an additional bug fix in the tokenizer, which likely mostly affects fine-tuning:

The EOS token is correct, but the PAD token should probably be "<|vision_pad|>" rather than what the config currently has:

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

So from what I can tell, given the file size is exactly the same, you are probably fine using the bartowski quant if you need a specific Q4_K_S size. I've had good luck with bartowski's IQ4_XS, which barely fits 32k context on my 3090TI's 24GB VRAM at over 30 tok/sec.
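
For reference, here is a minimal sketch of roughly how that fits, assuming mainline llama.cpp's llama-server (exact flag spellings drift between versions; the q8_0 KV cache is one way to make 32k context fit in 24GB):

# sketch: llama-server on a 24GB card, all layers offloaded
# -ngl 99 offloads everything; -fa enables flash attention (needed for -ctv);
# -ctk/-ctv q8_0 quantize the KV cache so the 32k context fits
./llama-server \
    -m Qwen_QwQ-32B-IQ4_XS.gguf \
    -ngl 99 \
    -c 32768 \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    --host 127.0.0.1 --port 8081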

If you are fine-tuning, pay closer attention to that pad_token bug fix, which does not seem to be present in the bartowski quant that I have.

If you want the "dynamic quant", you likely have to use vLLM with the 22.5GB unsloth/QwQ-32B-unsloth-bnb-4bit, something like this:

# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source venv/bin/activate
uv pip install vllm "bitsandbytes>=0.43.5"
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3

# https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve
OMP_NUM_THREADS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve \
    unsloth/QwQ-32B-unsloth-bnb-4bit \
    --download-dir /mnt/raid/models/ \
    --load-format bitsandbytes \
    --quantization bitsandbytes \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len 32768 \
    --host 127.0.0.1 \
    --port 8080

NOTE: You must explicitly specify bitsandbytes for both --load-format and --quantization; otherwise, even if auto detects it correctly, it will crash with a KeyError on the last layer or similar.
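
Once the server is up, you can sanity-check it through the standard OpenAI-compatible chat endpoint (host/port/model match the serve command above):

# quick sanity check of the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 128
      }'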

Also fwiw, I believe bitsandbytes is not yet supported with tensor parallelism in vLLM, so question this potential AI slop over here unless they show proof of 2x GPUs: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit/discussions/4#67ccca675847e4787a22c8e8 (it might work only on 4090+ architecture, 9.0+, and not on the older 8.6 3090TI / RTX A6000)...

vLLM bug showing tensor parallelism is not yet supported: https://github.com/vllm-project/vllm/issues/14449

fwiw, I just tested the bartowski Qwen_QwQ-32B-IQ4_XS.gguf on the 1-shot flappy bird prompt below using the ik_llama.cpp fork, and it looks good to me. Let us know how it works out! (It used ~14k context, so you will want 16k minimum and preferably more.)

Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.

Thank you for the explanation!
I'm using a 3060 12GB, so it is really important for me to optimize the quant size.
Q4_K_S with 12k context at Q8 gives me a decent speed of 3.5-3.8 tps.
But Q4_K_M gives me closer to 3-3.2 tps.

@Alastar-Smith

:oof: yeah, 12GB is not really enough for a 32B model, especially a CoT reasoning model, which needs extra context length for its reasoning. You take a big penalty for not offloading everything onto the GPU. Honestly, consider trying the IQ2_XXS and filling the remaining VRAM with context (maybe 8k will fit?)... The IQ3_XXS is already going to use more than 12GB, but it might be worth a shot with just a few layers in RAM (see the sketch below).
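
Something like this is what I would try, as a sketch assuming llama.cpp's llama-server (filenames are placeholders for whichever repo you download from; -ngl is the knob to lower until it stops OOMing):

# sketch: IQ2_XXS fully offloaded on a 12GB card with a small context
./llama-server -m QwQ-32B-IQ2_XXS.gguf -ngl 99 -c 8192 -fa

# sketch: IQ3_XXS with the last few layers left in system RAM
# (QwQ-32B has 64 layers; lower -ngl until the weights fit)
./llama-server -m QwQ-32B-IQ3_XXS.gguf -ngl 56 -c 8192 -fa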

Have fun!
