Q4KS
Hello!
Can we get Q4KS Qs, since Q4KM is too slow for me? It's a 1GB difference.
Or can I just download Bartowski's variant from https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF ?
Is there a difference between Bartowski and Unsloth Qs?
I too am wondering what exactly is different about the dynamic quants, and whether it is relevant to GGUF for llama.cpp or just the bnb-4bit for vLLM.
Let's look closer at some available information:
- The recent DeepSeek-R1 Unsloth Dynamic flavor quants labeled `UD` (e.g. `UD-Q2_K_XL`) do use a custom unslothai fork of llama.cpp's `llama-quant.cpp`, though the modifications seem specific to the DeepSeek-V3 MoE architecture. (This is a great model btw if you have 96GB+ RAM and a single 16GB+ VRAM CUDA GPU for ktransformers.)
- The blog post methodology to create the above `UD` quants uses the same Bartowski importance matrix when applied to the smaller quants. The file sizes come out identical:
  - `unsloth/QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf` = 19.9GB
  - `bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-Q4_K_M.gguf` = 19.9GB
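If you only want a single quant file to compare for yourself, something like this should work (the filename is a placeholder for whichever quant you pick):

```bash
# assumes the HF CLI is installed: pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Qwen_QwQ-32B-GGUF \
    Qwen_QwQ-32B-Q4_K_S.gguf --local-dir .
```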
Loading either one prints the same special tokens:

```
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
```
So the unsloth blog post on QwQ-32B mentions an additional bug fix in the tokenizer, which likely mostly affects fine-tuning: the EOS token is correct, but the PAD token should probably rather be `<|vision_pad|>`. The tokenizer config currently has:

```json
"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",
```
So from what I can tell, given the file size is exactly the same, you are probably fine using the bartowski quant if you need a specific Q4_K_S size. I've had good luck with bartowski's IQ4_XS, which barely fits 32k context on my 3090TI's 24GB VRAM at over 30 tok/sec.
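For reference, a rough sketch of how I'd serve that with llama.cpp's `llama-server` (the path, port, and q8_0 KV-cache choice are my assumptions; `-fa` flash attention is required for the quantized KV cache):

```bash
# offload all layers (-ngl 99) and squeeze 32k context into 24GB
./build/bin/llama-server \
    -m Qwen_QwQ-32B-IQ4_XS.gguf \
    -c 32768 -ngl 99 -fa \
    -ctk q8_0 -ctv q8_0 \
    --host 127.0.0.1 --port 8081
```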
If you are fine-tuning, then pay closer attention to that `pad_token` bug fix, which does not seem to exist in the bartowski quant that I have.
If you want the "dynamic quant", you likely have to use vLLM with the 22.5GB unsloth/QwQ-32B-unsloth-bnb-4bit, something like this:
```bash
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source venv/bin/activate
uv pip install vllm "bitsandbytes>=0.43.5"
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3

# https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve
OMP_NUM_THREADS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve \
    unsloth/QwQ-32B-unsloth-bnb-4bit \
    --download-dir /mnt/raid/models/ \
    --load-format bitsandbytes \
    --quantization bitsandbytes \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len 32768 \
    --host 127.0.0.1 \
    --port 8080
```
NOTE: You must explicitly specify `bitsandbytes` for `--load-format` and `--quantization`; otherwise, even if it detects correctly using `auto`, it will crash with a `KeyError` on the last layer or something.
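Once it's up, a quick smoke test against the OpenAI-compatible endpoint (the model name must match the repo id):

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```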
Also fwiw, I believe bitsandbytes is not yet supported with tensor parallelism in vLLM, so question this potential AI slop over here unless they show proof of 2x GPUs: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit/discussions/4#67ccca675847e4787a22c8e8 (it might work on 4090+ architecture 9.0+ only and not on the older 8.6 3090TI / RTX A6000)... vLLM bug showing tensor parallelism not yet supported: https://github.com/vllm-project/vllm/issues/14449
fwiw I just tested the bartowski Qwen_QwQ-32B-IQ4_XS.gguf on the 1-shot flappy bird prompt using the ik_llama.cpp fork and it looks good to me. Let us know how it works out for you! (It used ~14k context, so you will want 16k minimum and preferably more.)
> Create a Flappy Bird game in Python. You must include these things:
> 1. You must use pygame.
> 2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
> 3. Pressing SPACE multiple times will accelerate the bird.
> 4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
> 5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
> 6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
> 7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
> 8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
> The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
Thank you for the explanation!
I'm using a 3060 12GB, so it is really important to optimize the size of the Qs.
Q4KS with 12k context at Q8 gives me a decent speed of 3.5-3.8 tps, but Q4KM gives me closer to 3-3.2 tps.
:oof: yeah, 12GB is not really enough for a 32B model, especially a CoT reasoning model which requires additional context length. You take a big penalty for not offloading everything onto the GPU. Honestly, consider trying the IQ2_XXS and filling the remaining VRAM with context (maybe 8k will fit?)... The IQ3_XXS is already going to use more than 12GB, but it might be worth a shot with just a few layers in RAM.
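For budgeting the context, here's a back-of-the-envelope KV-cache estimate, assuming QwQ-32B keeps the Qwen2.5-32B shape (64 layers, 8 KV heads, head_dim 128; those numbers are my assumption from the base model's config):

```bash
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
echo "$((2 * 64 * 8 * 128 * 12288 * 2 / 1024 / 1024)) MiB fp16 at 12k ctx"  # 3072
echo "$((2 * 64 * 8 * 128 * 32768 * 2 / 1024 / 1024)) MiB fp16 at 32k ctx"  # 8192
# a q8_0 KV cache is roughly half of fp16: ~1.5 GiB at 12k, ~4 GiB at 32k
```

Since the Q4_K_S weights alone are ~19GB (per the 1GB difference noted above), most of the squeeze on a 12GB card has to come from the weight quant, which is why IQ2_XXS helps more than trimming context.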
Have fun!