This repository contains only the AttnGate weights for the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model. These gates are applied during the decoding stage. Note that the current inference framework is unoptimized and intended only for accuracy evaluation.
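A minimal sketch of fetching these gate weights, assuming the `huggingface_hub` Python client is installed; how the gates are then attached to the base model depends on the SeerAttention inference code and is not shown here.

```python
# Sketch: download the AttnGate checkpoint files from this repository.
# The exact loading/attachment API is defined by the SeerAttention codebase.
from huggingface_hub import snapshot_download

gate_ckpt_dir = snapshot_download(
    "SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-Decode-AttnGates"
)
print(gate_ckpt_dir)  # local directory containing the AttnGate weight files
```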
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the block-wise attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
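As a rough illustration of the threshold path, the sketch below converts block-level gate scores into a binary block mask. The tensor shapes, the always-kept diagonal block, and the example threshold are illustrative assumptions, not the official block-sparse kernel.

```python
# Minimal sketch: threshold learned block-level gate scores into a binary
# block-sparse mask. Shapes and the threshold are assumptions for illustration.
import torch

def block_mask_from_gate_scores(scores: torch.Tensor, threshold: float = 0.005) -> torch.Tensor:
    """scores: [num_heads, num_query_blocks, num_kv_blocks] soft gate scores in [0, 1]."""
    mask = scores >= threshold  # keep only blocks whose gate score exceeds the threshold
    # Always keep the diagonal (local) block so every query block attends to something.
    eye = torch.eye(scores.shape[-2], scores.shape[-1], dtype=torch.bool, device=scores.device)
    return mask | eye

scores = torch.rand(8, 16, 16)  # dummy gate outputs for 8 heads, 16x16 blocks
mask = block_mask_from_gate_scores(scores)
print(f"block sparsity: {1 - mask.float().mean():.2%}")
```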
AIME results (accuracy and attention sparsity at different AttnGate thresholds, compared with dense attention):

| Threshold | 0.005 | 0.001 | Dense |
|---|---|---|---|
| Accuracy (%) | 73.33 | 73.33 | 70 |
| Sparsity (%) | 86 | 61.68 | 0 |
Base model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)