
Why clamp qkv_states? Is it common?

#44
by jay68 - opened

In line 318 of modeling_dbrx.py, with the `"clip_qkv": 8` configuration, DBRX clamps the values of qkv_states to the range [-8, 8].
Is this config used only for inference, or for both training and inference?
Why does DBRX do this? Is there any published work that motivates it?
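For context, here is a minimal sketch approximating the clamp in question. The projection name `Wqkv` and the tensor dimensions are assumptions for illustration, not the exact modeling_dbrx.py source:

```python
import torch

clip_qkv = 8.0  # corresponds to "clip_qkv": 8 in the config

hidden_states = torch.randn(2, 16, 64)          # (batch, seq_len, hidden), toy sizes
Wqkv = torch.nn.Linear(64, 3 * 64, bias=False)  # hypothetical fused QKV projection

# The fused QKV output is clamped element-wise to [-clip_qkv, clip_qkv]
# before being split into query, key, and value states.
qkv_states = Wqkv(hidden_states)
qkv_states = qkv_states.clamp(min=-clip_qkv, max=clip_qkv)

query, key, value = qkv_states.split(64, dim=-1)
```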
