Why clamp qkv_states? Is it common?
#44, opened by jay68
In line 318 of modeling_dbrx.py, together with the "clip_qkv": 8 configuration, DBRX clamps the values of qkv_states to the range [-8, 8].
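To make the behavior concrete, here is a minimal sketch of the clamping as I understand it. It is not the code from modeling_dbrx.py: the function name `clamp_qkv` is hypothetical, it uses plain Python lists instead of tensors, and treating `None` as "clipping disabled" is my assumption. On a PyTorch tensor the same effect would come from `torch.clamp(x, min=-clip_qkv, max=clip_qkv)`.

```python
def clamp_qkv(values, clip_qkv=8.0):
    """Saturate each value to [-clip_qkv, clip_qkv].

    Hypothetical helper for illustration only. Passing clip_qkv=None
    is assumed to mean "no clipping" and returns the values unchanged.
    """
    if clip_qkv is None:
        return list(values)
    # Values below -clip_qkv become -clip_qkv; values above clip_qkv become clip_qkv.
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


# Made-up example values standing in for entries of qkv_states.
qkv_states = [-12.5, -8.0, 0.3, 7.9, 20.0]
print(clamp_qkv(qkv_states))  # [-8.0, -8.0, 0.3, 7.9, 8.0]
```

Only the two out-of-range entries change; everything already inside [-8, 8] passes through untouched.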
Is this setting applied only at inference, or during both training and inference?
Why does DBRX do this? Is there any prior work it is based on that could be cited?