Is it bitnet {-1,0,1}?

#6
by Remek - opened

I looked through many BitNet 1.58 implementations and noticed that they all use the method suggested in "The Era of 1-bit LLMs: Training Tips, Code and FAQ". The weights of models currently trained according to this recipe are not numbers from the set {-1, 0, 1} but values in the interval (0, 1). Is this how it should be?

  1. The formula describing the quantization of weights ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits").
  2. Implementation proposal ("The Era of 1-bit LLMs: Training Tips, Code and FAQ").
  3. Weights quantization test.
  4. Model during training.

[Image: 1.58bitnet.jpg]
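
For reference, the weight quantization from the "Training Tips, Code and FAQ" report is usually quoted roughly as in the sketch below (paraphrased, so treat the exact constants and clamps as approximate). The trailing `/ scale` is what makes the stored values fractional rather than exactly {-1, 0, 1}:

     import torch

     def weight_quant(w):
         # absmean scale per the 1.58-bit recipe
         scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
         # round to ternary {-1, 0, 1}, then divide the scale back out,
         # so the checkpoint ends up holding small fractional values
         return (w * scale).round().clamp_(-1, 1) / scale

     w = torch.randn(4, 4)
     print(weight_quant(w))  # multiples of mean(|w|), not exact -1/0/1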

I think you're correct, Remek. The scaling factor is only used inside the RoundClip, not after it. I guess there is an error in their code.

And maybe this is why the model's weights take up so much space in fp32, which shouldn't be the case since its parameters are supposed to be integers.

Also curious about it.

I think the reason is that they are using a high-precision GEMM to simulate the low-precision forward pass (int8 activation * ternary weight).
We can go back to the original BitNet paper (arXiv:2310.11453), eq. 11.
[Image: eq. 11 from the BitNet paper]
The correct process during linear forward is:

     # note: these quant functions return the raw quantized values and the scale;
     # they do NOT divide by the scale themselves
     x_quant, x_scaling_factor = activation_quant(x)        # int8 activations
     w_quant, w_scaling_factor = weight_quant(w)             # ternary weights
     output = low_precision_compute(x_quant, w_quant)        # int8 * ternary GEMM
     output = output / x_scaling_factor / w_scaling_factor   # dequantize once at the end
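
Just to make the pseudocode above concrete, here is one way those "no `/scale`" quant functions could look. This is my own adaptation of the FAQ code (names and exact constants are assumptions, not an official implementation): each function returns the quantized integers together with the scale instead of dividing it back out.

     import torch

     def activation_quant(x):
         # per-token absmax scale into the int8 range; returns ints plus scale
         scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
         return (x * scale).round().clamp_(-128, 127), scale

     def weight_quant(w):
         # absmean scale to ternary {-1, 0, 1}; returns ints plus scale
         scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
         return (w * scale).round().clamp_(-1, 1), scale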

Now, they use a high-precision compute function to simulate this process, and in that setting the previous process can be rewritten like this:

     # as above, the quant functions do not divide by the scale themselves;
     # instead each scale is folded back in right after quantization
     x_quant, x_scaling_factor = activation_quant(x)
     x_quant = x_quant / x_scaling_factor
     w_quant, w_scaling_factor = weight_quant(w)
     w_quant = w_quant / w_scaling_factor
     output = high_precision_compute(x_quant, w_quant)
     # no final rescaling needed; the scales were already applied above:
     # output = output / x_scaling_factor / w_scaling_factor

So I think their implementation is consistent with their claim.
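
For what it's worth, here is a quick numeric check of that equivalence using random stand-in tensors (a sketch with made-up shapes and scales, not the model's actual code): dequantizing before the matmul gives the same result as doing the integer-style matmul first and rescaling afterwards.

     import torch

     torch.manual_seed(0)
     x_quant = torch.randint(-128, 128, (4, 16)).float()  # stand-in int8 activations
     w_quant = torch.randint(-1, 2, (8, 16)).float()      # stand-in ternary weights
     x_scale, w_scale = 6.3, 2.7                          # arbitrary scaling factors

     # path 1 (low-precision style): matmul first, dequantize afterwards
     out_a = (x_quant @ w_quant.t()) / x_scale / w_scale
     # path 2 (simulated high-precision style): fold the scales in first, then matmul
     out_b = (x_quant / x_scale) @ (w_quant / w_scale).t()

     # difference is only floating-point rounding
     print((out_a - out_b).abs().max().item())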

Yes, I think so, EasonWei! It's just for convenience: instead of dequantizing after the multiplication, the simulation applies each scaling factor directly in its respective quantization step.
