Pretrained Reward Model Classifier
Overview
This is a specialized binary classifier that evaluates text chunks and predicts whether they would be "Chosen" (A) or "Rejected" (B).
How It Works
- Text is split into chunks of exactly 64 tokens using the Qwen 2.5 tokenizer (see the chunking sketch after this list)
- The model judges the chunk inside the judgement region, given the preceding chunks as context, using the input format described below
- Only token IDs 32 and 33 have non-zero weights in the LM head (A = Chosen, B = Rejected)
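The snippet below is a minimal sketch of the 64-token chunking step, assuming the standard Hugging Face tokenizer for Qwen/Qwen2.5-7B; the helper name is illustrative and not part of any official script.

```python
from transformers import AutoTokenizer

CHUNK_SIZE = 64  # exact chunk length in tokens, per the description above

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

def split_into_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive chunks of exactly chunk_size tokens
    (the final chunk may be shorter)."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [
        tokenizer.decode(token_ids[i : i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]
```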
Input Format
The model expects input in this precise format:
[Original text from previous 64-token chunks]
<<JUDGEMENT_REGION>>
[Next 64-token chunk to evaluate]
<</JUDGEMENT_REGION>>
<<JUDGEMENT>>
Example
Original paragraph:
The city council meeting started promptly at 6 PM with all members present. Mayor Johnson opened by addressing concerns about the new parking regulations downtown. Citizens expressed both support and opposition during the public comment period. The council ultimately voted 4-2 to implement the regulations starting next month.
Formatted for prediction:
The city council meeting started promptly at 6 PM with all members present. Mayor Johnson opened by addressing concerns about the new parking regulations downtown.
<<JUDGEMENT_REGION>>
Citizens expressed both support and opposition during the public comment period. The council ultimately voted 4-2 to implement the regulations starting next month.
<</JUDGEMENT_REGION>>
<<JUDGEMENT>>
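For reference, a small sketch of how this prompt could be assembled programmatically. The marker strings are copied verbatim from the format specification above; the function name and exact newline placement are assumptions.

```python
def build_prompt(previous_chunks: list[str], next_chunk: str) -> str:
    """Wrap the next 64-token chunk in the judgement markers,
    preceded by the earlier chunks as context."""
    context = "".join(previous_chunks)
    return (
        f"{context}\n"
        "<<JUDGEMENT_REGION>>\n"
        f"{next_chunk}\n"
        "<</JUDGEMENT_REGION>>\n"
        "<<JUDGEMENT>>"
    )
```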
Output
The model predicts whether the chunk in the JUDGEMENT_REGION would be:
- A: Chosen (preferred content)
- B: Rejected (less preferred content)
The prediction is based only on the relative probabilities of these two tokens.
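Below is a hedged sketch of reading the A/B preference from the final-position logits. It assumes token IDs 32 and 33 correspond to "A" and "B" as stated in "How It Works", and that the model can be loaded as a standard causal LM with transformers; this mirrors generic library usage rather than an official inference script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Quest-AI/pretrain-rm-baseline-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

A_TOKEN_ID, B_TOKEN_ID = 32, 33  # A = Chosen, B = Rejected

@torch.no_grad()
def predict(prompt: str) -> dict:
    """Return the renormalized probabilities of the A and B tokens
    at the position following <<JUDGEMENT>>."""
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # logits at the final prompt position
    probs = torch.softmax(logits[[A_TOKEN_ID, B_TOKEN_ID]], dim=-1)
    return {"P(A)": probs[0].item(), "P(B)": probs[1].item()}
```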
Analysis
For practical use, aggregate results across chunks by taking the mean of the log ratio between the two probabilities:
log_ratio = log(P(A) / P(B))
This log ratio approach provides a more stable and interpretable signal across multiple evaluations than using raw probabilities alone.
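A small sketch of that aggregation: compute log(P(A) / P(B)) for each evaluated chunk and average the results. The input is assumed to be a list of dicts like those returned by the predict() sketch above.

```python
import math

def mean_log_ratio(predictions: list[dict]) -> float:
    """Average log(P(A) / P(B)) across all evaluated chunks."""
    log_ratios = [math.log(p["P(A)"] / p["P(B)"]) for p in predictions]
    return sum(log_ratios) / len(log_ratios)
```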
Model: Quest-AI/pretrain-rm-baseline-7b
Base model: Qwen/Qwen2.5-7B