Model Card for RuBERT_text_difficulty

A regression model that predicts a difficulty score for an input text. Predicted scores can be mapped to CEFR levels.

Model Details

Frozen BERT-large layers with a regression head on top. Trained on a mix of manually annotated datasets (more details on the data will follow).
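
Since the encoder is frozen, only the regression head receives gradient updates during training. A minimal sketch of how the BERT layers could be frozen (assuming a model instance with a .bert submodule, as in the CustomModel class shown below):

# Freeze every parameter of the BERT encoder so that only the
# regression head (pre_classifier + classifier) is trained.
for param in model.bert.parameters():
    param.requires_grad = False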

How to Get Started with the Model

Use the code below to get started with the model.

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer, BertModel, BertPreTrainedModel


class CustomModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.pre_classifier = nn.Linear(config.hidden_size, 128)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(128, 1)

        # Apply Xavier initialization
        nn.init.xavier_uniform_(self.pre_classifier.weight)
        nn.init.xavier_uniform_(self.classifier.weight)
        if self.pre_classifier.bias is not None:
            nn.init.constant_(self.pre_classifier.bias, 0)
        if self.classifier.bias is not None:
            nn.init.constant_(self.classifier.bias, 0)

    def forward(
            self,
            input_ids,
            labels=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )

        # Use the hidden state of the [CLS] token as the pooled representation
        pooled_output = outputs[0][:, 0]
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = nn.functional.relu(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            loss_fn = nn.MSELoss()
            loss = loss_fn(logits.view(-1), labels.view(-1))
            return loss, logits
        else:
            return None, logits


from huggingface_hub import hf_hub_download

model_path = "Askinkaty/RuBERT_text_difficulty"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.num_labels = 1

model = CustomModel(config)
# Download (or locate in the local cache) the checkpoint and load its weights
weights_path = hf_hub_download(model_path, "pytorch_model.bin")
model.load_state_dict(torch.load(weights_path, map_location=device))
model.to(device)
model.eval()

text = "Пример текста."  # example placeholder; replace with your own input text

inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    _, logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs["token_type_ids"],
    )

To map the predicted score to a CEFR level, use:

reg2cl2 = {
    '0.0': 'A1', '1.0': 'A1', '1.5': 'A12', '2.0': 'A2', '2.5': 'A2',
    '3.0': 'B1', '3.5': 'B12', '4.0': 'B2', '4.5': 'B2',
    '5.0': 'C1', '5.5': 'C12', '6.0': 'C2',
}
score = str(float(round(logits.item())))
print("Predicted score:", logits.item(), "CEFR level:", reg2cl2[score])

Training Details

Training Hyperparameters

  • learning_rate: 3e-4
  • num_train_epochs: 15.0
  • batch_size: 32
  • weight_decay: 0.1
  • adam_beta1: 0.9
  • adam_beta2: 0.99
  • adam_epsilon: 1e-8
  • max_grad_norm: 1.0
  • fp16: True
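
For reference, a minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments; the actual training script is not included in this repository, so names such as output_dir are illustrative:

from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above;
# output_dir is a placeholder, not the original training configuration.
training_args = TrainingArguments(
    output_dir="rubert-text-difficulty",
    learning_rate=3e-4,
    num_train_epochs=15.0,
    per_device_train_batch_size=32,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    fp16=True,
)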

Evaluation on test set

Citation

Please cite this repository (Askinkaty/RuBERT_text_difficulty) when using the model.
