adaamko committed on
Commit c7fd498 · verified · 1 Parent(s): 3e4b78d

Update README.md

Files changed (1)
  1. README.md +6 -8
README.md CHANGED
@@ -70,9 +70,11 @@ print("Predictions:", predictions)
 ```
 
 
- ## Details
+ ## Performance
 
- We evaluate our model on the test set of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. We evaluate both example-level (can we detect that a given answer contains hallucinations) and span-level (can we detect which parts of the answer are hallucinated).
+ **Example-level results**
+
+ We evaluate our model on the test set of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. Our large model, **lettucedetect-large-v1**, achieves an overall F1 score of 79.22%, outperforming prompt-based methods such as GPT-4 (63.4%) and encoder-based models such as [Luna](https://aclanthology.org/2025.coling-industry.34.pdf) (65.4%). It also surpasses fine-tuned LLAMA-2-13B (78.7%, reported in [RAGTruth](https://aclanthology.org/2024.acl-long.585/)) and is competitive with the SOTA fine-tuned LLAMA-3-8B (83.9%, reported in the [RAG-HAT paper](https://aclanthology.org/2024.emnlp-industry.113.pdf)). Overall, **lettucedetect-large-v1** and **lettucedetect-base-v1** are highly performant models while remaining efficient in inference settings.
 
 The results on the example-level can be seen in the table below.
 
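To make the example-level vs. span-level distinction above concrete, here is a minimal sketch that reduces span predictions to an example-level verdict. It reuses the `HallucinationDetector` quick-start referenced by the hunk header (`print("Predictions:", predictions)`); the exact model path and the `predict(..., output_format="spans")` signature are assumptions based on that snippet rather than a verified API.

```python
# Minimal sketch, assuming the HallucinationDetector quick-start API from
# this README; the model path and predict() signature are assumptions.
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

contexts = ["France is a country in Europe. The capital of France is Paris."]
question = "What is the capital of France?"
answer = "The capital of France is Paris, a city of 90 million people."

# Span level: which character ranges of the answer are unsupported by the context.
spans = detector.predict(
    contexts=contexts, question=question, answer=answer, output_format="spans"
)
print("Hallucinated spans:", spans)

# Example level: flag the whole answer if any hallucinated span was predicted.
print("Answer contains hallucination:", len(spans) > 0)
```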
@@ -80,18 +82,14 @@ The results on the example-level can be seen in the table below.
 <p align="center">
 <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/example_level_lettucedetect.png?raw=true" alt="Example-level Results" width="800"/>
 </p>
 
- Our large model (**lettucedetect-large-v1**) consistently achieves the highest scores across all data types and overall, and the base model (**lettucedetect-base-v1**) is also competitive across the benchmark. We beat the previous best model (fine-tuned LLAMA-2-13B) while being significantly smaller and faster (our models have 150M and 396M parameters, respectively, and process 30-60 examples per second on a single A100 GPU).
-
- The other non-prompt-based model is [Luna](https://aclanthology.org/2025.coling-industry.34.pdf), which is also a token-level model but uses a DeBERTa-large encoder. Our models are overall better than the Luna architecture (76.07 vs. 65.4 F1 for our **base** model on the _overall_ data type).
+ **Span-level results**
 
- The span-level results can be seen in the table below.
+ At the span level, our model achieves the best scores across all data types, significantly outperforming previous models. The results can be seen in the table below. Note that we do not compare against models such as [RAG-HAT](https://aclanthology.org/2024.emnlp-industry.113.pdf) here, since they report no span-level evaluation.
 
 <p align="center">
 <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/span_level_lettucedetect.png?raw=true" alt="Span-level Results" width="800"/>
 </p>
 
- Our models achieve the best scores for each data type and overall, beating the previous best model (fine-tuned LLAMA-2-13B) by a significant margin.
-
 ## Citing
 
 If you use the model or the tool, please cite the following:
 