# CodeModernBERT-Owl

## Overview

🦉 CodeModernBERT-Owl: a high-accuracy code search and code understanding model

CodeModernBERT-Owl is a pretrained model designed from scratch for code search and code understanding tasks.

Compared to previous versions such as CodeHawks-ModernBERT and CodeMorph-ModernBERT, this model now supports Rust and improves search accuracy in Python, PHP, Java, JavaScript, Go, and Ruby.

## 🛠 Key Features

- Supports long sequences of up to 2,048 tokens, compared with Microsoft's 512-token models (see the sketch after this list)
- Optimized for code search, code understanding, and code clone detection
- Fine-tuned on GitHub open-source repositories (Java, Rust)
- Achieves the highest accuracy in the CodeHawks/CodeMorph series
- Multi-language support: Python, PHP, Java, JavaScript, Go, Ruby, and Rust
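
The 2,048-token window means a long source file can often be encoded in a single pass. A minimal sketch (the input string is an arbitrary illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")

# Arbitrary long input; a 512-token model would have to split this into chunks
long_code = "def f(x):\n    return x + 1\n" * 300
inputs = tokenizer(long_code, return_tensors="pt", truncation=True, max_length=2048)
print(inputs["input_ids"].shape)  # at most (1, 2048)
```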


## 📊 Model Parameters

| Parameter | Value |
|---|---|
| `vocab_size` | 50,000 |
| `hidden_size` | 768 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3,072 |
| `max_position_embeddings` | 2,048 |
| `type_vocab_size` | 2 |
| `hidden_dropout_prob` | 0.1 |
| `attention_probs_dropout_prob` | 0.1 |
| `local_attention_window` | 128 |
| `rope_theta` | 160,000 |
| `local_attention_rope_theta` | 10,000 |
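
These values can be checked directly from the model's published configuration, for example:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.max_position_embeddings)  # 2048
```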

## 💻 How to Use

This model can be easily loaded using the HF中国镜像站 Transformers library.

⚠️ Requires `transformers >= 4.48.0`
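
If needed, upgrade with:

```bash
pip install -U "transformers>=4.48.0"
```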

🔗 Colab Demo (replace the model name with "CodeModernBERT-Owl")

### Load the Model

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```

### Get Code Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def get_embedding(text, model, tokenizer, device=device):
    # Tokenize; drop token_type_ids, which this model does not use
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # model.model is the base encoder underneath the masked-LM head
    outputs = model.model(**inputs)
    # Use the first ([CLS]) token as the sequence embedding
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768])
```
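
For code search, the embeddings can be compared with cosine similarity to rank candidate snippets against a natural-language query. A minimal sketch building on `get_embedding` above (the query and candidates are illustrative):

```python
import torch.nn.functional as F

query = get_embedding("sort a list in descending order", model, tokenizer)
candidates = [
    "def sort_desc(xs):\n    return sorted(xs, reverse=True)",
    "def add(a, b):\n    return a + b",
]
cand_embs = torch.cat([get_embedding(c, model, tokenizer) for c in candidates])

# Rank candidates by cosine similarity to the query embedding
scores = F.cosine_similarity(query, cand_embs)
print(candidates[scores.argmax().item()])
```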

## 🔍 Evaluation Results

### Dataset

📌 Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.
📌 Rust-specific evaluation was conducted on `Shuu12121/rust-codesearch-dataset-open`.
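
The CodeXGLUE data is available on the Hub and can be loaded per language; a sketch for the Python split (field names follow the published dataset):

```python
from datasets import load_dataset

ds = load_dataset("code_x_glue_ct_code_to_text", "python", split="test")
print(ds[0]["docstring"])  # natural-language query side
print(ds[0]["code"])       # paired code snippet
```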


### 📈 Key Evaluation Metrics (Same Seed)

| Language | CodeModernBERT-Owl | CodeHawks-ModernBERT | Salesforce CodeT5+ | Microsoft CodeBERT | GraphCodeBERT |
|---|---|---|---|---|---|
| Python | 0.8793 | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
| Java | 0.8880 | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
| JavaScript | 0.8423 | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
| PHP | 0.9129 | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
| Ruby | 0.8038 | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
| Go | 0.9386 | 0.9043 | 0.8117 | 0.3262 | 0.4243 |

- Achieves the highest accuracy among the compared models in every target language.
- Java accuracy improved significantly thanks to additional fine-tuning on GitHub data.
- Outperforms the previous models especially in PHP and Go.


### 📊 Rust Performance (In-House Dataset)

| Metric | CodeModernBERT-Owl |
|---|---|
| MRR | 0.7940 |
| MAP | 0.7940 |
| R-Precision | 0.7173 |

### 📌 Evaluation Metrics by K

| K | Recall@K | Precision@K | NDCG@K | F1@K | Success Rate@K | Query Coverage@K |
|---|---|---|---|---|---|---|
| 1 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
| 5 | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
| 10 | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
| 50 | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
| 100 | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
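
For reference, with a single relevant snippet per query, Recall@K reduces to a top-K hit indicator, which is why the Recall@K, Success Rate@K, and Query Coverage@K columns coincide above. A small sketch of that relationship and of the reciprocal rank underlying MRR (the rank value is illustrative):

```python
def recall_at_k(rank: int, k: int) -> float:
    # One relevant item per query: a hit iff its 1-based rank is within the top K
    return 1.0 if rank <= k else 0.0

def reciprocal_rank(rank: int) -> float:
    # MRR is the mean of this value over all queries
    return 1.0 / rank

rank = 3  # e.g. the correct snippet is ranked 3rd among 100 candidates
print(recall_at_k(rank, 1), recall_at_k(rank, 5))  # 0.0 1.0
print(round(reciprocal_rank(rank), 4))             # 0.3333
```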

## 📝 Conclusion

- Top performance across all evaluated languages
- Rust support added through dataset augmentation
- Further improvements are possible with better datasets


## 📜 License

📄 Apache 2.0

## 📧 Contact

📩 For any questions, please contact:
📧 [email protected]
