EuroBERT Collection Scaling Multilingual Encoders for European Languages • 4 items • Updated 3 days ago • 8
🧠 Reasoning datasets Collection Datasets with reasoning traces for math and code released by the community • 14 items • Updated 2 days ago • 100
Article From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub 30 days ago • 49
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 203
Article Mastering Long Contexts in LLMs with KVPress By nvidia and 1 other • Jan 23 • 64
Article Fine-tune ModernBERT for RAG with Synthetic Data By sdiazlor and 2 others • Jan 20 • 37
Article Train 400x faster Static Embedding Models with Sentence Transformers Jan 15 • 159
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data Paper • 2410.01560 • Published Oct 2, 2024 • 4
Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP Paper • 2408.04303 • Published Aug 8, 2024 • 20
Article Fine-tune a SmolLM on domain-specific synthetic data from a LLM By davidberenstein1957 • Jan 3 • 36
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published Dec 18, 2024 • 135
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning Paper • 2407.04078 • Published Jul 4, 2024 • 21