la-leaderboard (La Leaderboard)

posted an update 2 days ago

Post

1512

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)

Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

albertvillanova

posted an update 7 days ago

Post

3529

🚀 New smolagents update: Safer Local Python Execution! 🦾🐍

With the latest release, we've added security checks to the local Python interpreter: every evaluation is now analyzed for dangerous builtins, modules, and functions. 🔒

Here's why this matters & what you need to know! 🧵👇

1️⃣ Why is local execution risky? ⚠️
AI agents that run arbitrary Python code can unintentionally (or maliciously) access system files, run unsafe commands, or exfiltrate data.

2️⃣ New Safety Layer in smolagents 🛡️
We now inspect every return value during execution:
✅ Allowed: Safe built-in types (e.g., numbers, strings, lists)
⛔ Blocked: Dangerous functions/modules (e.g., os.system, subprocess, exec, shutil)

3️⃣ Immediate Benefits 💡
- Prevent agents from accessing unsafe builtins
- Block unauthorized file or network access
- Reduce accidental security vulnerabilities

4️⃣ Security Disclaimer ⚠️
🚨 Despite these improvements, local Python execution is NEVER 100% safe. 🚨
If you need true isolation, use a remote sandboxed executor like Docker or E2B.

5️⃣ The Best Practice: Use Sandboxed Execution 🔐
For production-grade AI agents, we strongly recommend running code in a Docker or E2B sandbox to ensure complete isolation.

6️⃣ Upgrade Now & Stay Safe! 🚀
Check out the latest smolagents release and start building safer AI agents today.

🔗 https://github.com/huggingface/smolagents

What security measures do you take when running AI-generated code? Let’s discuss! 👇

#AI #smolagents #Python #Security

2 replies

·

albertvillanova

posted an update 8 days ago

Post

3773

🚀 Big news for AI agents! With the latest release of smolagents, you can now securely execute Python code in sandboxed Docker or E2B environments. 🦾🔒

Here's why this is a game-changer for agent-based systems: 🧵👇

1️⃣ Security First 🔐
Running AI agents in unrestricted Python environments is risky! With sandboxing, your agents are isolated, preventing unintended file access, network abuse, or system modifications.

2️⃣ Deterministic & Reproducible Runs 📦
By running agents in containerized environments, you ensure that every execution happens in a controlled and predictable setting—no more environment mismatches or dependency issues!

3️⃣ Resource Control & Limits 🚦
Docker and E2B allow you to enforce CPU, memory, and execution time limits, so rogue or inefficient agents don’t spiral out of control.

4️⃣ Safer Code Execution in Production 🏭
Deploy AI agents confidently, knowing that any generated code runs in an ephemeral, isolated environment, protecting your host machine and infrastructure.

5️⃣ Easy to Integrate 🛠️
With smolagents, you can simply configure your agent to use Docker or E2B as its execution backend—no need for complex security setups!

6️⃣ Perfect for Autonomous AI Agents 🤖
If your AI agents generate and execute code dynamically, this is a must-have to avoid security pitfalls while enabling advanced automation.

⚡ Get started now: https://github.com/huggingface/smolagents

What will you build with smolagents? Let us know! 🚀💡

mariagrandury

updated a dataset 12 days ago

la-leaderboard/requests

Viewer • Updated 12 days ago • 63 • 2.45k

mariagrandury

updated a dataset 15 days ago

la-leaderboard/results

Updated 15 days ago • 1.68k

mariagrandury

in la-leaderboard/la-leaderboard 15 days ago

It takes too long to get reuslts?

2

#16 opened 23 days ago by

StarscreamDeceptions

Algunas preguntas sobre la velocidad de evaluación

1

#15 opened 29 days ago by

StarscreamDeceptions

mariagrandury

updated a Space 15 days ago

71

La Leaderboard

🌸

Evaluate open LLMs in the languages of LATAM and Spain.

mariagrandury

in la-leaderboard/la-leaderboard 15 days ago

runtime error

1

#17 opened 15 days ago by

TonySilva423

clefourrier

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

albertvillanova

posted an update about 1 month ago

Post

3776

🚀 Introducing @huggingface Open Deep-Research💥

In just 24 hours, we built an open-source agent that:
✅ Autonomously browse the web
✅ Search, scroll & extract info
✅ Download & manipulate files
✅ Run calculations on data

55% on GAIA validation set! Help us improve it!💡
https://huggingface.co/blog/open-deep-research

3 replies

·

javicond3

authored 5 papers about 2 months ago

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Paper • 2406.17789 • Published May 28, 2024 • 1

How Stable is Stable Diffusion under Recursive InPainting (RIP)?

Paper • 2407.09549 • Published Jun 27, 2024 • 1

Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal

Paper • 2408.16012 • Published Aug 16, 2024

Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?

Paper • 2409.15334 • Published Sep 8, 2024 • 1

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Paper • 2501.09775 • Published Jan 16 • 29

gonzmart

authored a paper about 2 months ago

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Paper • 2501.09775 • Published Jan 16 • 29

reviriego

authored a paper about 2 months ago

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Paper • 2501.09775 • Published Jan 16 • 29

albertvillanova

posted an update 2 months ago

Post

2102

Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/

clefourrier

authored a paper 3 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 18

La Leaderboard

AI & ML interests

Recent Activity