
HuggingFaceFW
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).
The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open-source libraries, such as the 🤗 libraries datatrove, nanotron and lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
Version 2 of the 🥂 FineWeb dataset (a multilingual extension to more than 1,800 languages/scripts) is available here.
Collections (5)
FineWeb: decanting the web for the finest text data at scale
🍷 Generate high-quality web text data for LLM training
HuggingFaceFW/fineweb
HuggingFaceFW/fineweb-edu
HuggingFaceFW/fineweb-edu-score-2
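Since these datasets are large, it can help to read a few documents lazily before committing to a full download. The following is a minimal sketch, not an official recipe: it assumes the 🤗 `datasets` library is installed, that the repository exposes a subset named `sample-10BT`, and that each row carries a `text` field. The function names `open_stream` and `first_texts` are made up for illustration; check the dataset card for the actual subset and column names.

```python
from itertools import islice
from typing import Iterable


def open_stream(repo_id: str = "HuggingFaceFW/fineweb", config: str = "sample-10BT"):
    """Open a Hub dataset in streaming mode, without downloading it in full."""
    # Imported lazily so first_texts() below stays usable without the library.
    from datasets import load_dataset  # pip install datasets

    # `config` is an assumed subset name; the dataset card lists the real ones.
    return load_dataset(repo_id, name=config, split="train", streaming=True)


def first_texts(rows: Iterable[dict], limit: int, field: str = "text") -> list[str]:
    """Collect the `field` value from the first `limit` rows of any row iterable."""
    # `field` defaults to "text", which is assumed to be the document column.
    return [row[field] for row in islice(rows, limit)]
```

Calling `first_texts(open_stream(), 3)` would then fetch three documents over the network; the same helper also works on any in-memory list of row dictionaries.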
Spaces (5)
Discussion Forum
FineWeb: decanting the web for the finest text data at scale
Generate high-quality web text data for LLM training
Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks
Evaluate multilingual models using FineTasks
Tasks Explorer
Datasets Metrics Explorer
Models (30)

HuggingFaceFW/fineweb-edu-classifier

HuggingFaceFW/Datasets-Metrics-Viewer-Data

HuggingFaceFW/ablation-model-fineweb-edu

HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT

HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT

HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT

HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT

HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT

HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT
