HuggingFaceFW/fineweb
A collection of datasets for LLM pretraining
Note Web datasets
Note Highly curated web datasets filtered using classifiers
Note Highly curated math pages from CommonCrawl
Note GitHub code dataset
Note Synthetic textbooks
Note Contains Cosmopedia v2 (synthetic textbooks) and Python-Edu (educational Python code)