Andrea Soria PRO

asoria

AI & ML interests

Maintainer of 🤗Datasets: Data processing

Recent Activity

updated a dataset 1 day ago
asoria/dataset-notebook-creator-content
updated a Space 7 days ago
asoria/AlfredAgent
published a Space 7 days ago
asoria/AlfredAgent

Organizations

Hugging Face, BigScience Data, Datasets Maintainers, Blog-explorers, Enterprise Explorers, ZeroGPU Explorers, Datasets examples, Women on Hugging Face, Dev Mode Explorers, Hugging Face Discord Community, AI Developers from Latin America, Datasets Topics, AI Starter Pack

Posts 4

🚀 Exploring Topic Modeling with BERTopic 🤖

When you come across an interesting dataset, you often wonder:
Which topics frequently appear in these documents? 🤔
What is this data really about? 📊

Topic modeling helps answer these questions by identifying recurring themes within a collection of documents. This process enables quick and efficient exploratory data analysis.

I’ve been working on an app that leverages BERTopic, a flexible framework for topic modeling. BERTopic is modular, so you can swap any component for your preferred algorithm. It can also handle large datasets: models fitted separately can be combined into one with BERTopic.merge_models. 🔗
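As a rough sketch of the merging idea, the snippet below fits one BERTopic model per chunk of documents and then combines them with `BERTopic.merge_models`. The helper names (`chunked`, `fit_and_merge`), the chunk size, and the `min_similarity` value are illustrative assumptions, not details taken from the app.

```python
from typing import Iterator, List, Sequence


def chunked(docs: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `docs` holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])


def fit_and_merge(docs: Sequence[str], chunk_size: int = 10_000):
    """Fit one BERTopic model per chunk, then merge them into a single model.

    Hypothetical helper: chunk_size and min_similarity are assumed values.
    """
    # Imported lazily so the chunking helper works without bertopic installed.
    from bertopic import BERTopic

    models = [BERTopic().fit(chunk) for chunk in chunked(docs, chunk_size)]
    if not models:
        raise ValueError("docs must not be empty")
    if len(models) == 1:
        return models[0]
    # Topics from later models that are not similar enough to any topic
    # in the first model are added to the merged model as new topics.
    return BERTopic.merge_models(models, min_similarity=0.7)
```

Because each chunk is fitted on its own, only one chunk's embeddings need to be in memory at a time, which is the main appeal of this approach for large datasets.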

🔍 How do we make this work?
Here’s the stack we’re using:

📂 Data Source ➡️ Hugging Face datasets with DuckDB for retrieval
🧠 Text Embeddings ➡️ Sentence Transformers (all-MiniLM-L6-v2)
⚡ Dimensionality Reduction ➡️ RAPIDS cuML UMAP for GPU-accelerated performance
🔍 Clustering ➡️ RAPIDS cuML HDBSCAN for fast clustering
✂️ Tokenization ➡️ CountVectorizer
🔧 Representation Tuning ➡️ KeyBERTInspired + Hugging Face Inference Client with Meta-Llama-3-8B-Instruct
🌍 Visualization ➡️ Datamapplot library
Check out the Space to quickly generate topics from your own dataset: datasets-topics/topics-generator
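A minimal sketch of how the pieces in this stack might be wired together is shown below. It is not the Space's actual code: the function names, the Parquet path layout (default config, train split), and all hyperparameter values are assumptions for illustration. The Llama-based representation step and the Datamapplot visualization are left out to keep the example short.

```python
from typing import List


def parquet_query(repo_id: str, column: str, limit: int = 1000) -> str:
    """Build a DuckDB query over a dataset's auto-converted Parquet files."""
    # Assumed layout: "default" config and "train" split.
    path = f"hf://datasets/{repo_id}@~parquet/default/train/*.parquet"
    return f"SELECT \"{column}\" FROM '{path}' LIMIT {int(limit)}"


def load_docs(repo_id: str, column: str, limit: int = 1000) -> List[str]:
    """Fetch one text column from the Hub with DuckDB, skipping empty rows."""
    import duckdb  # lazy import: only needed when data is actually fetched

    rows = duckdb.sql(parquet_query(repo_id, column, limit)).fetchall()
    return [row[0] for row in rows if row[0]]


def build_topic_model(use_gpu: bool = False):
    """Assemble a BERTopic model from the components listed above."""
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    if use_gpu:
        # RAPIDS cuML implementations (require an NVIDIA GPU).
        from cuml.cluster import HDBSCAN
        from cuml.manifold import UMAP
    else:
        # CPU fallbacks with the same constructor arguments used here.
        from hdbscan import HDBSCAN
        from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        umap_model=UMAP(n_components=5, n_neighbors=15, min_dist=0.0),
        hdbscan_model=HDBSCAN(min_cluster_size=15, prediction_data=True),
        vectorizer_model=CountVectorizer(stop_words="english"),
        representation_model=KeyBERTInspired(),
    )
```

With these helpers, something like `build_topic_model().fit_transform(load_docs(repo_id, column))` would return the topic assignment for each document, where `repo_id` and `column` stand in for whichever dataset and text column you want to explore.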

Powered by @MaartenGr - BERTopic

Articles 6

Exploring Synthetic Data Generation with DataDreamer