---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---
# SmolDocling-256M-preview
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
## 🚀 Features
- 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments (see the illustrative snippet after this list).
- 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
- 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
- 💻 Code Recognition – Detects and formats code blocks, including indentation.
- 🔢 Formula Recognition – Identifies and processes mathematical expressions.
- 📊 Chart Recognition – Extracts and interprets chart data.
- 📑 Table Recognition – Supports column and row headers for structured table extraction.
- 🖼️ Figure Classification – Differentiates figures and graphical elements.
- 📝 Caption Correspondence – Links captions to relevant images and figures.
- 📜 List Grouping – Organizes and structures list elements correctly.
- 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.)
- 🔲 OCR with Bounding Boxes – Performs OCR within a user-specified bounding-box region.
- 📂 General Document Processing – Trained on both scientific and non-scientific documents.
- 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
- 📚 Multi-Page & Full Document Conversion – Coming soon! 🚧
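To make DocTags concrete, here is a rough hand-written illustration of the markup shape (tag names, coordinates, and whitespace are illustrative, not verbatim model output): each page element is wrapped in a tag, with quantized `<loc_x1><loc_y1><loc_x2><loc_y2>` tokens encoding its bounding box on the page.

```xml
<doctag>
  <section_header><loc_58><loc_44><loc_426><loc_91>1. Introduction</section_header>
  <text><loc_58><loc_110><loc_426><loc_220>Body text of the first paragraph.</text>
</doctag>
```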
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
## Model Summary
- Developed by: Docling Team
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Based on Idefics3 (see technical summary)
## How to get started
You can use transformers or docling to perform inference:
Transformers:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    # flash_attention_2 requires the flash-attn package; fall back to eager on CPU
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs (increase max_new_tokens for dense pages)
generated_ids = model.generate(**inputs, max_new_tokens=500)

# Trim the prompt tokens and keep special tokens so the DocTags markup
# (e.g. <loc_...> location tokens) survives decoding
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
print(doctags)
```