---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: transformers
license: cdla-permissive-2.0
pipeline_tag: image-text-to-text
---
<div style="display: flex; align-items: center;">
<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
<div>
<h3>SmolDocling-256M-preview</h3>
<p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
</div>
</div>
This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).
### 🚀 Features:
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion including all page elements (code, equations, tables, charts etc.)
- 🔲 **OCR with Bounding Boxes** – Performs OCR within a specified bounding box region.
- 📂 **General Document Processing** – Trained for both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 💨 **Fast inference using VLLM** – Averages 0.35 seconds per page on an A100 GPU.
### 🚧 *Coming soon!*
- 📊 **Better chart recognition 🛠️**
- 📚 **One shot multi-page inference ⏱️**
- 🧪 **Chemical Recognition**
- 📙 **Datasets**
## ⌨️ Get started (code examples)
You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):
<details>
<summary>📄 Single page image inference using Transformers 🤖</summary>
```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
"ds4sd/SmolDocling-256M-preview",
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)
# MD
print(doc.export_to_markdown())
```
</details>
<details>
<summary> 🚀 Fast Batch Inference Using VLLM</summary>
```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir
import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path
# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/" # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."
# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=8192)
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>
Assistant:"
image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])
start_time = time.time()
total_tokens = 0
for idx, img_file in enumerate(image_files, 1):
img_path = os.path.join(IMAGE_DIR, img_file)
image = Image.open(img_path).convert("RGB")
llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
output = llm.generate([llm_input], sampling_params=sampling_params)[0]
doctags = output.outputs[0].text
img_fn = os.path.splitext(img_file)[0]
output_filename = img_fn + ".dt"
output_path = os.path.join(OUTPUT_DIR, output_filename)
with open(output_path, "w", encoding="utf-8") as f:
f.write(doctags)
# To convert to Docling Document, MD, HTML, etc.:
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
# output_path_html = Path(OUTPUT_DIR) / f"{img_fn}.html"
# doc.save_as_html(output_path_html)
# MD
output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"
doc.save_as_markdown(output_path_md)
print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>
<details>
<summary> ONNX Inference</summary>
```python
# Prerequisites:
# pip install onnxruntime
# pip install onnxruntime-gpu
from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
import onnxruntime
import numpy as np
import os
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
os.environ["OMP_NUM_THREADS"] = "1"
# cuda
os.environ["ORT_CUDA_USE_MAX_WORKSPACE"] = "1"
# 1. Load models
## Load config and processor
model_id = "ds4sd/SmolDocling-256M-preview"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
## Load sessions
# !wget https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/vision_encoder.onnx
# !wget https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/embed_tokens.onnx
# !wget https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/decoder_model_merged.onnx
# cpu
# vision_session = onnxruntime.InferenceSession("vision_encoder.onnx")
# embed_session = onnxruntime.InferenceSession("embed_tokens.onnx")
# decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx"
# cuda
vision_session = onnxruntime.InferenceSession("vision_encoder.onnx", providers=["CUDAExecutionProvider"])
embed_session = onnxruntime.InferenceSession("embed_tokens.onnx", providers=["CUDAExecutionProvider"])
decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx", providers=["CUDAExecutionProvider"])
## Set config values
num_key_value_heads = config.text_config.num_key_value_heads
head_dim = config.text_config.head_dim
num_hidden_layers = config.text_config.num_hidden_layers
eos_token_id = config.text_config.eos_token_id
image_token_id = config.image_token_id
end_of_utterance_id = processor.tokenizer.convert_tokens_to_ids("<end_of_utterance>")
# 2. Prepare inputs
## Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
## Load image and apply processor
image = load_image("https://ibm.biz/docling-page-with-table")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")
## Prepare decoder inputs
batch_size = inputs['input_ids'].shape[0]
past_key_values = {
f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
for layer in range(num_hidden_layers)
for kv in ('key', 'value')
}
image_features = None
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
position_ids = np.cumsum(inputs['attention_mask'], axis=-1)
# 3. Generation loop
max_new_tokens = 8192
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
if image_features is None:
## Only compute vision features if not already computed
image_features = vision_session.run(
['image_features'], # List of output names or indices
{
'pixel_values': inputs['pixel_values'],
'pixel_attention_mask': inputs['pixel_attention_mask'].astype(np.bool_)
}
)[0]
## Merge text and vision embeddings
inputs_embeds[inputs['input_ids'] == image_token_id] = image_features.reshape(-1, image_features.shape[-1])
logits, *present_key_values = decoder_session.run(None, dict(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
**past_key_values,
))
## Update values for next generation loop
input_ids = logits[:, -1].argmax(-1, keepdims=True)
attention_mask = np.ones_like(input_ids)
position_ids = position_ids[:, -1:] + 1
for j, key in enumerate(past_key_values):
past_key_values[key] = present_key_values[j]
generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
if (input_ids == eos_token_id).all() or (input_ids == end_of_utterance_id).all():
break # Stop predicting
doctags = processor.batch_decode(
generated_tokens,
skip_special_tokens=False,
)[0].lstrip()
print(doctags)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
```
</details>
💻 Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16)
## DocTags
<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/doctags_v2.png" width="800" height="auto" alt="Image description">
DocTags define a clear and structured vocabulary of tags and rules that separate textual content from the document's structure. This reduces ambiguity for Image-to-Sequence models and makes the conversion task easier to learn. Converting directly to formats like HTML or Markdown, by contrast, tends to be lossy: it often drops details, does not represent the document's layout unambiguously, and inflates the token count, making processing less efficient.
DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.
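As an illustration of this CPU-side export step, the sketch below converts a previously generated DocTags string into Markdown and HTML with `docling_core`. It assumes the DocTags were already saved to disk next to their page image (the paths `out/page_1.dt` and `img/page_1.png` are illustrative, e.g. as produced by the VLLM example above):

```python
# Minimal sketch: DocTags -> DoclingDocument -> Markdown/HTML, entirely on CPU.
# Assumption: "out/page_1.dt" and "img/page_1.png" already exist (illustrative paths).
from pathlib import Path
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

doctags = Path("out/page_1.dt").read_text(encoding="utf-8")
image = Image.open("img/page_1.png").convert("RGB")

# Pair the DocTags with the page image and build a DoclingDocument
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export requires no further token generation
print(doc.export_to_markdown())
doc.save_as_html(Path("out/page_1.html"))
```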
## Supported Instructions
<table>
<tr>
<td><b>Description</b></td>
<td><b>Instruction</b></td>
<td><b>Comment</b></td>
</tr>
<tr>
<td><b>Full conversion</b></td>
<td>Convert this page to docling.</td>
<td>DocTags representation</td>
</tr>
<tr>
<td><b>Chart</b></td>
<td>Convert chart to table.</td>
<td>(e.g., &lt;chart&gt;)</td>
</tr>
<tr>
<td><b>Formula</b></td>
<td>Convert formula to LaTeX.</td>
<td>(e.g., &lt;formula&gt;)</td>
</tr>
<tr>
<td><b>Code</b></td>
<td>Convert code to text.</td>
<td>(e.g., &lt;code&gt;)</td>
</tr>
<tr>
<td><b>Table</b></td>
<td>Convert table to OTSL.</td>
<td>(e.g., &lt;otsl&gt;) OTSL: <a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a></td>
</tr>
<tr>
<td rowspan=4><b>Actions and Pipelines</b></td>
<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
<td></td>
</tr>
<tr>
<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
<td></td>
</tr>
<tr>
<td>Find all 'text' elements on the page, retrieve all section headers.</td>
<td></td>
</tr>
<tr>
<td>Detect footer elements on the page.</td>
<td></td>
</tr>
</table>
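Any of the instructions above can be used directly as the prompt text in the code examples from the *Get started* section. As a minimal sketch, the snippet below runs the region-OCR instruction with Transformers; the page URL is taken from the ONNX example, and the `<loc_*>` coordinates are purely illustrative:

```python
# Minimal sketch: same Transformers pipeline as above, different instruction.
# Assumptions: the loc_* coordinates below are illustrative, not tied to this page.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

image = load_image("https://ibm.biz/docling-page-with-table")

# Swap in any instruction from the table above
instruction = "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
)[0].lstrip()
print(doctags)
```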
#### Model Summary
- **Developed by:** Docling Team, IBM Research
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
- **Finetuned from model:** Based on [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)
**Repository:** [Docling](https://github.com/docling-project/docling)
**Paper:** [arXiv](https://arxiv.org/abs/2503.11576)
**Project Page:** [HF中国镜像站](https://huggingface.co/ds4sd/SmolDocling-256M-preview)
**Citation:**
```
@misc{nassar2025smoldoclingultracompactvisionlanguagemodel,
title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
author={Ahmed Nassar and Andres Marafioti and Matteo Omenetti and Maksym Lysak and Nikolaos Livathinos and Christoph Auer and Lucas Morin and Rafael Teixeira de Lima and Yusik Kim and A. Said Gurbuz and Michele Dolfi and Miquel Farré and Peter W. J. Staar},
year={2025},
eprint={2503.11576},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.11576},
}
```
**Demo:** [HF Space](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo)