---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---

<div style="display: flex; align-items: center;">
    <img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
    <div>
        <h3>SmolDocling-256M-preview</h3>
        <p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
    </div>
</div>

### 🚀 Features:  
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.  
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.  
- 📐 **Layout and Localization** – Preserves document structure and element **bounding boxes**.  
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.  
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.  
- 📊 **Chart Recognition** – Extracts and interprets chart data.  
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.  
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.  
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.  
- 📜 **List Grouping** – Organizes and structures list elements correctly.  
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).  
- 🔲 **OCR with Bounding Boxes** – OCRs specific regions given a bounding box.
- 📂 **General Document Processing** – Trained for both scientific and non-scientific documents.  
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 💨 **Fast inference using vLLM** – Averages 0.35 seconds per page on an A100 GPU.

### 🚧 *Coming soon!*
- 📊 **Better chart recognition 🛠️**
- 📚 **One shot multi-page inference ⏱️**
- 🧪 **Chemical Recognition**
- 📙 **Datasets**


## ⌨️ Get started (code examples)

You can use **transformers** or **vllm** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (Markdown, HTML, etc.):

<details>
<summary>📄 Single page image inference using Transformers 🤖</summary>

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the input image
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Print the raw DocTags output
print(doctags)

# Populate a DoclingDocument from the DocTags and the source image
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export to any supported format, e.g.:
# doc.save_as_html(output_file)  # HTML
print(doc.export_to_markdown())  # Markdown
```
</details>


<details>
<summary>🚀 Fast batch inference using vLLM</summary>

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192)

chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()

# Run inference page by page and save DocTags for each image
for img_file in image_files:
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]
    
    doctags = output.outputs[0].text
    img_fn = os.path.splitext(img_file)[0]
    output_filename = img_fn + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to a DoclingDocument and export (Markdown, HTML, etc.)
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)
    # HTML: doc.save_as_html(output_file)
    output_path_md = os.path.join(OUTPUT_DIR, img_fn + ".md")
    doc.save_as_markdown(output_path_md)  # Markdown

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>

## DocTags

<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/doctags_v2.png" width="800" height="auto" alt="Image description">
DocTags define a clear, structured vocabulary of tags and rules that separates textual content from document structure, which reduces ambiguity for image-to-sequence models. Converting directly to formats like HTML or Markdown, by contrast, tends to be lossy: it drops details, obscures the page layout, and inflates the token count, making processing less efficient.
DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.
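
As a concrete illustration of this offloading, DocTags saved during inference can be rebuilt into a `DoclingDocument` and exported later with no model in the loop. A minimal sketch, assuming a DocTags file and its page image were written out by the vLLM example above (the file names are illustrative):

```python
# Minimal sketch: CPU-only export of previously saved DocTags.
# "out/page_0.dt" and "img/page_0.png" are illustrative names matching
# the directories used in the vLLM example above.
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

with open("out/page_0.dt", encoding="utf-8") as f:
    doctags = f.read()
image = Image.open("img/page_0.png").convert("RGB")

# Pair the DocTags with the source page image and rebuild the document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export runs entirely on CPU; no GPU or model weights are required
print(doc.export_to_markdown())
```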

## Supported Instructions

<table>
  <tr>
    <td><b>Description</b></td>
    <td><b>Instruction</b></td>
    <td><b>Comment</b></td>
  </tr>
  <tr>
    <td><b>Full conversion</b></td>
    <td>Convert this page to docling.</td>
    <td>DocTags representation</td>
  </tr>
  <tr>
    <td><b>Chart</b></td>
    <td>Convert chart to table.</td>
    <td>(e.g., &lt;chart&gt;)</td>
  </tr>
  <tr>
    <td><b>Formula</b></td>
    <td>Convert formula to LaTeX.</td>
    <td>(e.g., &lt;formula&gt;)</td>
  </tr>
  <tr>
    <td><b>Code</b></td>
    <td>Convert code to text.</td>
    <td>(e.g., &lt;code&gt;)</td>
  </tr>
  <tr>
    <td><b>Table</b></td>
    <td>Convert table to OTSL.</td>
    <td>(e.g., &lt;otsl&gt;) OTSL: <a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a></td>
  </tr>
  <tr>
    <td rowspan=4><b>Actions and Pipelines</b></td>
    <td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
    <td></td>
  </tr>
  <tr>
    <td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
    <td></td>
  </tr>
  <tr>
    <td>Find all 'text' elements on the page, retrieve all section headers.</td>
    <td></td>
  </tr>
  <tr>
    <td>Detect footer elements on the page.</td>
    <td></td>
  </tr>
</table>
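
Any of these instructions can be used as the prompt in the code examples above. A minimal sketch for the region-OCR action, reusing the message format from the Transformers example (the coordinates are the illustrative ones from the table):

```python
# Minimal sketch: swap the `messages` in the Transformers example above
# for a region-OCR instruction; the <loc_...> coordinates are illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "OCR the text in a specific location: "
                        "<loc_155><loc_233><loc_206><loc_237>",
            },
        ],
    },
]
```

The rest of the pipeline (processor, `generate`, decode) is unchanged.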

#### Model Summary

- **Developed by:** Docling Team, IBM Research
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

**Repository:** [Docling](https://github.com/docling-project/docling)

**Paper:** [arXiv](https://arxiv.org/abs/2503.11576)

**Citation:**
```
@misc{nassar2025smoldoclingultracompactvisionlanguagemodel,
      title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion}, 
      author={Ahmed Nassar and Andres Marafioti and Matteo Omenetti and Maksym Lysak and Nikolaos Livathinos and Christoph Auer and Lucas Morin and Rafael Teixeira de Lima and Yusik Kim and A. Said Gurbuz and Michele Dolfi and Miquel Farré and Peter W. J. Staar},
      year={2025},
      eprint={2503.11576},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11576}, 
}
```
**Demo:** [Coming soon]