---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---
<div style="display: flex; align-items: center;">
<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
<div>
<h3>SmolDocling-256M-preview</h3>
<p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
</div>
</div>
### 🚀 Features:
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 **OCR with Bounding Boxes** – Performs OCR within a specified bounding box region.
- 📂 **General Document Processing** – Trained on both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 📚 **Multi-Page & Full Document Conversion** – Supports up to 3 pages.
## How to get started
You can use transformers or Docling to perform inference; a Docling-based sketch follows the transformers and vLLM examples below:
<details>
<summary>Single image inference using Transformers</summary>
```python
import torch
from docling_core.types.doc import DoclingDocument  # needed for the export steps below
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
"ds4sd/SmolDocling-256M-preview",
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# create a docling document
doc = DoclingDocument(name="Document")
# populate it
doc.load_from_document_tokens([doctags], [image])
# export as any format
# HTML
# print(doc.export_to_html())
# with open(output_file, "w", encoding="utf-8") as f:
# f.write(doc.export_to_html())
# MD
# print(doc.export_to_markdown())
```
</details>
<details>
<summary>Multi-page image inference using Transformers</summary>
```python
import torch
from docling_core.types.doc import DoclingDocument  # needed for the export steps below
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
page_1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
page_2 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
"ds4sd/SmolDocling-256M-preview",
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Convert this document to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page_1, page_2], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# create a docling document
doc = DoclingDocument(name="Document")
# populate it
doc.load_from_document_tokens([doctags], [page_1, page_2])
# export as any format
# HTML
# print(doc.export_to_html())
# with open(output_file, "w", encoding="utf-8") as f:
# f.write(doc.export_to_html())
# MD
# print(doc.export_to_markdown())
```
</details>
<details>
<summary> 🚀 Fast Batch Inference Using vLLM</summary>
```python
# pip install vllm
import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "images_dir"
OUTPUT_DIR = "output_pred_dir"
PROMPT_TEXT = "Convert page to Docling."
# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=8192)
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"
image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])
start_time = time.time()

for img_file in image_files:
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

    # Write the generated DocTags to a .dt file named after the image
    output_filename = os.path.splitext(img_file)[0] + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output.outputs[0].text)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>
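<details>
<summary>Document conversion using Docling</summary>

In recent Docling releases, SmolDocling powers the VLM pipeline. The snippet below is a minimal sketch assuming that pipeline is available in your installed `docling` version; `page.pdf` is a placeholder input path:

```python
# pip install docling
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDF inputs through the VLM pipeline instead of the default pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(),
        )
    }
)

result = converter.convert("page.pdf")  # placeholder path
print(result.document.export_to_markdown())
```
</details>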
## DocTags
<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/doctags_v2.png" width="800" height="auto" alt="Image description">
DocTags define a clear and structured vocabulary of tags and rules that separate textual content from document structure, minimizing ambiguity for image-to-sequence models. In contrast, converting directly to formats like HTML or Markdown tends to be lossy: it drops details, represents the document's layout poorly, and inflates the token count, making processing less efficient.
DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.
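For illustration, DocTags pair each element tag with location tokens (quantized page coordinates) followed by the element's content. The snippet below is a hand-written sketch of the general shape, not verbatim model output:

```
<doctag>
<section_header_level_1><loc_58><loc_44><loc_426><loc_91>SmolDocling</section_header_level_1>
<text><loc_58><loc_110><loc_426><loc_161>A compact vision-language model for document conversion.</text>
</doctag>
```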
## Supported Instructions
| Instruction | Description |
| :---: | :---: |
| Full conversion | Convert this page to docling. |
| Chart | Convert chart to table (e.g., `<chart>`). |
| Formula | Convert formula to LaTeX (e.g., `<formula>`). |
| Code | Convert code to text (e.g., `<code>`). |
| Table | Convert table to OTSL (e.g., `<otsl>`). |
| No-Code Actions/Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` |
| | Find all 'text' elements on the page, retrieve all section headers. |
| | Detect footer elements on the page. |
- More *Coming soon!* 🚧
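To run any instruction from the table, put its text in the chat message. A minimal sketch, reusing `processor`, `model`, `image`, and `DEVICE` from the transformers examples above (only the instruction string changes):

```python
# Minimal sketch: reuses processor, model, image, and DEVICE from the
# transformers examples above; only the instruction text changes.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert formula to LaTeX."},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
```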
#### Model Summary
- **Developed by:** Docling Team
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]