---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---
### SmolDocling-256M-preview
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for **DoclingDocuments**.
### 🚀 Features:
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 **OCR with Bounding Boxes** – Performs OCR within a specified bounding-box region.
- 📂 **General Document Processing** – Trained on both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 📚 **Multi-Page & Full Document Conversion** – *Coming soon!* 🚧
## How to get started
You can perform inference with either Transformers or Docling:
Inference using Docling (a minimal sketch, assuming a recent `docling` release with the VLM pipeline, which runs SmolDocling by default):
```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDF conversion through the VLM pipeline (SmolDocling is its default model)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
    }
)
doc = converter.convert(source="document.pdf").document
print(doc.export_to_markdown())
```
Single image inference using Transformers:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
# Trim the prompt tokens and keep the DocTags markup in the decoded output
generated_texts = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)
print(generated_texts[0])
```
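To convert the generated DocTags into a structured document, a minimal sketch using `docling-core` (assumes the `DocTagsDocument` and `DoclingDocument` helpers from `docling_core`, pairing each DocTags string with its page image):
```python
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Pair each DocTags string with the image it was generated from
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([generated_texts[0]], [image])

# Populate a DoclingDocument and export it, e.g. to Markdown
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
```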
🚀 Fast batch inference using vLLM:
```python
# Prerequisite: pip install vllm
import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "images_dir"
OUTPUT_DIR = "output_pred_dir"
PROMPT_TEXT = "Convert this page to docling."
# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
)
# The prompt must include the <image> placeholder so vLLM knows where to inject the image
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"
image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])
start_time = time.time()
for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")
    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]
    output_text = output.outputs[0].text
    output_filename = os.path.splitext(img_file)[0] + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output_text)
print(f"Total time: {time.time() - start_time:.2f} sec")
```
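Note that the loop above issues one request per image. Since `llm.generate` accepts a list of requests, all pages can also be submitted in a single call and scheduled together, which is where vLLM's batching speedup comes from. A sketch reusing the configuration above (assumes all images fit in memory):
```python
# Submit every page in one batched call instead of one request per image
llm_inputs = [
    {
        "prompt": chat_template,
        "multi_modal_data": {"image": Image.open(os.path.join(IMAGE_DIR, f)).convert("RGB")},
    }
    for f in image_files
]
outputs = llm.generate(llm_inputs, sampling_params=sampling_params)

# Outputs are returned in input order, so they can be zipped with the file names
for img_file, output in zip(image_files, outputs):
    out_path = os.path.join(OUTPUT_DIR, os.path.splitext(img_file)[0] + ".dt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(output.outputs[0].text)
```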
## DocTags
DocTags is the model's output format: a minimal markup that pairs every page element with its location on the page and maps directly onto **DoclingDocuments**.
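A hypothetical fragment of what the model emits, shown here as a Python string (tag names match the instruction table below; the `<loc_*>` tokens encode a bounding box, and the contents and coordinates are illustrative only):
```python
# Illustrative only: a DocTags string as the model might produce it
doctags = (
    "<doctag>"
    "<text><loc_58><loc_44><loc_426><loc_91>An example paragraph.</text>"
    "<formula><loc_120><loc_200><loc_380><loc_240>E = m c ^ { 2 }</formula>"
    "</doctag>"
)
```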
## Supported Instructions
| Instruction | Description |
| :---: | :---: |
| Full conversion | Convert this page to docling. |
| Chart | Convert chart to table (e.g., `<chart>`). |
| Formula | Convert formula to LaTeX (e.g., `<formula>`). |
| Code | Convert code to text (e.g., `<code>`). |
| Table | Convert table to OTSL (e.g., `<otsl>`). |
| No-Code Actions/Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` |
| | Find all 'text' elements on the page, retrieve all section headers. |
| | Detect footer elements on the page. |
- More *Coming soon!* 🚧
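A task-specific instruction only changes the prompt text. A sketch reusing the processor, model, and image from the Transformers example above:
```python
# Ask for formula recognition instead of a full-page conversion
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert formula to LaTeX."},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False)[0])
```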
#### Model Summary
- **Developed by:** Docling Team
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
**Repository:** [More Information Needed]
**Paper [optional]:** [More Information Needed]
**Demo [optional]:** [More Information Needed]