---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---

### SmolDocling-256M-preview


## Table of Contents
1. [Model Summary](#model-summary)
2. [Features](#features)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for **DoclingDocuments**.

### 🚀 Features:  
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.  
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.  
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.  
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.  
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.  
- 📊 **Chart Recognition** – Extracts and interprets chart data.  
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.  
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.  
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.  
- 📜 **List Grouping** – Organizes and structures list elements correctly.  
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).  
- 🔲 **OCR with Bounding Boxes** – Runs OCR on a region specified by a bounding box.  
- 📂 **General Document Processing** – Trained on both scientific and non-scientific documents.  
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 📚 **Multi-Page & Full Document Conversion** – *Coming soon!* 🚧



## How to get started

You can use Transformers, Docling, or vLLM to perform inference:

<details>
<summary>Inference using Docling</summary>

```python

print(generated_texts[0])
```
</details>

<details>
<summary>Single image inference using Transformers</summary>

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
</details>

<details>
<summary> 🚀 Fast Batch Inference Using vLLM</summary>

```python
# Install vLLM first: "pip install vllm" in a shell, or "!pip install vllm" in a notebook cell

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "images_dir"
OUTPUT_DIR = "output_pred_dir"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192)

chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()
total_tokens = 0

for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]
    
    output_text = output.outputs[0].text
    output_filename = os.path.splitext(img_file)[0] + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output_text)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>
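
The loop in the vLLM example submits one image per `generate` call. Since `llm.generate` also accepts a list of prompt dictionaries, grouping several pages into one call may improve throughput. The following is a minimal sketch of that grouping, not a tuned implementation: `build_prompt`, `chunked`, and `build_requests` are hypothetical helper names, and a batch size of 8 in the usage note is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the chat format used by the vLLM example."""
    return f"<|im_start|>User:<image>{instruction}<end_of_utterance>\nAssistant:"


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def build_requests(images: Sequence[object], instruction: str) -> List[dict]:
    """Build one request dict per image, in the shape passed to llm.generate."""
    prompt = build_prompt(instruction)
    return [{"prompt": prompt, "multi_modal_data": {"image": img}} for img in images]
```

Inside the example's loop, this could be used as `for batch in chunked(image_files, 8):`, opening each file in `batch` with `Image.open(...).convert("RGB")`, and then calling `llm.generate(build_requests(images, PROMPT_TEXT), sampling_params=sampling_params)`. The returned list holds one output per request.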

## DocTags

<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/doctags_v2.png" width="800" height="auto" alt="DocTags overview">
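
As a rough illustration of working with DocTags output, the sketch below pulls located elements out of a DocTags string with a regular expression. It assumes each element is written as an opening tag, four `<loc_N>` tags, the content, and a matching closing tag. That layout, the `Element` type, the `parse_elements` helper, and the `sample` string are all assumptions made for illustration, not a specification of the format. For real use, the Docling integration is the safer route.

```python
import re
from typing import List, NamedTuple, Tuple


class Element(NamedTuple):
    tag: str
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in DocTags location units
    content: str


# Assumed layout: <tag><loc_x1><loc_y1><loc_x2><loc_y2>content</tag>
ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<content>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_elements(doctags: str) -> List[Element]:
    """Return every element that matches the assumed layout, in document order."""
    return [
        Element(
            m["tag"],
            (int(m["x1"]), int(m["y1"]), int(m["x2"]), int(m["y2"])),
            m["content"].strip(),
        )
        for m in ELEMENT_RE.finditer(doctags)
    ]


# Hypothetical sample string, not real model output
sample = (
    "<text><loc_10><loc_20><loc_200><loc_40>Hello world</text>\n"
    "<code><loc_10><loc_50><loc_200><loc_90>print('hi')</code>"
)
for element in parse_elements(sample):
    print(element.tag, element.bbox, element.content)
```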



## Supported Instructions  
| Task | Instruction |
| :---: | :---: |
| Full conversion | Convert this page to docling. |
| Chart | Convert chart to table (e.g., `<chart>`). |
| Formula | Convert formula to LaTeX (e.g., `<formula>`). |
| Code | Convert code to text (e.g., `<code>`). |
| Table | Convert table to OTSL (e.g., `<otsl>`). |
| No-Code Actions/Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |
|  | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` |
|  | Find all 'text' elements on the page, retrieve all section headers. |
|  | Detect footer elements on the page. |


- More *Coming soon!* 🚧
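
The location-based instructions above take four `<loc_N>` tags. The sketch below shows one way to build such an instruction from a pixel bounding box. It assumes that location values are page coordinates rescaled to a 0 to 500 grid in the order x1, y1, x2, y2. Both the grid size and the ordering are assumptions that should be checked against the model's documentation, and `to_loc_tags` and `ocr_prompt` are hypothetical helper names.

```python
from typing import Tuple

GRID = 500  # assumption: location values lie on a 0..500 grid


def to_loc_tags(
    bbox_px: Tuple[float, float, float, float],
    page_size: Tuple[int, int],
    grid: int = GRID,
) -> str:
    """Rescale a pixel box (x1, y1, x2, y2) to the grid and format it as loc tags."""
    width, height = page_size
    x1, y1, x2, y2 = bbox_px
    scaled = [x1 * grid / width, y1 * grid / height, x2 * grid / width, y2 * grid / height]
    # Round to integers and keep every value inside the grid
    clamped = [min(grid, max(0, round(v))) for v in scaled]
    return "".join(f"<loc_{v}>" for v in clamped)


def ocr_prompt(bbox_px: Tuple[float, float, float, float], page_size: Tuple[int, int]) -> str:
    """Build the region OCR instruction listed in the table above."""
    return f"OCR the text in a specific location: {to_loc_tags(bbox_px, page_size)}"


print(ocr_prompt((100, 200, 300, 400), (1000, 1000)))
```

The resulting string can replace the text of the user message in the Transformers example, or `PROMPT_TEXT` in the vLLM example.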

## Model Summary

- **Developed by:** Docling Team
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]