library_name: transformers
license: apache-2.0
language:
  - en
base_model:
  - HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text

SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

🚀 Features:

  • 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
  • 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
  • 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
  • 💻 Code Recognition – Detects and formats code blocks, including indentation.
  • 🔢 Formula Recognition – Identifies and processes mathematical expressions.
  • 📊 Chart Recognition – Extracts and interprets chart data.
  • 📑 Table Recognition – Supports column and row headers for structured table extraction.
  • 🖼️ Figure Classification – Differentiates figures and graphical elements.
  • 📝 Caption Correspondence – Links captions to relevant images and figures.
  • 📜 List Grouping – Organizes and structures list elements correctly.
  • 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
  • 🔲 OCR with Bounding Boxes – Performs OCR on a region specified by a bounding box.
  • 📂 General Document Processing – Trained on both scientific and non-scientific documents.
  • 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
  • 📚 Multi-Page & Full Document Conversion – Coming soon! 🚧


Model Summary

  • Developed by: Docling Team
  • Model type: Multi-modal model (image+text)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Based on Idefics3 (see technical summary)

How to get started

You can use transformers or docling to perform inference:

Transformers:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    # flash_attention_2 requires a CUDA device and the flash-attn package
    attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])