---
library_name: transformers
license: apache-2.0
language:
  - en
base_model:
  - HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---

SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

🚀 Features:

  • 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
  • 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
  • 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
  • 💻 Code Recognition – Detects and formats code blocks, including indentation.
  • 🔢 Formula Recognition – Identifies and processes mathematical expressions.
  • 📊 Chart Recognition – Extracts and interprets chart data.
  • 📑 Table Recognition – Supports column and row headers for structured table extraction.
  • 🖼️ Figure Classification – Differentiates figures and graphical elements.
  • 📝 Caption Correspondence – Links captions to relevant images and figures.
  • 📜 List Grouping – Organizes and structures list elements correctly.
  • 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
  • 🔲 OCR with Bounding Boxes – Performs OCR within a specified bounding-box region.
  • 📂 General Document Processing – Trained on both scientific and non-scientific documents.
  • 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
  • 📚 Multi-Page & Full Document Conversion – Coming soon! 🚧

How to get started

You can use transformers or docling to perform inference:

Inference using Docling

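A minimal sketch using Docling's DocumentConverter (the source URL below is illustrative; this runs Docling's standard conversion pipeline, and pointing it at SmolDocling specifically is configured through Docling's pipeline options, not shown here):

# Prerequisite: pip install docling

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # illustrative: a local path or URL also works
converter = DocumentConverter()
result = converter.convert(source)

doc = result.document            # a DoclingDocument
print(doc.export_to_markdown())  # export to one of several supported formats
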
Single image inference using Transformers
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load image
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, dropping the echoed prompt
generated_texts = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)

print(generated_texts[0])
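
The generated text is DocTags markup. As a sketch, assuming the docling-core package and its DocTags loaders (DocTagsDocument.from_doctags_and_image_pairs and DoclingDocument.load_from_doctags are assumed here; see the docling-core documentation), it can be lifted into a DoclingDocument and exported:

# Prerequisite: pip install docling-core  (assumed API)
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Pair the generated DocTags string with the page image it came from
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([generated_texts[0]], [image])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(doc.export_to_markdown())
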
🚀 Fast Batch Inference Using vLLM
!pip install vllm

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "images_dir"
OUTPUT_DIR = "output_pred_dir"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

# Greedy decoding with a generous output budget for full-page DocTags
sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()
total_tokens = 0

for img_file in image_files:
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

    output_text = output.outputs[0].text
    total_tokens += len(output.outputs[0].token_ids)  # track generated tokens

    # One DocTags (.dt) file per input image
    output_filename = os.path.splitext(img_file)[0] + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output_text)

print(f"Total time: {time.time() - start_time:.2f} sec ({total_tokens} tokens generated)")

DocTags

DocTags is a minimal, efficient markup for document content: element tags (text, code, formulas, tables, charts, figures) combined with <loc_*> bounding-box tokens, and it maps directly onto DoclingDocuments.
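
As a purely illustrative sketch (the tag names and coordinates below are schematic examples, not the normative specification), DocTags output interleaves element tags with location tokens:

<doctag>
  <text><loc_58><loc_44><loc_426><loc_91>A paragraph of body text...</text>
  <formula><loc_101><loc_120><loc_355><loc_142>E = mc^2</formula>
</doctag>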

Supported Instructions

Instruction                   Description
Full conversion               Convert this page to docling.
Chart                         Convert chart to table (e.g., <chart>).
Formula                       Convert formula to LaTeX (e.g., <formula>).
Code                          Convert code to text (e.g., <code>).
Table                         Convert table to OTSL (e.g., <otsl>).
No-Code Actions/Pipelines     OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>
                              Identify element at: <loc_247><loc_482><loc_252><loc_486>
                              Find all 'text' elements on the page, retrieve all section headers.
                              Detect footer elements on the page.
  • More coming soon! 🚧
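
These instructions are plain prompt text. For example, reusing processor, model, image, and DEVICE from the Transformers example above, region OCR looks like the following (the coordinates are the illustrative ones from the table):

# Region OCR: same setup as the Transformers example, different instruction text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])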

Model Summary

  • Developed by: Docling Team
  • Model type: Multi-modal model (image+text)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Based on Idefics3 (see technical summary)

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]