	---
	library_name: transformers
	license: apache-2.0
	language:
	- en
	base_model:
	- HuggingFaceTB/SmolVLM-256M-Instruct
	pipeline_tag: image-text-to-text
	---

	<div style="display: flex; align-items: center;">
	<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
	<div>
	<h3>SmolDocling-256M-preview</h3>
	<p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
	</div>
	</div>

	### 🚀 Features:
	- 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal document representation that is fully compatible with DoclingDocuments.
	- 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
	- 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
	- 💻 Code Recognition – Detects and formats code blocks, including indentation.
	- 🔢 Formula Recognition – Identifies and processes mathematical expressions.
	- 📊 Chart Recognition – Extracts and interprets chart data.
	- 📑 Table Recognition – Supports column and row headers for structured table extraction.
	- 🖼️ Figure Classification – Differentiates figures and graphical elements.
	- 📝 Caption Correspondence – Links captions to relevant images and figures.
	- 📜 List Grouping – Organizes and structures list elements correctly.
	- 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
	- 🔲 OCR with Bounding Boxes – Runs OCR on a region specified by a bounding box.
	- 📂 General Document Processing – Trained for both scientific and non-scientific documents.
	- 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
	- 📚 Multi-Page & Full Document Conversion – Coming Soon.
	- 🧪 Chemical Recognition – Coming Soon.

	### 🚧 Coming soon!
	- 📊 Better chart recognition 🛠️
	- 📚 One shot multi-page inference ⏱️

	## How to get started

	You can use Transformers, vLLM, or Docling to perform inference:

	<details>
	<summary>📄 Single page image inference using Transformers 🤖</summary>

	```python
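	# Prerequisites:
	# pip install torch
	# pip install transformers
	# pip install docling_core
	# pip install flash-attn  # needed only on CUDA, where flash_attention_2 is selected below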

	import torch
	from docling_core.types.doc import DoclingDocument
	from docling_core.types.doc.document import DocTagsDocument
	from transformers import AutoProcessor, AutoModelForVision2Seq
	from transformers.image_utils import load_image

	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

	# Load images
	image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

	# Initialize processor and model
	processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
	model = AutoModelForVision2Seq.from_pretrained(
	    "ds4sd/SmolDocling-256M-preview",
	    torch_dtype=torch.bfloat16,
	    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
	).to(DEVICE)

	# Create input messages
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "Convert this page to docling."},
	        ],
	    },
	]

	# Prepare inputs
	prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
	inputs = processor(text=prompt, images=[image], return_tensors="pt")
	inputs = inputs.to(DEVICE)

	# Generate outputs
	generated_ids = model.generate(**inputs, max_new_tokens=8192)
	prompt_length = inputs.input_ids.shape[1]
	trimmed_generated_ids = generated_ids[:, prompt_length:]
	doctags = processor.batch_decode(
	    trimmed_generated_ids,
	    skip_special_tokens=False,
	)[0].lstrip()

	# Populate document
	doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
	# Create a Docling document
	doc = DoclingDocument(name="Document")
	doc.load_from_doctags(doctags_doc)

	# Export to any supported format
	# HTML:
	# print(doc.export_to_html())
	# with open(output_file, "w", encoding="utf-8") as f:
	#     f.write(doc.export_to_html())
	# Markdown:
	print(doc.export_to_markdown())
	```
	</details>


	<details>
	<summary> 🚀 Fast Batch Inference Using vLLM</summary>

	```python
	# Prerequisites:
	# pip install vllm

	import time
	import os
	from vllm import LLM, SamplingParams
	from PIL import Image

	# Configuration
	MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
	IMAGE_DIR = "images_dir"
	OUTPUT_DIR = "output_pred_dir"
	PROMPT_TEXT = "Convert this page to docling."

	# Ensure output directory exists
	os.makedirs(OUTPUT_DIR, exist_ok=True)

	# Initialize LLM
	llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

	sampling_params = SamplingParams(
	    temperature=0.0,
	    max_tokens=8192,
	)

	chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

	image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

	start_time = time.time()
	total_tokens = 0

	for idx, img_file in enumerate(image_files, 1):
	    img_path = os.path.join(IMAGE_DIR, img_file)
	    image = Image.open(img_path).convert("RGB")

	    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
	    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

	    # Save the raw DocTags output next to the image name, with a .dt extension
	    output_text = output.outputs[0].text
	    output_filename = os.path.splitext(img_file)[0] + ".dt"
	    output_path = os.path.join(OUTPUT_DIR, output_filename)

	    with open(output_path, "w", encoding="utf-8") as f:
	        f.write(output_text)

	print(f"Total time: {time.time() - start_time:.2f} sec")
	```
	</details>
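
The vLLM script above writes one `.dt` file per image, named after the image file. A minimal sketch of how those outputs could be matched back to their source images before conversion, using only the standard library; the helper name `pair_images_with_doctags` is hypothetical and not part of vLLM or Docling:

```python
from pathlib import Path

# Same image extensions that the vLLM script filters on
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def pair_images_with_doctags(image_dir, output_dir):
    """Return (image_path, doctags_text) pairs for images that have a .dt file.

    Hypothetical helper: assumes each output is named <image stem>.dt,
    as produced by the batch script above.
    """
    pairs = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        dt_path = Path(output_dir) / (image_path.stem + ".dt")
        if dt_path.is_file():
            pairs.append((image_path, dt_path.read_text(encoding="utf-8")))
    return pairs
```

Each pair can then be opened with PIL and passed to `DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])`, as in the Transformers example.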

	## DocTags

	<img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/doctags_v2.png" width="800" height="auto" alt="Image description">
	DocTags define a clear, structured system of tags and rules that separates text content from document structure. This gives Image-to-Sequence models a less ambiguous target to generate. By contrast, converting directly to formats such as HTML or Markdown can be messy: it often loses details, does not clearly represent the document's layout, and increases the number of tokens, which makes processing less efficient.
	DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, which reduces token generation overhead and improves efficiency.
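
To make the separation of text and structure concrete, here is a rough sketch that pulls tagged elements and their location tokens out of a DocTags string with a regular expression. It is not the Docling parser (use `DocTagsDocument` from `docling_core` for real conversion). It assumes elements have the shape `<tag><loc_x1><loc_y1><loc_x2><loc_y2>content</tag>` and that location values lie on a 0 to 500 grid; both are assumptions, and the sample string in the usage note is made up for illustration.

```python
import re

# Assumed element shape: <tag><loc_x1><loc_y1><loc_x2><loc_y2>content</tag>
ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<content>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_doctags_elements(doctags):
    """Return a list of {"tag", "bbox", "text"} dicts for located elements."""
    elements = []
    for match in ELEMENT_RE.finditer(doctags):
        bbox = tuple(int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        elements.append(
            {
                "tag": match.group("tag"),
                "bbox": bbox,
                "text": match.group("content").strip(),
            }
        )
    return elements


def to_pixels(bbox, width, height, grid=500):
    """Scale a grid-based bbox to pixel coordinates (grid size is an assumption)."""
    x1, y1, x2, y2 = bbox
    return (x1 * width / grid, y1 * height / grid, x2 * width / grid, y2 * height / grid)
```

For example, `parse_doctags_elements("<doctag><text><loc_50><loc_60><loc_450><loc_100>Hello world</text></doctag>")` yields a single `text` element with bounding box `(50, 60, 450, 100)`.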

	## Supported Instructions

	<table>
	<tr>
	<td><b>Description</b></td>
	<td><b>Instruction</b></td>
	</tr>
	<tr>
	<td>Full conversion</td>
	<td>Convert this page to docling.</td>
	</tr>
	<tr>
	<td>Chart</td>
	<td>Convert chart to table (e.g., &lt;chart&gt;).</td>
	</tr>
	<tr>
	<td>Formula</td>
	<td>Convert formula to LaTeX (e.g., &lt;formula&gt;).</td>
	</tr>
	<tr>
	<td>Code</td>
	<td>Convert code to text (e.g., &lt;code&gt;).</td>
	</tr>
	<tr>
	<td>Table</td>
	<td>Convert table to OTSL (e.g., &lt;otsl&gt;). OTSL: <a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a></td>
	</tr>
	<tr>
	<td>No-Code Actions/Pipelines</td>
	<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
	</tr>
	<tr>
	<td></td>
	<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
	</tr>
	<tr>
	<td></td>
	<td>Find all 'text' elements on the page, retrieve all section headers.</td>
	</tr>
	<tr>
	<td></td>
	<td>Detect footer elements on the page.</td>
	</tr>
	</table>
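
The instructions in the table can be assembled programmatically. The sketch below builds prompt strings for both inference paths shown earlier: the chat `messages` structure used with Transformers and the raw prompt string used with vLLM. All helper names are hypothetical, and it assumes the parenthetical tags in the table (such as the chart tag) name the output element and are not part of the prompt itself.

```python
# Instruction strings taken from the table above; the short keys are hypothetical.
INSTRUCTIONS = {
    "full": "Convert this page to docling.",
    "chart": "Convert chart to table.",
    "formula": "Convert formula to LaTeX.",
    "code": "Convert code to text.",
    "table": "Convert table to OTSL.",
}


def loc_tokens(bbox):
    """Render four integer coordinates as consecutive <loc_N> tokens."""
    return "".join(f"<loc_{value}>" for value in bbox)


def ocr_at(bbox):
    """Instruction for OCR restricted to one bounding box."""
    return f"OCR the text in a specific location: {loc_tokens(bbox)}"


def identify_at(bbox):
    """Instruction asking which element sits at a bounding box."""
    return f"Identify element at: {loc_tokens(bbox)}"


def chat_messages(instruction):
    """Message list in the shape passed to processor.apply_chat_template."""
    return [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": instruction}],
        }
    ]


def vllm_prompt(instruction):
    """Raw prompt string in the shape used by the vLLM example."""
    return f"<|im_start|>User:<image>{instruction}<end_of_utterance>\nAssistant:"
```

For instance, `ocr_at((155, 233, 206, 237))` reproduces the OCR instruction from the table, and `vllm_prompt(INSTRUCTIONS["full"])` gives the full-page conversion prompt for vLLM.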

	#### Model Summary

	- Developed by: Docling Team
	- Model type: Multi-modal model (image+text)
	- Language(s) (NLP): English
	- License: Apache 2.0
	- Finetuned from model: Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

	- Repository: [Docling](https://github.com/docling-project/docling)
	- Paper: Coming soon
	- Demo: Coming soon