---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
pipeline_tag: image-text-to-text
---

### SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for **DoclingDocuments**.

### 🚀 Features:

- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 **OCR with Bounding Boxes** – Performs OCR within a region specified by a bounding box.
- 📂 **General Document Processing** – Trained on both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 📚 **Multi-Page & Full Document Conversion** – *Coming soon!* 🚧

## How to get started

You can use transformers or docling to perform inference:
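The examples below assume `docling` and `transformers` are installed; if they are not, a notebook-style install (matching the vLLM snippet further down) looks like this:

```python
# Notebook-style installation of the inference dependencies
!pip install docling
!pip install transformers
```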
### Inference using Docling
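Below is a minimal sketch of the Docling side using Docling's standard `DocumentConverter` API; the input path is only a placeholder, and configuring Docling to run SmolDocling as its VLM backend depends on your installed Docling version (see the Docling documentation).

```python
# Minimal Docling conversion sketch (assumes: pip install docling).
# The input path/URL below is a placeholder.
from docling.document_converter import DocumentConverter

source = "path/or/url/to/document.pdf"

converter = DocumentConverter()
result = converter.convert(source)

# The result holds a DoclingDocument, which can be exported to several formats.
print(result.document.export_to_markdown())
```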
### Single image inference using Transformers

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
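The decoded string above still contains the chat prompt text. A common transformers pattern is to decode only the newly generated tokens; the sketch below (variable names are illustrative) continues from the snippet above, trims the prompt, and saves the DocTags output to a file. Any prompt from the Supported Instructions table further down can be substituted for "Convert this page to docling."

```python
# Decode only the tokens generated after the prompt, so the output
# contains just the model's DocTags (continues from the snippet above).
prompt_length = inputs.input_ids.shape[1]
trimmed_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0]

# Save the DocTags output for later import into Docling.
with open("page.dt", "w", encoding="utf-8") as f:
    f.write(doctags)
```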
### 🚀 Fast Batch Inference Using vLLM

```python
!pip install vllm

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "images_dir"
OUTPUT_DIR = "output_pred_dir"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
)

# SmolVLM-style prompt; <image> marks where the image is inserted
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()
total_tokens = 0

for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

    output_text = output.outputs[0].text
    output_filename = os.path.splitext(img_file)[0] + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output_text)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
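The loop above issues one `llm.generate` call per image; vLLM also accepts a list of requests in a single call and batches them internally. A minimal sketch of that variant, reusing the `chat_template`, `sampling_params`, and directory variables defined above:

```python
# Build one request per image, then let vLLM schedule them as a batch.
batch_inputs = []
for img_file in image_files:
    image = Image.open(os.path.join(IMAGE_DIR, img_file)).convert("RGB")
    batch_inputs.append({"prompt": chat_template, "multi_modal_data": {"image": image}})

outputs = llm.generate(batch_inputs, sampling_params=sampling_params)

# Results come back in the same order as the inputs.
for img_file, output in zip(image_files, outputs):
    out_path = os.path.join(OUTPUT_DIR, os.path.splitext(img_file)[0] + ".dt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(output.outputs[0].text)
```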
## DocTags

DocTags is a minimal markup for documents: page elements are wrapped in tags such as `<code>`, `<formula>`, `<chart>`, and `<otsl>`, together with `<loc_...>` bounding-box tokens, and the result maps directly onto a **DoclingDocument**.

## Supported Instructions

| Instruction | Prompt |
| :---: | :---: |
| Full conversion | Convert this page to docling. |
| Chart | Convert chart to table (e.g., `<chart>`). |
| Formula | Convert formula to LaTeX (e.g., `<formula>`). |
| Code | Convert code to text (e.g., `<code>`). |
| Table | Convert table to OTSL (e.g., `<otsl>`). |
| No-Code Actions/Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` |
| | Find all 'text' elements on the page, retrieve all section headers. |
| | Detect footer elements on the page. |

- More *Coming soon!* 🚧

#### Model Summary

- **Developed by:** Docling Team
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

**Repository:** [More Information Needed]

**Paper [optional]:** [More Information Needed]

**Demo [optional]:** [More Information Needed]