Docling: An open-source toolkit for advanced document processing

Thursday, May 29, 2025, 11:00 AM, from InfoWorld

The rapid advancements in generative AI technologies have made it essential to create tools that can effectively manage different document formats. While large language models (LLMs) excel at understanding and generating text, they struggle with parsing complex documents like PDFs, presentations, and spreadsheets.

Docling, an open-source toolkit developed by IBM Research, addresses this challenge by providing advanced document conversion capabilities specifically designed for AI applications.

As organizations increasingly seek to leverage their internal documents for AI training and retrieval-augmented generation (RAG), the importance of high-quality document processing has become paramount. Docling offers a comprehensive solution that transforms unstructured documents into formats readily consumable by modern AI systems.

Project overview – Docling

Docling is an open-source Python package for document conversion that can parse multiple formats into a unified, richly structured representation. Initially developed by IBM’s AI for Knowledge team at IBM Research Zurich, Docling was open-sourced in July 2024 and has since gained remarkable traction in the developer community.

The project has experienced explosive growth, gathering more than 30,000 stars on GitHub to date. It was identified as the top trending repository worldwide in November 2024. This rapid adoption reflects the critical need for advanced document processing tools in the AI ecosystem.

Docling is now hosted as a project in the LF AI & Data Foundation, ensuring its continued development as part of the open-source community. Red Hat has also embraced Docling, with plans to include it as a supported feature in upcoming Red Hat Enterprise Linux AI releases.

Key features of Docling include:

Parsing of multiple document formats including PDF, DOCX, XLSX, HTML, and images.

Advanced PDF understanding capabilities for page layout, reading order, table structure, and formulas.

A unified document representation format (DoclingDocument).

Export options including Markdown, HTML, and lossless JSON.

Local execution capabilities for sensitive data and air-gapped environments.

Plug-and-play integrations with LangChain, LlamaIndex, Crew AI, and Haystack.

What problem does Docling solve?

Integrating documents into AI workflows presents significant challenges. Traditional document processing tools often fail to capture the rich structure of documents, resulting in lost context and diminished utility for AI applications. OCR-based approaches can be error-prone and computationally expensive, especially for complex layouts and tables.

Specific pain points addressed by Docling include:

Converting diverse document formats into AI-ready representations.

Preserving document structure, including tables, formulas, and reading order.

Processing documents locally to maintain data privacy.

Reducing computational requirements compared to traditional OCR approaches.

As noted by Peter Staar, an IBM researcher who helped build Docling, “Avoiding OCR reduces errors, and it also speeds up the time-to-solution by 30 times.” This efficiency gain makes Docling particularly valuable for large-scale document processing tasks.

A closer look at Docling

At the core of Docling is the DoclingDocument representation format, defined as a Pydantic data type that can express common document features including text, tables, pictures, document hierarchy, layout information, and provenance details.
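
To make the idea of a single structured representation concrete, here is a minimal sketch built on standard-library dataclasses. It is not the DoclingDocument schema: the class names (SketchDocument, TextItem, TableItem, Provenance) and their fields are hypothetical stand-ins, and the real type is a Pydantic model with a much richer structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Provenance:
    """Where an item came from (hypothetical, simplified)."""
    page: int
    bbox: Tuple[float, float, float, float]  # left, top, right, bottom


@dataclass
class TextItem:
    text: str
    level: int = 0  # 0 = body text, 1 and up = heading depth
    prov: Optional[Provenance] = None


@dataclass
class TableItem:
    rows: List[List[str]]  # the first row is treated as the header
    prov: Optional[Provenance] = None


@dataclass
class SketchDocument:
    name: str
    items: List[Union[TextItem, TableItem]] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the items in reading order as Markdown."""
        blocks = []
        for item in self.items:
            if isinstance(item, TableItem):
                if not item.rows:
                    continue  # nothing to render for an empty table
                header, *body = item.rows
                lines = ["| " + " | ".join(header) + " |",
                         "|" + "|".join(" --- " for _ in header) + "|"]
                lines += ["| " + " | ".join(row) + " |" for row in body]
                blocks.append("\n".join(lines))
            elif item.level > 0:
                blocks.append("#" * item.level + " " + item.text)
            else:
                blocks.append(item.text)
        return "\n\n".join(blocks)


doc = SketchDocument(
    name="report",
    items=[
        TextItem("Results", level=2, prov=Provenance(page=1, bbox=(50, 80, 300, 100))),
        TextItem("Accuracy by model."),
        TableItem(rows=[["Model", "Score"], ["A", "0.9"]]),
    ],
)
print(doc.to_markdown())
```

Keeping text, tables, hierarchy, and provenance in one object is what allows several export formats to be produced from the same parsed result.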

Docling leverages two state-of-the-art AI models developed by IBM Research:

Layout Analysis Model: A model based on RT-DETR and trained on DocLayNet (a human-annotated data set for document layout analysis) that classifies page elements like paragraphs, section titles, lists, and tables.

TableFormer: A vision-transformer model for table structure recovery that can handle complex tables with partial or no borderlines, empty cells, cell spans, and hierarchical headers.

The Docling processing pipeline works by feeding page images to the Layout Analysis Model, which identifies document elements. For tables, TableFormer processes the detected table regions to recover their structure. When needed, OCR capabilities are available through integration with EasyOCR.
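
The control flow described above can be sketched with placeholder functions. Everything here is a stand-in: detect_layout, recover_table, and run_ocr are hypothetical stubs rather than the interfaces of Docling or its models, and a page is represented as a plain dictionary instead of an image.

```python
from typing import Dict, List


def detect_layout(page: Dict) -> List[Dict]:
    """Stub for the layout model: return the page's labeled regions."""
    return page["regions"]


def recover_table(region: Dict) -> List[List[str]]:
    """Stub for the table model: split pre-delimited cells into rows."""
    return [line.split(";") for line in region["raw"].split("\n")]


def run_ocr(region: Dict) -> str:
    """Stub for OCR: consulted only when a region has no embedded text."""
    return region.get("pixels_as_text", "")


def process_page(page: Dict) -> List[Dict]:
    """Route each detected region to the right handler, in reading order."""
    elements = []
    for region in detect_layout(page):
        if region["label"] == "table":
            elements.append({"label": "table", "rows": recover_table(region)})
            continue
        text = region.get("text")
        if not text:  # fall back to OCR only when needed
            text = run_ocr(region)
        elements.append({"label": region["label"], "text": text})
    return elements


page = {
    "regions": [
        {"label": "section_title", "text": "Results"},
        {"label": "table", "raw": "Model;Score\nA;0.9"},
        {"label": "paragraph", "text": None, "pixels_as_text": "Scanned text"},
    ]
}
for element in process_page(page):
    print(element)
```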

Using Docling is straightforward:

from docling.document_converter import DocumentConverter

source = 'https://arxiv.org/pdf/2408.09869'  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # prints Markdown beginning with '## Docling Technical Report'
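
The same two calls extend naturally to a folder of local files. The sketch below assumes only what the snippet above shows: a converter with a convert() method whose result exposes document.export_to_markdown(). The helper name convert_to_markdown_files and the one-file-per-document output layout are illustrative, and the converter is passed in as an argument so the function can be tried with a stand-in object.

```python
from pathlib import Path
from typing import Iterable, List


def convert_to_markdown_files(converter, sources: Iterable[str], out_dir: str) -> List[Path]:
    """Convert each source and write one .md file per document.

    `converter` is any object with a Docling-style `convert(source)` method.
    Failures are reported and skipped so one bad file does not stop the batch.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sources:
        try:
            markdown = converter.convert(source).document.export_to_markdown()
        except Exception as exc:  # deliberately broad: this is a batch sketch
            print(f"skipped {source}: {exc}")
            continue
        # Name the output after the input file, e.g. report.pdf -> report.md
        out_path = target / (Path(source).stem + ".md")
        out_path.write_text(markdown, encoding="utf-8")
        written.append(out_path)
    return written
```

With Docling installed, this might be called as convert_to_markdown_files(DocumentConverter(), ['a.pdf', 'b.docx'], 'out').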

Docling also provides a convenient command-line interface for quick conversions:

docling https://arxiv.org/pdf/2206.01062

Key use cases for Docling

Docling’s capabilities make it ideal for several critical use cases including retrieval-augmented generation, knowledge base creation, LLM fine-tuning, and enterprise data integration.

RAG systems

Docling enables efficient document processing for retrieval-augmented generation applications, as demonstrated in numerous integration examples with vector databases like Milvus.

from openai import OpenAI
from docling.document_converter import DocumentConverter

# Convert the document with Docling
converter = DocumentConverter()
result = converter.convert('document.pdf')
markdown_content = result.document.export_to_markdown()

# Pass the converted content to the model as context
openai_client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = openai_client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {'role': 'system', 'content': 'Use the provided context to answer questions.'},
        {'role': 'user', 'content': f'Context: {markdown_content}\n\nQuestion: What is the main topic?'},
    ],
)
print(response.choices[0].message.content)
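
Passing a whole document as context works for short files, but longer ones usually need to be split and filtered first. The following is a minimal sketch of that step, assuming the Markdown export uses '#'-style headings. The helpers split_by_heading and top_chunks are hypothetical, and the word-overlap score is a crude stand-in for the embedding search that a vector database such as Milvus would provide.

```python
import re
from typing import List


def split_by_heading(markdown: str) -> List[str]:
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def top_chunks(chunks: List[str], question: str, k: int = 2) -> List[str]:
    """Rank chunks by how many distinct question words they contain (toy scoring)."""
    words = set(re.findall(r"\w+", question.lower()))

    def score(chunk: str) -> int:
        return len(words & set(re.findall(r"\w+", chunk.lower())))

    return sorted(chunks, key=score, reverse=True)[:k]


# Stand-in for the output of export_to_markdown()
markdown_content = (
    "## Introduction\nDocling converts documents.\n\n"
    "## Tables\nTableFormer recovers table structure.\n\n"
    "## License\nThe code is open source."
)
chunks = split_by_heading(markdown_content)
context = "\n\n".join(top_chunks(chunks, "How is table structure recovered?", k=1))
print(context)
```

Only the selected chunks would then be placed in the prompt, which keeps the request within the model's context window.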

Knowledge base creation

Organizations can use Docling to convert internal documents into standardized formats for knowledge repositories and search systems.

LLM fine-tuning

Docling processes large document collections into consistent formats suitable for fine-tuning custom LLMs with domain-specific knowledge.
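
One common way to hand converted text to a training pipeline is a JSON Lines file with one record per passage. The sketch below is an assumption about how that might look, not a format Docling prescribes: the record fields (source, text), the paragraph-level split, and the min_chars threshold are all illustrative choices, and a real fine-tuning framework will have its own schema.

```python
import json
from typing import Dict, Iterator


def to_jsonl_lines(docs: Dict[str, str], min_chars: int = 20) -> Iterator[str]:
    """Yield one JSON line per non-trivial paragraph of each converted document."""
    for source, markdown in docs.items():
        for paragraph in markdown.split("\n\n"):
            text = paragraph.strip()
            if len(text) < min_chars:  # drop bare headings and empty fragments
                continue
            yield json.dumps({"source": source, "text": text}, ensure_ascii=False)


# Stand-in for Markdown produced by converting a document collection
converted = {
    "policy.pdf": "## Leave\n\nEmployees accrue leave monthly.\n\nOK",
}
for line in to_jsonl_lines(converted):
    print(line)
```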

Enterprise data integration

Companies can unlock data from proprietary documents for generative AI applications, as highlighted by Red Hat’s integration of Docling into Red Hat Enterprise Linux AI.

Bottom line – Docling

Docling represents a significant advancement in document processing for AI applications. By combining state-of-the-art models for layout analysis and table structure recognition with a unified document representation format, it bridges the gap between raw documents and AI-ready content.

The project’s rapid adoption, both in the open-source community and by major companies like Red Hat, underscores its value in the AI ecosystem. As organizations continue to seek ways to leverage their document repositories for generative AI, Docling promises to play an important role as the “missing document processing companion for generative AI.”

With its permissive MIT license, Docling enables wide adoption and collaboration while addressing the critical need for high-quality document processing in the AI workflow. For developers working at the intersection of documents and generative AI, Docling offers a powerful, efficient, and accessible solution to one of the field’s most persistent challenges.