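Document Processing in Python with the Docling Library

Updated: 2025-02-17 10:30:23 Author: 正東AI

Docling is a powerful third-party Python library focused on document processing and conversion. This article looks closely at what Docling can do and shows how it helps process documents efficiently.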

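I. Background

In day-to-day development, document processing is a persistent headache. Whether the input is technical documentation, design documents, or files in various formats (PDF, DOCX, PPTX, and so on), parsing, converting, and extracting information efficiently often takes a great deal of effort. Docling offers an elegant answer to this problem. It supports many mainstream document formats and can parse PDFs in depth, extracting complex information such as page layout and table structure. More importantly, Docling provides a unified document representation and a convenient CLI, which makes document processing simple and efficient.

The sections below walk through Docling's main features, with working code examples that show how it helps process documents efficiently.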

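II. What Is Docling

Docling is a third-party Python library focused on document processing and conversion. It supports many document formats, including PDF, DOCX, PPTX, HTML, and images. Its core capability is deep PDF parsing: it recognizes page layout, reading order, and table structure, and it supports OCR for scanned documents. Docling also provides a unified document representation (DoclingDocument) that simplifies downstream processing.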

III. Installing Docling

As a third-party library, Docling is easy to install. A single pip command is enough:

pip install docling

If you want the CPU-only build of PyTorch, use the following command instead:

pip install docling --extra-index-url https://download.pytorch.org/whl/cpu

Once the installation finishes, Docling is ready to use.
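Before running the examples, it can help to confirm that the package is visible to the interpreter you are actually using. The following is a minimal sketch that relies only on the standard library; the helper name `installed_version` is an illustrative assumption and is not part of Docling.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("docling")
    if version is None:
        print("docling is not installed in this environment")
    else:
        print(f"docling {version} is installed")
```

Running the script inside the virtual environment or conda environment you plan to use shows whether `pip install docling` targeted the right interpreter.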

IV. Using the Library's Functions

The following are five commonly used Docling calls and how to use them:

1. DocumentConverter.convert()

Converts a document. The source can be a local path or a URL.

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document path or URL
converter = DocumentConverter()
result = converter.convert(source)

source: the document's path or URL.

converter.convert(): converts the document into Docling's internal representation.

2. export_to_markdown()

Exports the document as Markdown.

markdown_content = result.document.export_to_markdown()
print(markdown_content)

export_to_markdown(): returns the document content as a Markdown string.

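3. export_to_dict()

Exports the document in a JSON-compatible form. In Docling 2.x the document object exposes export_to_dict(); pass its result to json.dumps() to obtain a JSON string.

import json

json_content = json.dumps(result.document.export_to_dict(), ensure_ascii=False)
print(json_content)

export_to_dict(): returns the document content as a dictionary that can be serialized to JSON.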

4. HierarchicalChunker.chunk()

Splits the document into chunks and returns each chunk's text together with its metadata.

from docling_core.transforms.chunker import HierarchicalChunker

chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])

HierarchicalChunker(): creates the chunker.

chunk(result.document): splits the document into chunks.
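Chunks are usually passed on to an index or a database, so it is often convenient to flatten them into plain dictionaries first. The sketch below assumes each chunk exposes a `text` attribute and a `meta` object with an optional `headings` list; those attribute names are an assumption here and should be checked against the installed docling-core version. The helper `chunks_to_records` and the stand-in chunk are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def chunks_to_records(chunks: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten chunk objects into plain dicts holding their text and headings."""
    records = []
    for index, chunk in enumerate(chunks):
        meta = getattr(chunk, "meta", None)
        headings = getattr(meta, "headings", None) or []  # tolerate missing metadata
        records.append({"index": index, "text": chunk.text, "headings": list(headings)})
    return records


# Stand-in chunk so the sketch runs without Docling installed.
demo_chunks = [
    SimpleNamespace(text="Docling parses PDFs.", meta=SimpleNamespace(headings=["Intro"])),
]

if __name__ == "__main__":
    for record in chunks_to_records(demo_chunks):
        print(record)
```

Plain dictionaries like these are easy to serialize or to hand to an embedding step later.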

5. PdfPipelineOptions

Customizes the PDF conversion options.

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False

PdfPipelineOptions: holds the custom PDF conversion options.

do_table_structure: whether to parse table structure.

do_cell_matching: whether to map the predicted table cells back to the cells in the PDF.

V. Usage Scenarios

Below are five practical scenarios with code examples:

Scenario 1: Convert a PDF to Markdown

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)

convert(): converts the PDF into Docling's internal representation.

export_to_markdown(): exports the document as Markdown.
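Printing is fine for a quick look, but the Markdown is usually written to disk. A minimal sketch of that step, assuming the Markdown is already available as a string; the helper name `save_markdown` and the output path are illustrative and not part of Docling.

```python
from pathlib import Path


def save_markdown(markdown_text: str, out_path: str) -> Path:
    """Write Markdown text to `out_path` as UTF-8, creating parent folders as needed."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown_text, encoding="utf-8")
    return path
```

In the scenario above this would be called as `save_markdown(markdown_content, "output/paper.md")`. Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the document contains non-ASCII text.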

Scenario 2: Limit the Document Size

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)

max_num_pages: the maximum number of pages allowed in the document.

max_file_size: the maximum file size allowed, in bytes (20971520 bytes is 20 MiB).
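The value 20971520 is simply 20 × 1024 × 1024 bytes. The short sketch below makes that arithmetic explicit and adds an optional pre-check for local files; both helper names are hypothetical and exist only for illustration.

```python
import os

BYTES_PER_MIB = 1024 * 1024


def mib_to_bytes(mebibytes: int) -> int:
    """Convert a size in MiB to bytes."""
    return mebibytes * BYTES_PER_MIB


def within_size_limit(path: str, max_bytes: int) -> bool:
    """Return True if the local file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes


if __name__ == "__main__":
    print(mib_to_bytes(20))  # 20971520, the limit used above
```

Deriving the limit from a readable unit keeps the call site self-explanatory, for example `max_file_size=mib_to_bytes(20)`.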

Scenario 3: Customize the PDF Conversion Options

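from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False
# Docling 2.x takes per-format options, so the pipeline options are wrapped in PdfFormatOption
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/your/document.pdf")

PdfPipelineOptions: holds the custom PDF conversion options.

PdfFormatOption: attaches those options to the PDF input format.

do_table_structure and do_cell_matching: control how table structure is parsed.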

Scenario 4: Chunk a Document

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2206.01062")
chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])

HierarchicalChunker.chunk(): splits the document into chunks.

Each chunk carries its text and metadata, which makes downstream processing easier.

Scenario 5: Process a Scanned PDF with OCR

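from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()
# Docling 2.x takes per-format options, so the pipeline options are wrapped in PdfFormatOption
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/scanned_document.pdf")

PdfPipelineOptions and TesseractOcrOptions: configure the OCR options.

do_ocr: enables OCR.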

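VI. Common Problems and Solutions

Below are three problems commonly encountered when using Docling, with solutions:

Problem 1: TensorFlow-related warning

Message (an informational log line rather than a failure):

This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Solution: install a TensorFlow build suited to your CPU, for example in a fresh conda environment.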

conda create --name py11 python==3.11
conda activate py11
conda install tensorflow

Problem 2: Tesseract OCR installation issues

Symptom: Tesseract OCR is not installed or is misconfigured.

Solution: install Tesseract OCR and export TESSDATA_PREFIX so that child processes can see it.

# macOS
brew install tesseract leptonica pkg-config
export TESSDATA_PREFIX=/opt/homebrew/share/tessdata/

# Linux
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
export TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
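After installing, a quick check from Python shows whether the interpreter can actually see the Tesseract binary and the TESSDATA_PREFIX variable. This is a small diagnostic sketch using only the standard library; the function name `tesseract_environment` is an illustrative assumption.

```python
import os
import shutil
from typing import Dict, Optional


def tesseract_environment() -> Dict[str, Optional[str]]:
    """Report the tesseract binary location and the TESSDATA_PREFIX value, if any."""
    return {
        "tesseract_binary": shutil.which("tesseract"),
        "TESSDATA_PREFIX": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    for name, value in tesseract_environment().items():
        print(f"{name}: {value or 'not found'}")
```

If TESSDATA_PREFIX shows as not found here even though it was set in a terminal, the variable was most likely assigned without `export` or in a different shell session.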

Problem 3: Tesserocr fails to install

Symptom: installing Tesserocr fails.

Solution: uninstall Tesserocr and reinstall it from source.

pip uninstall tesserocr
pip install --no-binary :all: tesserocr

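VII. Summary

Docling is a capable document processing library that supports many document formats and deep parsing. It provides a unified document representation and a range of export options that cover many development needs. Installation and usage are simple, so developers can integrate document processing into their own projects with little effort. For technical document processing and AI application development alike, Docling is a dependable choice.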
