
Batch PDF Recognition with PaddleOCR in Python

Updated: 2024-03-03 09:55:40  Author: Python斗羅

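PaddleOCR can recognize text in images, documents, tables, and other scenarios. This article shows how to use PaddleOCR from Python to recognize PDF files in batches, and may be a useful reference for interested readers.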

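PaddleOCR is an OCR (Optical Character Recognition) system built on the PaddlePaddle deep learning framework. It can recognize text in images, documents, tables, and other scenarios, and is characterized by high speed, high accuracy, and a high degree of customizability. In practice, PaddleOCR can be used to recognize PDF files in batches.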

Note: the method described in this article applies only to single-column text, without tables or other complex layouts.

The steps for batch PDF recognition with PaddleOCR are as follows:

1. Install PaddleOCR
First install PaddleOCR by following the official documentation.
The documented procedure is straightforward: https://github.com/PaddlePaddle/PaddleOCR
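
Before running the scripts in this article, it can help to confirm that the modules they import are actually installed. The following is a minimal sketch, not part of the original article: the module list is an assumption based on the imports used in the code here (PyMuPDF is imported as fitz, opencv-python as cv2, and PaddlePaddle as paddle), and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names (not pip package names) used by the scripts in this article.
# Assumed mapping: PyMuPDF -> fitz, opencv-python -> cv2, paddlepaddle -> paddle.
REQUIRED_MODULES = ["fitz", "cv2", "paddle", "paddleocr"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks the modules up and does not import them, so it runs quickly even though loading PaddleOCR itself is slow.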

2. Prepare the PDF file
Prepare the PDF files to be recognized; a scanned PDF is suitable for testing.

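3. Recognize with PaddleOCR
Recognizing a PDF with PaddleOCR works by first converting the PDF into images, then recognizing each image, and finally concatenating the results.
The process is therefore split into two functions.

They are as follows: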

import datetime
import os
import fitz  # PyMuPDF: pip install PyMuPDF

def pdf2png(pdfPath, baseImagePath):
    # Store the page images in a sub-folder named after the PDF (without extension)
    imagePath = os.path.join(baseImagePath, os.path.splitext(os.path.basename(pdfPath))[0])
    startTime_pdf2img = datetime.datetime.now()  # start time
    print("imagePath=" + imagePath)
    os.makedirs(imagePath, exist_ok=True)
    pdfDoc = fitz.open(pdfPath)
    totalPage = pdfDoc.page_count  # pageCount is the deprecated old name
    # Render at 2x zoom without rotation so small text stays legible for OCR
    rotate = 0
    zoom_x = 2
    zoom_y = 2
    mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
    for pg in range(totalPage):
        page = pdfDoc[pg]
        pix = page.get_pixmap(matrix=mat, alpha=False)
        print(f'Saving page {pg+1} of {totalPage} from {pdfPath}')
        pix.save(os.path.join(imagePath, f'images_{pg+1}.png'))
    pdfDoc.close()
    endTime_pdf2img = datetime.datetime.now()
    print(f'{pdfPath}-pdf2img-elapsed={(endTime_pdf2img - startTime_pdf2img).seconds}s')

if __name__ == "__main__":
    pdfPath = r'./demo-scan.pdf'
    baseImagePath = './imgs'
    pdf2png(pdfPath, baseImagePath)

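import os
import cv2
from paddleocr import PPStructure, save_structure_res
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Chinese model; recovery=True enables layout recovery
table_engine = PPStructure(recovery=True, lang='ch')

image_path = './imgs/demo-scan'
save_folder = './txt'

def page_index(img_name):
    # 'images_12.png' -> 12, so that page 10 does not sort before page 2
    return int(os.path.splitext(img_name)[0].split('_')[-1])

def img2docx(img_path):
    text = []
    doc_name = os.path.basename(img_path)
    os.makedirs(save_folder, exist_ok=True)
    # os.listdir returns names in arbitrary order, so sort by page number
    imgs = sorted((n for n in os.listdir(img_path) if n.endswith('.png')), key=page_index)
    for img_name in imgs:
        print(os.path.join(img_path, img_name))
        img = cv2.imread(os.path.join(img_path, img_name))
        if img is None:  # skip files that OpenCV cannot read
            continue
        result = table_engine(img)

        # Use a per-page name; passing the same name for every page
        # would reuse the same output files
        page_name = f'{doc_name}_{os.path.splitext(img_name)[0]}'
        save_structure_res(result, save_folder, page_name)

        h, w, _ = img.shape
        res = sorted_layout_boxes(result, w)
        convert_info_docx(img, res, save_folder, page_name)

        for line in res:
            line.pop('img')
            print(line)
            for pra in line['res']:
                text.append(pra['text'])
            text.append('\n')
    # Write the merged text once, after all pages have been processed
    with open(os.path.join(save_folder, 'res.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(text))

img2docx(image_path)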

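The code above reads the specified PDF file, converts it into a set of page images, recognizes each image with PaddleOCR, and finally saves the recognition results to a text file in the specified folder.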
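
The pdf2png function handles one PDF per call. To process a whole folder of PDFs, one option is a small driver loop like the sketch below. It is not part of the original article: the helper names find_pdfs and batch_convert are hypothetical, and the convert argument is assumed to be a function with the same signature as pdf2png.

```python
from pathlib import Path


def find_pdfs(pdf_dir):
    """Return all PDF files directly inside pdf_dir, sorted by name."""
    return sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def batch_convert(pdf_dir, base_image_path, convert):
    """Call convert(pdf_path, base_image_path) for every PDF in pdf_dir.

    `convert` is assumed to behave like pdf2png. Failures are collected
    instead of aborting the whole batch.
    """
    done, failed = [], []
    for pdf in find_pdfs(pdf_dir):
        try:
            convert(str(pdf), base_image_path)
            done.append(pdf.name)
        except Exception as exc:  # keep going and report at the end
            failed.append((pdf.name, str(exc)))
    return done, failed
```

With pdf2png defined as above, a call might look like done, failed = batch_convert('./pdfs', './imgs', pdf2png), where './pdfs' is a placeholder folder name. Collecting failures rather than raising lets one damaged PDF be reported without stopping the rest of the batch.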

Note that a PDF usually contains multiple pages, so each page must be recognized separately and the per-page results merged into the complete text.
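
Merging only gives the right text if the pages are joined in numeric order; a plain alphabetical sort places images_10.png before images_2.png. The sketch below illustrates one way to order and join per-page text. The helper names page_number and merge_pages are hypothetical, and the file names are assumed to end in a page number, as in the images_N.png pattern used in this article.

```python
import re


def page_number(file_name):
    """Extract the page number from a name such as 'images_12.png'."""
    match = re.search(r"(\d+)(?=\.[^.]+$)", file_name)
    if match is None:
        raise ValueError(f"no page number in {file_name!r}")
    return int(match.group(1))


def merge_pages(page_texts):
    """Join per-page text in numeric page order.

    page_texts maps an image file name to the text recognized on that page.
    """
    ordered = sorted(page_texts, key=page_number)
    return "\n\n".join(page_texts[name] for name in ordered)
```

Keeping the per-page text in a dictionary until the end also makes it easy to re-run recognition for a single page and merge again.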

In addition, PaddleOCR may misrecognize some characters, so the results should be checked and corrected.
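
One simple way to focus manual checking is to separate lines the model was unsure about. The sketch below works on plain dictionaries and makes an assumption that should be verified against the installed PaddleOCR version: that each recognized line carries a 'confidence' value alongside its 'text'. The function name split_by_confidence and the 0.9 threshold are illustrative choices, not values from the original article.

```python
def split_by_confidence(lines, threshold=0.9):
    """Split recognized lines into (accepted, needs_review).

    Each item is assumed to be a dict with 'text' and 'confidence' keys.
    Items without a confidence value are sent to review.
    """
    accepted, needs_review = [], []
    for item in lines:
        score = item.get("confidence")
        if score is not None and score >= threshold:
            accepted.append(item["text"])
        else:
            needs_review.append(item["text"])
    return accepted, needs_review
```

For example, passing the entries collected from line['res'] would return the text that can be kept as-is and the text worth comparing against the page image by hand.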
