
Batch PDF recognition in Python with a local Ollama large model

Updated: 2025-03-10 09:15:07  Author: 月野難潯丶

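Many tools on the market can recognize PDF files, but with the rise of AI and locally deployed large models, a local model has become a convenient way to do the job. This article looks at how to call a local Ollama model from Python to process PDFs.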

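Plenty of tools on the market recognize and convert PDF files, and some workflows also need summaries, keywords, and similar post-processing. With the rise of AI and locally deployed large models, these tasks have become much easier. Below is the approach I use.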

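The goal of this article is to process PDF documents automatically: extract and clean the text, have a large model generate a summary and keywords, then collect the results in an Excel file for later analysis and review.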

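Artificial intelligence (AI) is technology that simulates human intelligence. It is already widely used and is a key area for future development. Its applications are diverse, including machine learning and data analysis, natural language processing, computer vision, automation, and robotics. Future trends include deep learning and neural networks, reinforcement learning, multimodal fusion, and general-purpose AI. Overall, AI applications will keep expanding and bring further innovation across fields. (Filler~~~)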

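First we need two libraries, PyPDF2 and ollama. (Deploy a local model through Ollama beforehand: qwen2:14b or any other model. The deployment steps are well documented elsewhere and are not repeated here.) The complete script further down also uses pandas, which needs openpyxl to write .xlsx files. Run the following commands in a terminal.

pip install PyPDF2
pip install ollama
pip install pandas openpyxl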

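PyPDF2 is a pure-Python library for merging and splitting PDF files and for extracting their text and metadata. It has no dependency on PDFMiner, and its development now continues under the name pypdf. The ollama library is the Python client for a locally running Ollama server: it sends chat and generation requests to the model and returns the responses. It is not a general machine-learning toolkit. Combining the two gives the workflow shown here. Without further ado, here is the code.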

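Before batch processing, we need to know how to handle a single PDF. The function below does that, with exception handling. Text extracted from some PDFs contained garbled characters, possibly because the files embed images, so the text is cleaned and only printable ASCII characters are kept. Note that this also removes all non-ASCII text, such as Chinese. The cleaned text caused no problems when submitted to the model; I did not test uncleaned text.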

import re

import PyPDF2

def clean_text(text):
    text = re.sub(r'[^\x20-\x7E]+', '', text)  # keep printable ASCII characters only
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

def process_pdf(pdf_path, output_path):
    try:
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            with open(output_path, "w", encoding='utf-8') as output_file:
                for page in reader.pages:
                    text = page.extract_text()
                    if text:  # check that text was actually extracted
                        clean_text_result = clean_text(text)  # clean the text
                        output_file.write(clean_text_result + "\n")  # write to file
                    else:
                        output_file.write("No valid text extracted\n")
    except FileNotFoundError:
        print(f"File not found: {pdf_path}")
        return False
    except PyPDF2.errors.PdfReadError:
        print(f"Cannot read PDF file: {pdf_path}")
        return False
    except Exception as e:
        print(f"Error while processing PDF file: {pdf_path}, details: {e}")
        return False
    return True
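Because the ASCII-only filter above discards every Chinese character, PDFs written in Chinese would reach the model nearly empty. The sketch below is one possible alternative, not what the article uses: a hypothetical `clean_text_unicode` that drops only control and format characters (Unicode categories starting with "C") and keeps all other printable text.

```python
import re
import unicodedata

def clean_text_unicode(text):
    """Drop control/format characters but keep printable non-ASCII text."""
    kept = []
    for ch in text:
        if ch.isspace():
            # Normalize every kind of whitespace (tabs, newlines, ...) to a space.
            kept.append(" ")
        elif unicodedata.category(ch)[0] != "C":
            # Categories Cc, Cf, Cs, Co, Cn are control, format, or unassigned.
            kept.append(ch)
    # Collapse runs of spaces, as the original clean_text does.
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

Whether this is preferable depends on the documents: for English-only PDFs the original filter is simpler and stricter.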

Next comes a custom timeout exception class. During later testing, some PDFs stalled at this step and the script hung indefinitely. Adding a timeout lets the rest of the batch continue.

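import threading

# Exception raised when processing exceeds the time limit
class TimeoutException(Exception):
    pass

# Thread subclass with timeout support
class TimeoutThread(threading.Thread):
    """
    Thread whose join() raises on timeout and re-raises the target's exception.
    """
    def __init__(self, target, args=(), kwargs=None):
        # daemon=True so a stuck worker cannot keep the program from exiting
        super().__init__(daemon=True)
        self.target = target
        self.args = args
        # avoid a shared mutable default argument
        self.kwargs = kwargs if kwargs is not None else {}
        self.result = None
        self.exception = None

    def run(self):
        try:
            self.result = self.target(*self.args, **self.kwargs)
        except Exception as e:
            self.exception = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.is_alive():
            # The worker keeps running in the background; it cannot be killed
            raise TimeoutException("Processing timed out")
        if self.exception:
            raise self.exception
        return self.result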
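The same idea can be expressed with the standard library's `concurrent.futures`, which already re-raises worker exceptions and supports a timeout on `result()`. This is an alternative sketch rather than the article's approach, and the helper name `run_with_timeout` is hypothetical. As with the thread class above, a timed-out worker is not killed; it simply stops being waited for.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func in a worker thread; raise FutureTimeout if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # result() returns func's value or re-raises the exception it raised.
        return future.result(timeout=timeout)
    finally:
        # Do not block on a stuck worker; the thread itself cannot be cancelled.
        executor.shutdown(wait=False)
```

A call such as `run_with_timeout(process_model, 30)` would then replace the start/join pair, with `FutureTimeout` caught in place of `TimeoutException`.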

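This part processes every PDF in a given folder. It reads the article text from the TXT file produced for each PDF, submits it to the local model (I use qwen2.5:14b here, which works well overall), and saves the results to Excel. The string splitting is there because the reply from qwen2.5 also needs cleaning up.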

import glob
import os

import ollama

def process_folder(folder_path, output_folder, excel_path):
    """
    Process every PDF in the given folder and save the results to an Excel file.
    """
    os.makedirs(output_folder, exist_ok=True)

    pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))
    results = []
    total_files = len(pdf_files)
    processed_files = 0
    errors = []
    unprocessed_files = []

    for pdf_file in pdf_files:
        # Title is the file name without its extension
        title = os.path.splitext(os.path.basename(pdf_file))[0]
        output_path = os.path.join(output_folder, title + ".txt")
        success = process_pdf(pdf_file, output_path)

        if not success:
            errors.append(pdf_file)
            processed_files += 1  # failed files still count as attempted
            continue

        with open(output_path, "r", encoding='utf-8') as file:
            content = file.read()

        try:
            # Run the model call in a thread so it can time out.
            # title and content are passed as arguments so a late-finishing
            # thread cannot pick up values from a later loop iteration.
            def process_model(title, content):
                # Prompt: "summarize into an abstract (摘要) and keywords (關鍵詞)"
                res = ollama.chat(model='qwen2.5:14b', stream=False, messages=[{"role": "user", "content": f"{content}總結成摘要和關鍵詞"}], options={"temperature": 0})
                reply = res['message']['content']
                # The reply is expected to contain "### 摘要" and "### 關鍵詞" sections
                summary = reply.split('### 摘要\n\n')[1].split('\n\n### 關鍵詞')[0]
                keywords = reply.split('### 關鍵詞\n\n')[1].split('\n- ')[1:]
                keywords = '、'.join(keywords)
                results.append({"File name": title, "Summary": summary, "Keywords": keywords})
                print(res)

            timeout_thread = TimeoutThread(target=process_model, args=(title, content))
            timeout_thread.start()
            timeout_thread.join(timeout=30)

        except TimeoutException:
            print(f"Model call timed out: {pdf_file}")
            errors.append(pdf_file)
        except Exception as e:
            print(f"Error during model call: {pdf_file}, details: {e}")
            errors.append(pdf_file)

        processed_files += 1
        print(f"Progress: {processed_files}/{total_files} files processed")

        # Save the Excel file after every processed file
        write_to_excel(results, excel_path)

    # Record files that were never reached
    unprocessed_files = pdf_files[processed_files:]

    return results, errors, unprocessed_files

The raw reply returned by the model is shown in the figure, so it needs further processing.
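The chained `split` calls in `process_model` raise `IndexError` whenever the reply deviates from the exact heading and blank-line layout. A more tolerant parser might look like the sketch below. It assumes, based only on the strings used in the splitting code, that the reply has Markdown headings named 摘要 (summary) and 關鍵詞 or 关键词 (keywords) followed by a bullet list; the function name `parse_summary_and_keywords` is hypothetical.

```python
import re

def parse_summary_and_keywords(reply):
    """Extract the summary text and a keyword list from a Markdown-style reply.

    Returns (summary, keywords); a missing section yields "" or [].
    """
    # Each section runs from its heading to the next heading or the end.
    section = r"#+\s*(?:{name})\s*\n+(.*?)(?=\n#+\s|\Z)"
    summary_match = re.search(section.format(name="摘要"), reply, re.S)
    keywords_match = re.search(section.format(name="關鍵詞|关键词"), reply, re.S)

    summary = summary_match.group(1).strip() if summary_match else ""
    keywords = []
    if keywords_match:
        for line in keywords_match.group(1).splitlines():
            # Strip list markers such as "- " or "* " from each line.
            item = line.strip().lstrip("-*• ").strip()
            if item:
                keywords.append(item)
    return summary, keywords
```

The keyword list can then be joined with `'、'.join(keywords)` as in the original code, and an empty result can be logged as an error instead of crashing the thread.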

Finally, we write the extracted keywords, the summary, and the corresponding PDF title to Excel.

import pandas as pd

def write_to_excel(results, excel_path):
    # One row per PDF; writing .xlsx requires openpyxl to be installed
    df = pd.DataFrame(results)
    df.to_excel(excel_path, index=False)

Finally, add the main function. The complete code is as follows:

import PyPDF2
import re
import ollama
import os
import glob
import pandas as pd
import threading

# Remove special whitespace and invalid characters
def clean_text(text):
    # Remove everything except printable ASCII characters
    text = re.sub(r'[^\x20-\x7E]+', '', text)
    # Collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()
 
# Process a single PDF file
def process_pdf(pdf_path, output_path):
    """
    Process a single PDF file: extract its text and write it to output_path.
    """
    try:
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            with open(output_path, "w", encoding='utf-8') as output_file:
                for page in reader.pages:
                    text = page.extract_text()
                    if text:  # check that text was actually extracted
                        clean_text_result = clean_text(text)  # clean the text
                        output_file.write(clean_text_result + "\n")  # write to file
                    else:
                        output_file.write("No valid text extracted\n")
    except FileNotFoundError:
        print(f"File not found: {pdf_path}")
        return False
    except PyPDF2.errors.PdfReadError:
        print(f"Cannot read PDF file: {pdf_path}")
        return False
    except Exception as e:
        print(f"Error while processing PDF file: {pdf_path}, details: {e}")
        return False
    return True
 
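# Exception raised when processing exceeds the time limit
class TimeoutException(Exception):
    pass

# Thread subclass with timeout support
class TimeoutThread(threading.Thread):
    """
    Thread whose join() raises on timeout and re-raises the target's exception.
    """
    def __init__(self, target, args=(), kwargs=None):
        # daemon=True so a stuck worker cannot keep the program from exiting
        super().__init__(daemon=True)
        self.target = target
        self.args = args
        # avoid a shared mutable default argument
        self.kwargs = kwargs if kwargs is not None else {}
        self.result = None
        self.exception = None

    def run(self):
        try:
            self.result = self.target(*self.args, **self.kwargs)
        except Exception as e:
            self.exception = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.is_alive():
            # The worker keeps running in the background; it cannot be killed
            raise TimeoutException("Processing timed out")
        if self.exception:
            raise self.exception
        return self.result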
 
# Process every PDF file in a folder
def process_folder(folder_path, output_folder, excel_path):
    """
    Process every PDF in the given folder and save the results to an Excel file.
    """
    os.makedirs(output_folder, exist_ok=True)

    pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))
    results = []
    total_files = len(pdf_files)
    processed_files = 0
    errors = []
    unprocessed_files = []

    for pdf_file in pdf_files:
        # Title is the file name without its extension
        title = os.path.splitext(os.path.basename(pdf_file))[0]
        output_path = os.path.join(output_folder, title + ".txt")
        success = process_pdf(pdf_file, output_path)

        if not success:
            errors.append(pdf_file)
            processed_files += 1  # failed files still count as attempted
            continue

        with open(output_path, "r", encoding='utf-8') as file:
            content = file.read()

        try:
            # Run the model call in a thread so it can time out.
            # title and content are passed as arguments so a late-finishing
            # thread cannot pick up values from a later loop iteration.
            def process_model(title, content):
                # Prompt: "summarize into an abstract (摘要) and keywords (關鍵詞)"
                res = ollama.chat(model='qwen2.5:14b', stream=False, messages=[{"role": "user", "content": f"{content}總結成摘要和關鍵詞"}], options={"temperature": 0})
                reply = res['message']['content']
                # The reply is expected to contain "### 摘要" and "### 關鍵詞" sections
                summary = reply.split('### 摘要\n\n')[1].split('\n\n### 關鍵詞')[0]
                keywords = reply.split('### 關鍵詞\n\n')[1].split('\n- ')[1:]
                keywords = '、'.join(keywords)
                results.append({"File name": title, "Summary": summary, "Keywords": keywords})
                print(res)

            timeout_thread = TimeoutThread(target=process_model, args=(title, content))
            timeout_thread.start()
            timeout_thread.join(timeout=30)

        except TimeoutException:
            print(f"Model call timed out: {pdf_file}")
            errors.append(pdf_file)
        except Exception as e:
            print(f"Error during model call: {pdf_file}, details: {e}")
            errors.append(pdf_file)

        processed_files += 1
        print(f"Progress: {processed_files}/{total_files} files processed")

        # Save the Excel file after every processed file
        write_to_excel(results, excel_path)

    # Record files that were never reached
    unprocessed_files = pdf_files[processed_files:]

    return results, errors, unprocessed_files
 
# Write the results to an Excel file
def write_to_excel(results, excel_path):
    """
    Write the processing results to the given Excel file.
    """
    df = pd.DataFrame(results)
    df.to_excel(excel_path, index=False)
 
# Main program
if __name__ == "__main__":
    folder_path = input("PDF folder path: ")  # folder containing the PDFs
    output_folder = input("TXT output folder: ")  # folder for the extracted text
    excel_folder = input("Excel output folder: ")  # folder for results.xlsx
    excel_path = os.path.join(excel_folder, "results.xlsx")

    results, errors, unprocessed_files = process_folder(folder_path, output_folder, excel_path)
    print(f"All PDF files have been processed; results saved to {excel_path}")
    if errors:
        print("The following PDF files failed:")
        for error in errors:
            print(error)
    if unprocessed_files:
        print("The following PDF files were not processed:")
        for unprocessed in unprocessed_files:
            print(unprocessed)
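For unattended batch runs, reading the three paths from the command line may be more convenient than answering `input()` prompts. The sketch below shows one way to do that with `argparse`; the function name `parse_args` and the argument names are hypothetical and not part of the article's script.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the three folder paths and derive the Excel file path."""
    parser = argparse.ArgumentParser(
        description="Batch-summarize PDFs with a local Ollama model."
    )
    parser.add_argument("pdf_folder", help="folder containing the PDF files")
    parser.add_argument("txt_folder", help="folder for the extracted TXT files")
    parser.add_argument("excel_folder", help="folder in which results.xlsx is written")
    args = parser.parse_args(argv)
    # os.path.join picks the right separator on every platform.
    args.excel_path = os.path.join(args.excel_folder, "results.xlsx")
    return args
```

The main block could then call `process_folder(args.pdf_folder, args.txt_folder, args.excel_path)` after `args = parse_args()`.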

The console output and the resulting Excel sheet are attached.