Splitting and Efficiently Processing Large PDF Files with Python: A Detailed Guide

Updated: 2025-03-10 09:45:38 Author: 夢想畫家

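Summary: when working with large PDF files, it often helps to break them into smaller, more manageable chunks. This article shows how to split a PDF with Python and the Unstructured.io library.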

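When working with large PDF files, it is often useful to break them into smaller, more manageable chunks. This process is called partitioning. It improves processing efficiency and makes the document easier to analyze or manipulate. In this article, we discuss how to use Python and the Unstructured.io library to divide a PDF file into smaller parts.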

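We will use two Python libraries for this task:

PyPDF2: a library that can read, write, merge, and split PDF files.

Unstructured.io: a library that can segment PDF documents using document image analysis models.

Here is the Python code that performs the task: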

import os

from PyPDF2 import PdfReader, PdfWriter
from unstructured.partition.pdf import partition_pdf

# Directory that contains this script
base_dir = os.path.abspath(os.path.dirname(__file__))

# Create the output directory if it doesn't exist
# os.makedirs(os.path.join(base_dir, 'output'), exist_ok=True)

# pdf_file = os.path.join(base_dir, 'sample01.pdf')
input_path = os.path.join(base_dir, 'sample02.pdf')

# Read the original PDF
input_pdf = PdfReader(input_path)
total_pages = len(input_pdf.pages)

batch_size = 2
# Ceiling division: covers leftover pages without creating an empty final batch
num_batches = (total_pages + batch_size - 1) // batch_size

# Batch files are written next to the script: output-batch1.pdf, output-batch2.pdf, ...
output_prefix = os.path.join(base_dir, 'output')

# Extract batches of `batch_size` pages from the PDF
for b in range(num_batches):
    writer = PdfWriter()

    # Start (inclusive) and end (exclusive) page indexes for this batch
    start_page = b * batch_size
    end_page = min((b + 1) * batch_size, total_pages)

    # Add the pages in this batch to the writer
    for i in range(start_page, end_page):
        writer.add_page(input_pdf.pages[i])

    # Save the batch to a separate PDF file
    batch_filename = f'{output_prefix}-batch{b + 1}.pdf'
    with open(batch_filename, 'wb') as output_file:
        writer.write(output_file)

    # Analyze the batch with the `partition_pdf` function from Unstructured.io
    elements = partition_pdf(filename=batch_filename)
    print(elements)
    # Do something with `elements`...

# This will process without issue
# Extract table data (the file name is the one used in the original example)
elements = partition_pdf("copy-protected.pdf", strategy="hi_res")

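Step 1: Read the PDF file

First, we import the necessary classes from the PyPDF2 library: PdfReader and PdfWriter. The PdfReader class reads the original PDF file, which in this example is sample02.pdf, stored in the same directory as the script.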

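Step 2: Partition the PDF

We decide on the batch size, that is, the number of pages each chunk of the PDF will contain. In this example the batch size is 2, but you can adjust it to suit your needs.

The number of batches is then the total number of pages in the PDF divided by the batch size, rounded up so that any leftover pages are captured when the page count is not a multiple of the batch size. Note that simply adding 1 to the integer quotient produces one extra, empty batch whenever the page count is an exact multiple of the batch size; ceiling division avoids this.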
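The batch arithmetic can be checked on its own, without opening any PDF. The following is a minimal sketch; the helper name `batch_ranges` is an assumption made for illustration and does not come from the original script.

```python
def batch_ranges(total_pages, batch_size):
    """Return half-open (start, end) page-index ranges that cover every page."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division: rounds up without producing an empty final batch
    num_batches = (total_pages + batch_size - 1) // batch_size
    return [
        (b * batch_size, min((b + 1) * batch_size, total_pages))
        for b in range(num_batches)
    ]


print(batch_ranges(5, 2))  # [(0, 2), (2, 4), (4, 5)]
print(batch_ranges(4, 2))  # [(0, 2), (2, 4)]
```

With 4 pages and a batch size of 2, the result has exactly two ranges, whereas `4 // 2 + 1` would have asked for a third, empty batch.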

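Step 3: Write the PDF chunks

Next, we loop over the batches and create a new PdfWriter object for each one. For each batch we compute the start and end page numbers and use the add_page method to add every page in that range to the PdfWriter.

Once all pages of a batch have been added, we write them to a new PDF file. Each chunk's file name consists of the 'output' prefix and the batch number, for example output-batch1.pdf, and the files are saved in the same directory as the script.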
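If you would rather keep the chunks in a separate directory and carry the source file's name into each chunk name, a small naming helper keeps the paths predictable. This is only a sketch: `batch_output_path` and its parameters are hypothetical names, and the directory itself would still need to be created (for instance with `os.makedirs(..., exist_ok=True)`).

```python
from pathlib import Path


def batch_output_path(source_pdf, output_dir, batch_number):
    """Build a path such as output/sample02-batch1.pdf for sample02.pdf, batch 1."""
    source = Path(source_pdf)
    return Path(output_dir) / f"{source.stem}-batch{batch_number}.pdf"


print(batch_output_path("sample02.pdf", "output", 1).as_posix())  # output/sample02-batch1.pdf
```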

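Step 4: Analyze the PDF chunks

After splitting the PDF into smaller chunks, you can use the partition_pdf function from the Unstructured.io library to analyze each batch. The function uses a document image analysis model to segment the PDF document and returns a list of the elements found on the pages of the parsed PDF.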
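One simple thing to do with the returned list is to count the elements by type and collect their text. The sketch below assumes each element exposes `category` and `text` attributes, as Unstructured's element objects do. The helper name `summarize_elements` is hypothetical, and the demo uses stand-in objects rather than real `partition_pdf` output so that it runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements by category and collect the non-empty text of each."""
    counts = Counter()
    texts = []
    for el in elements:
        # Fall back to the class name if an element has no `category` attribute
        counts[getattr(el, "category", type(el).__name__)] += 1
        text = getattr(el, "text", "") or ""
        if text.strip():
            texts.append(text.strip())
    return counts, texts


# Stand-in objects for illustration only; real input would come from partition_pdf
fake_elements = [
    SimpleNamespace(category="Title", text="Introduction"),
    SimpleNamespace(category="NarrativeText", text="  First paragraph. "),
    SimpleNamespace(category="NarrativeText", text=""),
]
counts, texts = summarize_elements(fake_elements)
print(counts["NarrativeText"], texts)  # 2 ['Introduction', 'First paragraph.']
```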

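Summary

Splitting a large PDF file into smaller chunks makes it easier to process and more fault-tolerant, because a failure affects only one chunk rather than the whole document, and each processing step consumes less memory.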
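To benefit from that fault tolerance, each chunk can be analyzed inside its own try/except so that one bad chunk does not stop the run. The following is a sketch under stated assumptions: `process_batches` is a hypothetical helper, and `analyze` stands for any callable that takes a file path, such as a small wrapper around `partition_pdf`.

```python
def process_batches(batch_paths, analyze):
    """Run analyze(path) on each batch, collecting results and failures separately."""
    results = {}
    failures = {}
    for batch_path in batch_paths:
        try:
            results[batch_path] = analyze(batch_path)
        except Exception as exc:  # keep going so one bad batch doesn't end the run
            failures[batch_path] = exc
    return results, failures


# Stand-in analyzer for illustration: fails on one specific file name
def fake_analyze(path):
    if path == "output-batch2.pdf":
        raise ValueError("could not parse")
    return f"parsed {path}"


results, failures = process_batches(
    ["output-batch1.pdf", "output-batch2.pdf", "output-batch3.pdf"], fake_analyze
)
print(sorted(results), sorted(failures))
```

The failed chunks can then be retried or inspected individually instead of reprocessing the whole document.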

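Additional methods

Below are some other ways to split PDFs with Python, for readers who are interested.

Method 1: Batch-splitting a PDF file

Now let's write a script that splits a PDF file in batches. Suppose we have a large PDF file that needs to be cut into smaller files of 5 pages each.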

from PyPDF2 import PdfReader, PdfWriter

# Note: the older PdfFileReader/PdfFileWriter, numPages, getPage and addPage
# names were removed in PyPDF2 3.0, so the current API is used here.

def split_pdf(input_pdf, output_prefix, pages_per_file=5):
    """Split input_pdf into files of at most pages_per_file pages each."""
    with open(input_pdf, 'rb') as file:
        pdf_reader = PdfReader(file)
        num_pages = len(pdf_reader.pages)

        for i in range(0, num_pages, pages_per_file):
            pdf_writer = PdfWriter()
            output_pdf = f'{output_prefix}_{i // pages_per_file + 1}.pdf'

            # Copy this group of pages into the new writer
            for j in range(i, min(i + pages_per_file, num_pages)):
                pdf_writer.add_page(pdf_reader.pages[j])

            with open(output_pdf, 'wb') as new_file:
                pdf_writer.write(new_file)

            print(f'Created file: {output_pdf}')

# Example call
split_pdf('large_file.pdf', 'output_split')

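Method 2: Batch-splitting multiple PDFs

# Outline only: get_pdf_files, get_split_rule, get_output_directory,
# split_pdf(file, split_rule) and save_output_files are not defined in this
# article and must be supplied before the script can run.
def main():
    directory = input("Enter the directory containing the PDF files: ")
    pdf_files = get_pdf_files(directory)
    split_rule = get_split_rule()
    output_directory = get_output_directory()

    # Split every PDF according to the rule and save the resulting files
    for file in pdf_files:
        output_files = split_pdf(file, split_rule)
        save_output_files(output_files, output_directory)

    print("Splitting complete!")

if __name__ == "__main__":
    main()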
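The outline above relies on helpers that the article does not define. As one possible starting point, two of them can be sketched with the standard library alone. Everything here is an assumption made for illustration: `get_pdf_files` is given one plausible behavior (list the PDFs directly inside a directory), and `parse_split_rule` is a hypothetical helper that a `get_split_rule` function could call on the text returned by `input()`, treating the rule as "pages per output file".

```python
import tempfile
from pathlib import Path


def get_pdf_files(directory):
    """Return the PDF files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def parse_split_rule(text, default=5):
    """Interpret user input as pages per output file; blank input means default."""
    text = text.strip()
    if not text:
        return default
    pages = int(text)
    if pages <= 0:
        raise ValueError("pages per file must be a positive integer")
    return pages


# Small demonstration using empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in get_pdf_files(tmp)]
print(found)
```

The remaining helpers (`split_pdf` with a rule argument, `save_output_files`, `get_output_directory`) could follow the same pattern as the `split_pdf` function in Method 1.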
