Converting PDF to Word/TXT plain-text files with Python

Updated: 2018-06-07 14:41:37  Author: initiallysunny

This article walks through a Python script that converts a PDF file into a Word (.doc) or plain-text (.txt) file. The details are as follows.

Dependency: pdfminer3k

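Install it with pip (pip install pdfminer3k). Alternatively, download the package from the official site, unpack it, change into the folder, and run python setup.py install.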
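After installing, it can help to confirm that the package is importable before running the script. The following is a minimal sketch using only the standard library; the helper name is_installed is an assumption made for illustration and is not part of pdfminer3k.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # pdfminer3k installs its code under the package name "pdfminer"
    if is_installed('pdfminer'):
        print('pdfminer is available')
    else:
        print('pdfminer not found; try: pip install pdfminer3k')
```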

Source code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# In pdfminer3k, PDFDocument is imported from pdfminer.pdfparser
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal

'''
Parse a PDF file and collect the objects it contains
'''
 
# Parse a PDF file and write its text to test.doc
def parse(pdf_path):
  with open(pdf_path, 'rb') as fp: # open in binary read mode
    # Create a PDF parser from the file object
    parser = PDFParser(fp)
    # Create an empty PDF document object
    doc = PDFDocument()
    # Connect the parser and the document to each other
    parser.set_document(doc)
    doc.set_parser(parser)

    # Initialize the document;
    # with no argument, an empty password is used
    doc.initialize()

    # Stop if the document does not allow text extraction
    if not doc.is_extractable:
      raise PDFTextExtractionNotAllowed

    # Create a PDF resource manager to hold shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object that aggregates each page into a layout
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Counters for pages, images, curves, figures and horizontal text boxes
    num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

    # Open the output file once; 'w' keeps a re-run from appending duplicate text.
    # The .doc extension is only a name: the content is plain text.
    with open(r'test.doc', 'w', encoding='utf-8') as f: # output file name and path
      # Process one page at a time
      for page in doc.get_pages(): # doc.get_pages() yields the pages
        num_page += 1
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        # Only top-level objects are visited; images are usually nested
        # inside an LTFigure, so num_image may stay at 0
        for x in layout:
          if isinstance(x, LTImage): # image object
            num_image += 1
          elif isinstance(x, LTCurve): # curve object
            num_curve += 1
          elif isinstance(x, LTFigure): # figure object
            num_figure += 1
          elif isinstance(x, LTTextBoxHorizontal): # horizontal text box
            num_TextBoxHorizontal += 1
            # Save the text content
            f.write(x.get_text())
            f.write('\n')

  print('Object counts:')
  print('pages: %s' % num_page)
  print('images: %s' % num_image)
  print('curves: %s' % num_curve)
  print('figures: %s' % num_figure)
  print('horizontal text boxes: %s' % num_TextBoxHorizontal)
 
 
if __name__ == '__main__': 
  pdf_path = r'test.pdf' # path and file name of the PDF
  parse(pdf_path)
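The loop in parse() only inspects the objects directly under each page, so anything nested inside a container (for example an image inside a figure) is not counted. One possible way to count nested objects as well is a recursive walk. The sketch below assumes that layout containers are iterable and leaf objects are not, which matches pdfminer's layout classes as far as I know; the names walk and count_types are hypothetical helpers, not part of the library.

```python
from collections import Counter


def walk(item):
    """Yield item and, recursively, every object nested inside it."""
    yield item
    # Strings are iterable but are never layout containers; do not descend.
    if isinstance(item, (str, bytes)):
        return
    try:
        children = iter(item)
    except TypeError:
        # Leaf object (not iterable): nothing more to visit.
        return
    for child in children:
        yield from walk(child)


def count_types(layout):
    """Count every object under layout (including layout itself) by class name."""
    return Counter(type(obj).__name__ for obj in walk(layout))
```

Inside parse(), this could be used as count_types(layout)['LTImage'] right after device.get_result(). Note that text boxes are containers too, so the result also includes their lines and characters.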

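The parse() script above only extracts plain text from the PDF; no formatting is preserved. The output file test.doc is therefore plain text saved under a .doc extension.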
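Since the output is plain text either way, writing a .txt file instead of a .doc file only requires a different file name. The sketch below shows one way to derive the output name from the PDF path and write a list of extracted text blocks. The function name save_blocks and its parameters are assumptions for illustration; the blocks would come from calls such as x.get_text() in the script.

```python
from pathlib import Path


def save_blocks(blocks, pdf_path, suffix='.txt'):
    """Write text blocks to a file named after pdf_path, with a new suffix."""
    out_path = Path(pdf_path).with_suffix(suffix)
    with open(out_path, 'w', encoding='utf-8') as f:
        for text in blocks:
            # Drop trailing newlines, then separate blocks with one blank line
            f.write(text.rstrip('\n') + '\n\n')
    return out_path
```

Calling save_blocks(blocks, 'test.pdf') writes test.txt next to the PDF, and passing suffix='.doc' reproduces the file name used by the script.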

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 腳本之家.
