Extracting images from a Word file with Python and uploading them to Alibaba Cloud OSS

Updated: 2021-12-22 15:51:20  Author: 夢想橡皮擦

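This article shows how to extract all the images from a Word file with Python and upload them to Alibaba Cloud OSS. The sample code should be useful to anyone learning Python, so follow along.

The requirement comes from a real project. If your company builds a question-bank system, you will almost certainly run into it, so bookmark this article.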

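The requirement, briefly:

1. Extract the images in question stems and options from a Word document (to keep things simple, the format is fixed to docx);

2. Upload them to cloud OSS and return the image URLs;

3. In some scenarios, wrap each URL in an HTML img tag before returning it.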
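
Step 3 can be sketched on its own before any upload code exists. The helper below is only an illustration: the name to_img_tag and the example URL are made up for this sketch, and escaping the attribute value is an extra precaution that the requirement itself does not mention.

```python
from html import escape


def to_img_tag(url: str) -> str:
    """Wrap an image URL in an HTML img tag, escaping the attribute value."""
    return '<img src="{}" />'.format(escape(url, quote=True))


# Hypothetical URL, in the shape a public-read Bucket would produce.
print(to_img_tag("https://example-bucket.oss-cn-beijing.aliyuncs.com/1640159480.png"))
```

Keeping the tag building in one small function means the upload code only has to return a URL, and callers that do not need HTML can skip this step.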

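Hands-on

First, get the AccessKeyId and AccessKeySecret for your cloud OSS account. These two values are usually provided by an operations engineer. If your company is small and has no operations role, you will have to apply for them and configure them yourself.

Buying and setting up cloud OSS is straightforward: once a Bucket has been created, it is ready to use.

Then open the Bucket you created, go to its permission settings, and choose public read.

Next, put the following code in a Python file and check that it returns a Bucket object. The strings below must be exact: a mistake in any of them will cause requests to OSS to fail.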

import oss2

AccessKeyId = 'your AccessKeyId'
AccessKeySecret = 'your AccessKeySecret'
oss_auth = oss2.Auth(AccessKeyId, AccessKeySecret)
endpoint = 'http://oss-cn-beijing.aliyuncs.com'

bucket = oss2.Bucket(oss_auth, endpoint, 'Bucket name')
print(bucket)

The endpoint and Bucket name can be found on the Bucket's Overview page in the cloud OSS console.
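
Hard-coding keys in a source file is easy to get wrong, and a stray space inside the quotes is enough to break authentication. As a hedged alternative, the sketch below reads the settings from environment variables and strips surrounding whitespace. The variable names (OSS_ACCESS_KEY_ID and so on) and both function names are assumptions made for this example, not something the OSS SDK requires.

```python
import os


def load_oss_config(env=None):
    """Collect OSS settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    required = ("OSS_ACCESS_KEY_ID", "OSS_ACCESS_KEY_SECRET", "OSS_BUCKET_NAME")
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("missing OSS settings: " + ", ".join(missing))
    return {
        "access_key_id": env["OSS_ACCESS_KEY_ID"].strip(),
        "access_key_secret": env["OSS_ACCESS_KEY_SECRET"].strip(),
        "bucket_name": env["OSS_BUCKET_NAME"].strip(),
        # Falls back to the Beijing endpoint used throughout this article.
        "endpoint": env.get("OSS_ENDPOINT", "http://oss-cn-beijing.aliyuncs.com").strip(),
    }


def make_bucket(config):
    """Build the Bucket object with oss2.Auth and oss2.Bucket."""
    import oss2  # imported lazily so load_oss_config works without the SDK installed

    auth = oss2.Auth(config["access_key_id"], config["access_key_secret"])
    return oss2.Bucket(auth, config["endpoint"], config["bucket_name"])
```

With the variables exported in the shell, make_bucket(load_oss_config()) gives a Bucket object, and a missing variable fails early with a readable message.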

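Now for the Word images. The file is read with a third-party module, python-docx.

Before starting, prepare a Word document for testing. A few paragraphs of question text with images embedded in them, in the style of question stems and options, are enough.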
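
It can help to confirm that the test document really contains embedded images before writing any extraction code. A .docx file is a ZIP archive, and Word normally stores embedded images under word/media/ inside it, so the standard library is enough for a quick check. This is only a sanity check, not the extraction method used in this article; the function name list_embedded_images is made up, and the in-memory archive stands in for a real document.

```python
import io
import zipfile


def list_embedded_images(docx_file):
    """Return the names of the parts stored under word/media/ in a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


# A tiny in-memory archive standing in for a real Word document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"fake image bytes")
buffer.seek(0)
print(list_embedded_images(buffer))
```

For a real file, pass its path instead of the buffer. An empty list means the document has no embedded images to extract.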

First, use python-docx to read all the paragraphs in the document through paragraphs, with the following code:

import oss2
import time

from docx import Document

def get_questions():
    document = Document(docx='test_word_document.docx')
    for p in document.paragraphs:
        print(p.text)

if __name__ == '__main__':
    get_questions()

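The key part of this code is document.paragraphs. The property returns the document content paragraph by paragraph, and each paragraph object's .text attribute gives the text inside it.

At this point the code cannot reach the images inside a paragraph. They can be extracted with the following code.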

import oss2
import time

from docx import Document

# Get the first image in a paragraph of the Word document
def get_picture(document, paragraph):
    img = paragraph._element.xpath('.//pic:pic')
    if not img:
        return
    img = img[0]
    embed = img.xpath('.//a:blip/@r:embed')[0]
    related_part = document.part.related_parts[embed]
    image = related_part.image
    return image

def get_questions():
    document = Document(docx='test_word_document.docx')

    for p in document.paragraphs:
        # Read the image
        img = get_picture(document, p)

        print(img)
        if img is not None:
            # Print the image file name
            print(img.filename)
            # Print the image extension
            print(img.ext)
            # Print the image's binary data
            # print(img.blob)
        print(p.text)

if __name__ == '__main__':
    get_questions()

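The most important function in this code is get_picture(). The core idea is that a docx document is built from XML, so paragraph._element.xpath() can be used to pull data out of a paragraph: it finds the picture element, reads the relationship id from its r:embed attribute, and looks up the matching image part. Note that the function returns only the first picture in a paragraph.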
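
To make the XML lookup more concrete, here is a small standard-library sketch that performs the same two queries (pic:pic, then a:blip with its r:embed attribute) on a hand-written paragraph. The sample XML is heavily simplified and the function name embed_ids is made up for this example. Unlike get_picture(), it collects the ids of every picture in the paragraph rather than only the first.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by Word documents and their drawings.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Simplified, hand-written paragraph XML (a real document has more wrapper elements).
SAMPLE_PARAGRAPH = """
<w:p xmlns:w="{w}" xmlns:a="{a}" xmlns:pic="{pic}" xmlns:r="{r}">
  <w:r><w:t>Question 1</w:t></w:r>
  <w:r><w:drawing><a:graphic><a:graphicData>
    <pic:pic><pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill></pic:pic>
  </a:graphicData></a:graphic></w:drawing></w:r>
</w:p>
""".format(**NS).strip()


def embed_ids(paragraph_xml):
    """Return the relationship ids of every picture inside one paragraph."""
    root = ET.fromstring(paragraph_xml)
    embed_attr = "{%s}embed" % NS["r"]
    ids = []
    for pic in root.findall(".//pic:pic", NS):
        for blip in pic.findall(".//a:blip", NS):
            if embed_attr in blip.attrib:
                ids.append(blip.attrib[embed_attr])
    return ids


print(embed_ids(SAMPLE_PARAGRAPH))
```

In get_picture(), an id such as rId5 is what document.part.related_parts is indexed with to reach the image part.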

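For each paragraph, the code prints the image object (or None when the paragraph has no image), then the image's file name and extension if there is one, and finally the paragraph text.

The comment in the code above points to one more image attribute, img.blob, which holds the image's binary data.

That binary data can be written straight to cloud OSS. After that, build the image's access URL and put it into an img tag.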

import oss2
import time

from docx import Document

# Get the first image in a paragraph of the Word document
def get_picture(document, paragraph):
    img = paragraph._element.xpath('.//pic:pic')
    if not img:
        return
    img = img[0]
    embed = img.xpath('.//a:blip/@r:embed')[0]
    related_part = document.part.related_parts[embed]
    image = related_part.image
    return image

# Upload one image to OSS and return an img tag, or False on failure
def ret_up_imgurl(image):
    blob = image.blob
    # File extension, without the leading dot
    ext = image.ext

    AccessKeyId = 'your AccessKeyId'
    AccessKeySecret = 'your AccessKeySecret'
    oss_auth = oss2.Auth(AccessKeyId, AccessKeySecret)
    endpoint = 'http://oss-cn-beijing.aliyuncs.com'

    bucket = oss2.Bucket(oss_auth, endpoint, 'Bucket name')

    base_file_url = 'https://Bucket name.oss-cn-beijing.aliyuncs.com/'
    # Build a file name from the current timestamp (second resolution)
    file_name = str(int(time.time())) + "." + ext
    # Upload the binary data
    res = bucket.put_object(file_name, blob)
    if res.status == 200:
        # Return the img tag
        img_format = '<img src="{}" />'
        return img_format.format(base_file_url + file_name)
    else:
        return False

def get_questions():
    document = Document(docx='test_word_document.docx')

    for p in document.paragraphs:
        print(p.text)
        img = get_picture(document, p)

        print(img)
        if img is not None:
            print(img.filename)
            print(img.ext)
            # Pass the image object itself: ret_up_imgurl() reads .blob and .ext from it
            print(ret_up_imgurl(img))

if __name__ == '__main__':
    get_questions()

The key part of this code is bucket.put_object(file_name, blob), which uploads the image's binary data to the OSS Bucket.
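
One thing to watch: in ret_up_imgurl() the file name comes from int(time.time()), so two images uploaded within the same second get the same name, and by default the later upload overwrites the earlier one. A possible way around this, sketched below, is to build the name from a random UUID. The function names, the questions prefix, and the date folder are all assumptions made for this example.

```python
import uuid
from datetime import datetime, timezone


def make_object_key(ext, prefix="questions"):
    """Build a collision-resistant object name such as questions/20211222/<hex>.png."""
    day = datetime.now(timezone.utc).strftime("%Y%m%d")
    return "{}/{}/{}.{}".format(prefix, day, uuid.uuid4().hex, ext.lstrip("."))


def public_url(bucket_name, object_key, region_host="oss-cn-beijing.aliyuncs.com"):
    """Compose the access URL in the same shape as base_file_url plus the file name."""
    return "https://{}.{}/{}".format(bucket_name, region_host, object_key)


key = make_object_key("png")
print(key)
print(public_url("example-bucket", key))
```

The result of make_object_key(image.ext) could then be passed to bucket.put_object() in place of file_name.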
