Extracting and Renaming Images Embedded in Excel with Python

Updated: 2025-04-10 11:08:36  Author: 一晌小貪歡

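In everyday office work we often need to extract the images embedded in Excel cells and name each extracted image after the value in a chosen column of the same row. This article implements that with Python.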

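1. Background

Images embedded in Excel cells often have to be pulled out of the workbook, named after the value of a chosen column in the same row, and stored in a folder.

Python can automate both the extraction and the saving.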

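2. Installing the libraries

Library	Purpose	Installation
xmltodict	Parsing XML files	`pip install xmltodict -i https://pypi.tuna.tsinghua.edu.cn/simple/`
pandas	Reading and writing Excel files	`pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple/`
os	Path handling	Built in, no installation needed
json	Reading and writing JSON files	Built in, no installation needed
re	Regular expressions	Built in, no installation needed
shutil	File operations	Built in, no installation needed
tempfile	Creating temporary files and directories	Built in, no installation needed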

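3. Main classes and methods

The ExcelImageProcessor class

Initialization (__init__): takes the Excel file path, the name of the column used for renaming (id_column), and the name of the column that contains the images (image_column).
Create temporary workspace (create_temp_workspace): creates a temporary directory, copies the Excel file into it as a .zip archive, and extracts the archive.
Clean up temporary workspace (cleanup_temp_workspace): deletes the temporary directory and its contents.
Extract image ID (extract_image_id): pulls the image ID out of a string with the regular expression ID_[A-F0-9]+.
Get cell-to-image-ID mapping (get_cell_image_mapping): parses the worksheet XML (xl/worksheets/sheet1.xml) to map cell positions to image IDs.
Get image-ID-to-rId mapping (get_image_rId_mapping): parses cellimages.xml to map image IDs to rIds.
Get rId-to-target mapping (get_cellimages_rels): parses cellimages.xml.rels to map rIds to target file paths.
Process the Excel file (process_excel): calls the methods above to build a direct mapping from cell to image file path, and reads the workbook into a DataFrame.
Copy and rename images (copy_and_rename_images): uses the mapping to copy each image to the output directory under its new name.
Main flow (process): calls the methods above to run the whole pipeline.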
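
The lookups listed above chain together as cell value -> image ID -> rId -> file. The following is a minimal sketch of that composition on made-up data: the cell values, IDs and file names are illustrative rather than taken from a real workbook, and `chain_mappings` is a hypothetical helper, not part of the article's class.

```python
import re


def extract_image_id(text):
    """Return the first ID_<hex> token found in text, or None."""
    match = re.search(r"ID_[A-F0-9]+", text)
    return match.group() if match else None


def chain_mappings(cell_values, id_to_rid, rid_to_file):
    """Compose cell value -> image ID -> rId -> file into a cell -> file dict.

    Cells whose ID or rId cannot be resolved are skipped.
    """
    cell_to_file = {}
    for cell, value in cell_values.items():
        image_id = extract_image_id(value)
        rid = id_to_rid.get(image_id)
        target = rid_to_file.get(rid)
        if target is not None:
            cell_to_file[cell] = target
    return cell_to_file


# Illustrative sample data (hypothetical values)
cell_values = {
    "B2": '=DISPIMG("ID_0A1B2C3D",1)',
    "B3": '=DISPIMG("ID_FFFFFFFF",1)',  # no matching entry in id_to_rid
    "B4": "plain text",                 # no image ID at all
}
id_to_rid = {"ID_0A1B2C3D": "rId1"}
rid_to_file = {"rId1": "media/image1.png"}

cell_to_file = chain_mappings(cell_values, id_to_rid, rid_to_file)
print(cell_to_file)
```

Skipping unresolved cells is a design choice made for this sketch; raising an error instead would surface inconsistent workbooks earlier.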

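The main function

Provides a usage example: it sets default arguments (the "訂單號" (order number) and "圖片" (image) columns) and calls the process method of ExcelImageProcessor to extract and rename the images.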

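Key points

Temporary directory management: the temporary directory is created with tempfile.mkdtemp() and removed with shutil.rmtree() once processing finishes.
XML parsing: the xmltodict library parses the XML files inside the Excel archive to extract the required information.
Image path correction: each relationship target is made to start with media/ and is then joined onto the extracted xl directory to form the full path.
Exception handling: the main flow catches exceptions, prints the error message and re-raises; the temporary directory is cleaned up in a finally block either way.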
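
The temporary-directory point can also be sketched with standard-library tools only. The example below swaps in `tempfile.TemporaryDirectory` and `zipfile` in place of the `mkdtemp()` / `rmtree()` / `unpack_archive()` combination used in the article, and builds a tiny stand-in archive so it runs without a real workbook. `list_archive_members` and the sample entries are assumptions made for illustration.

```python
import os
import tempfile
import zipfile


def list_archive_members(archive_path):
    """Extract a zip-format file into a temporary directory and list its files."""
    members = []
    # TemporaryDirectory removes the directory when the with-block exits,
    # even if extraction raises an exception.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(tmp_dir)
        for root, _dirs, files in os.walk(tmp_dir):
            for name in files:
                full_path = os.path.join(root, name)
                rel_path = os.path.relpath(full_path, tmp_dir)
                members.append(rel_path.replace(os.sep, "/"))
    return sorted(members)


# Build a small stand-in archive (hypothetical contents, not a valid workbook)
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.xlsx")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("xl/cellimages.xml", "<root/>")
        archive.writestr("xl/media/image1.png", b"not a real image")
    demo_members = list_archive_members(demo_path)

print(demo_members)
```

Running the same helper on a real workbook is a quick way to check whether it contains xl/cellimages.xml before attempting the full extraction.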

輸出結(jié)果

提取的圖片保存在圖片輸出目錄下，文件名基于指定列（如“訂單號”）重命名。
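
Because the file names come straight from cell values, a value may contain characters that are not valid in file names, or the same value may appear in more than one row. The sketch below shows one possible way to guard against both. `safe_filename` is a hypothetical helper that is not part of the article's code, and the sample values are invented.

```python
import re


def safe_filename(value, ext, used):
    """Build a file name from a cell value, avoiding illegal characters and clashes."""
    # Replace characters that are not allowed in Windows file names
    stem = re.sub(r'[\\/:*?"<>|]', "_", str(value)).strip() or "unnamed"
    candidate = f"{stem}{ext}"
    counter = 1
    # Append _1, _2, ... when the same name was already produced
    while candidate in used:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used.add(candidate)
    return candidate


used_names = set()
names = [safe_filename(v, ".png", used_names) for v in ["A/001", "A/001", "B002"]]
print(names)
```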

4. Complete code

# -*- coding: UTF-8 -*-
'''
@Project : 45-Excel embedded image extraction
@File    : 通用版本.py
@IDE     : PyCharm
@Author  : 一晌小貪歡 (278865463@qq.com)
@Date    : 2025/3/18 20:49
'''
import pandas as pd
import os
import re
import xmltodict
import shutil
import tempfile


class ExcelImageProcessor:
    def __init__(self, excel_path, id_column, image_column):
        """
        Initialize the processor
        Args:
            excel_path: path to the Excel file
            id_column: name of the column whose values are used as file names
            image_column: name of the column that contains the images
        """
        self.excel_path = excel_path
        self.id_column = id_column
        self.image_column = image_column
        self.temp_dir = None
        self.extract_dir = None
        self.output_dir = "圖片輸出"

    def create_temp_workspace(self):
        """Create a temporary workspace and copy the Excel file into it"""
        self.temp_dir = tempfile.mkdtemp()

        # Copy the Excel file into the temporary directory
        excel_name = os.path.basename(self.excel_path)
        temp_excel = os.path.join(self.temp_dir, excel_name)
        shutil.copy2(self.excel_path, temp_excel)

        # Create the extraction directory
        self.extract_dir = os.path.join(self.temp_dir, 'extracted')
        os.makedirs(self.extract_dir)

        # Copy the Excel file as a .zip and extract it
        zip_path = os.path.join(self.temp_dir, 'temp.zip')
        shutil.copy2(temp_excel, zip_path)
        shutil.unpack_archive(zip_path, self.extract_dir, 'zip')

    def cleanup_temp_workspace(self):
        """Remove the temporary workspace"""
        if self.temp_dir and os.path.exists(self.temp_dir):
            shutil.rmtree(self.temp_dir)

    def extract_image_id(self, text):
        """Extract the image ID from a string"""
        match = re.search(r'ID_[A-F0-9]+', text)
        return match.group() if match else None

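    def get_cell_image_mapping(self, sheet_xml_path):
        """
        Extract the mapping between cells and image IDs from the sheet XML file
        Args:
            sheet_xml_path: path to the sheet XML file
        Returns:
            dict: maps cell position to image ID
        """
        cell_image_dict = {}

        # Return an empty mapping if the file does not exist
        if not os.path.exists(sheet_xml_path):
            return cell_image_dict

        # Read and parse the XML file; force_list keeps 'row' and 'c' as lists
        # even when the sheet has a single row or a row has a single cell
        with open(sheet_xml_path, 'r', encoding='utf-8') as file:
            xml_content = file.read()
            sheet_dict = xmltodict.parse(xml_content, force_list=('row', 'c'))

        # Walk the rows under worksheet/sheetData (an empty sheet has none)
        sheet_data = sheet_dict['worksheet'].get('sheetData') or {}
        for row in sheet_data.get('row', []):
            # Walk the cells in each row
            for cell in row.get('c', []):
                # Check whether the cell value contains the DISPIMG function
                value = cell.get('v')
                if isinstance(value, str) and 'DISPIMG' in value:
                    cell_pos = cell['@r']
                    image_id = self.extract_image_id(value)
                    if image_id:
                        cell_image_dict[cell_pos] = image_id

        return cell_image_dict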

    def get_image_rId_mapping(self, cellimages_xml):
        """
        Extract the mapping between image IDs and rIds from cellimages.xml
        Args:
            cellimages_xml: path to cellimages.xml
        Returns:
            dict: maps image ID to rId
        """
        image_rId = {}

        # Read and parse the XML file; force_list keeps 'etc:cellImage' a list
        # even when the workbook holds a single embedded image
        with open(cellimages_xml, 'r', encoding='utf-8') as file:
            xml_content = file.read()
            cellimages_dict = xmltodict.parse(xml_content, force_list=('etc:cellImage',))

        # Walk the image entries
        for image in cellimages_dict['etc:cellImages']['etc:cellImage']:
            image_id = image['xdr:pic']['xdr:nvPicPr']['xdr:cNvPr']['@name']
            r_id = image['xdr:pic']['xdr:blipFill']['a:blip']['@r:embed']
            image_rId[image_id] = r_id

        return image_rId

    def get_cellimages_rels(self, cellimages_rels_xml):
        """
        Read and parse the relationship mapping from cellimages.xml.rels
        Args:
            cellimages_rels_xml: path to cellimages.xml.rels
        Returns:
            dict: maps rId to target file
        """
        rels_dict = {}

        # Read and parse the XML file; force_list keeps 'Relationship' a list
        # even when there is only one relationship
        with open(cellimages_rels_xml, 'r', encoding='utf-8') as file:
            xml_content = file.read()
            rels = xmltodict.parse(xml_content, force_list=('Relationship',))

        # Walk the Relationship entries under Relationships
        for rel in rels['Relationships']['Relationship']:
            r_id = rel['@Id']
            # Make sure the path starts with media/
            target = rel['@Target']
            if not target.startswith('media/'):
                target = f"media/{target}"
            rels_dict[r_id] = target

        return rels_dict

    def process_excel(self):
        """Process the Excel file and extract the image mappings"""
        # Create the temporary workspace
        self.create_temp_workspace()

        # Build the paths of the extracted files
        sheet_xml = os.path.join(self.extract_dir, "xl/worksheets/sheet1.xml")
        cellimages_xml = os.path.join(self.extract_dir, "xl/cellimages.xml")
        rels_xml = os.path.join(self.extract_dir, "xl/_rels/cellimages.xml.rels")

        # Get the mapping at each level
        cell_to_id = self.get_cell_image_mapping(sheet_xml)
        id_to_rid = self.get_image_rId_mapping(cellimages_xml)
        rid_to_file = self.get_cellimages_rels(rels_xml)

        # Build the direct mapping from cell to file location
        cell_to_file = {}
        for cell, image_id in cell_to_id.items():
            rid = id_to_rid[image_id]
            file_path = rid_to_file[rid]
            cell_to_file[cell] = file_path

        # Read the Excel file
        df = pd.read_excel(self.excel_path)

        return df, cell_to_file

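    def copy_and_rename_images(self, df, cell_to_file):
        """
        Copy and rename the images
        Args:
            df: DataFrame object
            cell_to_file: mapping from cell to file path
        """
        # Create the output directory
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)

        # Find the 0-based position of the image column
        if self.image_column not in df.columns:
            raise ValueError(f"Column not found: {self.image_column}")
        image_column_idx = list(df.columns).index(self.image_column)

        # Convert the position to an Excel column letter (A..Z, AA, AB, ...)
        column_letter = ""
        n = image_column_idx + 1
        while n > 0:
            n, rem = divmod(n - 1, 26)
            column_letter = chr(65 + rem) + column_letter

        # Process each row
        for index, row in df.iterrows():
            id_value = row[self.id_column]
            # Row 1 holds the header, so DataFrame row 0 is worksheet row 2
            cell_ref = f"{column_letter}{index + 2}"

            if cell_ref in cell_to_file:
                # Get the source image path
                image_rel_path = cell_to_file[cell_ref]
                # Prepend the extracted xl directory to form the full path
                image_path = os.path.join(self.extract_dir, "xl", image_rel_path)

                # Get the file extension
                _, ext = os.path.splitext(image_rel_path)

                # Build the new file name and path
                new_filename = f"{id_value}{ext}"
                new_path = os.path.join(self.output_dir, new_filename)

                # Copy the image under its new name
                if os.path.exists(image_path):
                    shutil.copy2(image_path, new_path)
                else:
                    print(f"Warning: image file not found {image_path}")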

    def process(self):
        """Main processing flow"""
        try:
            # Process the Excel file to get the mappings
            df, cell_to_file = self.process_excel()

            # Copy and rename the images
            self.copy_and_rename_images(df, cell_to_file)

            print(f"Done! Images saved to the {self.output_dir} directory")

        except Exception as e:
            print(f"An error occurred during processing: {str(e)}")
            raise
        finally:
            # Clean up the temporary directory only after everything else has finished
            self.cleanup_temp_workspace()


def main():
    # Usage example
    """
        Pass three arguments: the Excel path, the naming column and the image column.
        Images are named row by row and exported to the "圖片輸出" folder.
    """
    if not os.path.exists('./圖片輸出/'):
        os.makedirs('./圖片輸出/')
    # Use the first file found in the data source folder
    excel_path = "./數據源/" + os.listdir('./數據源/')[0]
    processor = ExcelImageProcessor(
        excel_path=excel_path,
        id_column="訂單號",
        image_column="圖片"
    )
    processor.process()


if __name__ == "__main__":
    main()
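
For readers who prefer to avoid the third-party dependency, the rId-to-target lookup can also be sketched with the standard library's xml.etree.ElementTree in place of xmltodict. This is an alternative technique, not the article's method. The sample XML below is hand-written for illustration, and `parse_rels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts (.rels files) in Office Open XML packages
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

# Hand-written sample with a single relationship (illustrative only)
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" '
    'Target="media/image1.png"/>'
    '</Relationships>'
)


def parse_rels(xml_text):
    """Return a dict mapping each relationship Id to its Target."""
    root = ET.fromstring(xml_text)
    mapping = {}
    # iter() yields every matching element, so one relationship or many
    # are handled by the same loop
    for rel in root.iter(RELS_NS + "Relationship"):
        mapping[rel.get("Id")] = rel.get("Target")
    return mapping


rels_mapping = parse_rels(SAMPLE_RELS)
print(rels_mapping)
```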