How to parse a Word document with Java (including formatting and images)

Updated: 2023-12-23 10:35:18  Author: 未婚男子王某

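Today I ran into a requirement to read the contents of a Word template. This article shows how to parse a Word document with Java, formatting and images included, and walks through the code in detail for anyone who needs a reference.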

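1. Requirements:

a. Parse the contents of the Word document according to its heading hierarchy

b. Handle the file regardless of its extension

c. Preserve the Word styles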

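2. Approach:

The overall approach splits into storing and fetching. What gets stored is the document's titles, content, and images; what gets fetched is the document's tree structure.

(1). Store: extract the titles, content, and images from the Word document and persist them

a. When a document is uploaded, take its name and save it to a database table; this produces an id, the documentId

b. After parsing the Word document, walk the titles in order and save each one. Parent and child titles are linked through parentId: the parentId field of a child title holds the id of its parent title

c. Add the documentId to every title so that later queries can be scoped to one document

d. A root node, that is, a level-1 title, has parent id 0

e. If an element is a p tag, link it to the heading directly above it: the p tag's parent id is the id of that heading

f. Once everything is linked, build a tree structure, then traverse it and save it to the third-party platform's database.

(2). Fetch: query the document's tree structure by documentId and return it

a. Query all titles of the document by documentId and put them in a collection

b. Traverse the collection. To make full use of recursion, the parent of the top-level titles is defined as 0; recursion starts from the elements whose parent is 0, and the results are assembled into a tree structure and returned.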
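
The "store" half of this approach can be illustrated without any third-party platform. The following is only a minimal in-memory sketch: the class names TitleRowSketch, Node, and Row, and the counter that stands in for database-generated ids, are assumptions made for illustration and do not appear in the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory stand-in for the title table; not the article's real storage. */
final class TitleRowSketch {

    /** A nested heading as it might come out of parsing (assumed shape). */
    static final class Node {
        final String title;
        final List<Node> children;

        Node(String title, Node... children) {
            this.title = title;
            this.children = Arrays.asList(children);
        }
    }

    /** One stored row: a heading linked to its parent and to its document. */
    static final class Row {
        final String id;
        final String parentId;
        final String documentId;
        final String title;

        Row(String id, String parentId, String documentId, String title) {
            this.id = id;
            this.parentId = parentId;
            this.documentId = documentId;
            this.title = title;
        }
    }

    private int nextId = 1;                 // stand-in for ids generated by the database
    final List<Row> rows = new ArrayList<>();

    /** Depth-first save: every child row records the id of the row saved for its parent. */
    void save(List<Node> nodes, String documentId, String parentId) {
        for (Node node : nodes) {
            String id = String.valueOf(nextId++);
            rows.add(new Row(id, parentId, documentId, node.title));
            save(node.children, documentId, id);
        }
    }

    public static void main(String[] args) {
        TitleRowSketch store = new TitleRowSketch();
        List<Node> tree = Arrays.asList(
                new Node("Chapter 1", new Node("Section 1.1"), new Node("Section 1.2")),
                new Node("Chapter 2"));
        store.save(tree, "doc-1", "0");     // level-1 titles get parent id "0"
        for (Row r : store.rows) {
            System.out.println(r.id + " parent=" + r.parentId + " " + r.title);
        }
    }
}
```

Running main prints four rows: "1 parent=0 Chapter 1", "2 parent=1 Section 1.1", "3 parent=1 Section 1.2", and "4 parent=0 Chapter 2". Every row also carries the same documentId, which is what makes the later lookup by document possible.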

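3. Notes:

(1). Most of my work runs against a third-party platform, so methods such as DmeTestRequestUtil.getDmeResult() mainly handle the interaction with that platform; for security reasons they are not shown here.

(2). Preserving the Word styles: the question is how to obtain the styles at all, and the answer is to obtain an HTML file

(3). Whether the styles appear inline or in some other form during parsing: they can be converted to inline styles

(4). Why not POI: the most unpleasant part of POI is having to deal with Word versions (.doc and .docx are handled by different APIs, HWPF and XWPF)

(5). Main dependencies: aspose, jsoup

(6). Parsing the hierarchy mostly comes down to judging the order of the h and p tags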

4. Code:

(1). Dependencies

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>15.8.0</version>
</dependency>

(2). Logic for parsing the Word document and saving it through the third-party API

/**
     * Parse a Word document
     * @param multipartFile the file received from the front end; convert the MultipartFile to a File if your use case calls for it
     * @return TitleTreeVO the entity that holds the titles
     * @author WangKuo
     * @date 2023/7/27 11:01
     */
public TitleTreeVO wordAnalysis(MultipartFile multipartFile) throws IOException {
        byte[] byteArr = multipartFile.getBytes();
        InputStream inputStream = new ByteArrayInputStream(byteArr);
        List<DocumentContentVO> documentContentVOList = new LinkedList<>();
        // Not populated in this method; the tree is read back later by documentId, see (3)
        TitleTreeVO titleTreeVO = new TitleTreeVO();
        try {
            // Turn the stream into an Aspose Document
            com.aspose.words.Document doc = new com.aspose.words.Document(inputStream);
            // Choose the output format: HtmlSaveOptions converts to HTML
            HtmlSaveOptions saveOptions = new HtmlSaveOptions();
            saveOptions.setExportImagesAsBase64(false);
            // Write every image in the Word document to the temp folder; the links in the HTML then point to absolute paths in that folder
            String property = System.getProperty("java.io.tmpdir");
            saveOptions.setImagesFolder(property);
            // Declared type and constructor now name the same class
            org.apache.commons.io.output.ByteArrayOutputStream baos = new org.apache.commons.io.output.ByteArrayOutputStream();
            doc.save(baos, saveOptions);
            String token = DmeTestRequestUtil.getToken();
            // Parse the HTML into a jsoup Document for the steps that follow.
            // Aspose writes HTML as UTF-8 unless configured otherwise, so decode with that charset rather than the platform default
            org.jsoup.nodes.Document htmlDoc = Jsoup.parse(baos.toString("UTF-8"));
            // Rewrite the src path of each image in the HTML
            this.setImagePath(htmlDoc, token);
            // Use the Word file name without its extension as the document name
            String substring = multipartFile.getOriginalFilename().substring(0, multipartFile.getOriginalFilename().lastIndexOf("."));
            JSONObject docParam = this.getDocParam(substring);
            String saveDocUrl = "https://dme.cn-south-4.huaweicloud.com/rdm_hwdmeverify_app/publicservices/api/DocumentSave/create";
            // First create a document record from the name; the id it returns is what each title entity links to
            String dmeResult = DmeTestRequestUtil.getDmeResult(saveDocUrl, docParam, token);
            JSONObject jsonObject1 = JSONObject.parseObject(dmeResult);
            List data1 = jsonObject1.getObject("data", List.class);
            JSONObject jsonObjectData1 = (JSONObject) data1.get(0);
            String id = jsonObjectData1.getString("id"); // document id
            // Extract the tree of headings and their content from the HTML
            documentContentVOList = this.exactContentFromHtml(htmlDoc);
            this.dmeSave(documentContentVOList, id, "0", token); // top-level titles use parent id 0
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            inputStream.close();
        }
        return titleTreeVO;
    }
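
File names that arrive from a browser can be null, can lack an extension, or can carry a client-side path. A minimal sketch of a defensive helper for deriving the document name follows; the class FileNameSketch, the method baseName, and the "untitled" fallback are hypothetical and not part of the original code.

```java
/** Hypothetical helper for deriving a document name from an uploaded file name. */
final class FileNameSketch {

    /** Returns the name without its last extension; null or blank input yields a fallback. */
    static String baseName(String originalFilename) {
        if (originalFilename == null || originalFilename.trim().isEmpty()) {
            return "untitled"; // assumed fallback; pick whatever suits the application
        }
        // Some clients send a full path; keep only the last path segment.
        int lastSeparator = Math.max(originalFilename.lastIndexOf('/'),
                originalFilename.lastIndexOf('\\'));
        String name = originalFilename.substring(lastSeparator + 1);
        // A dot at position 0 (for example ".hidden") is treated as part of the name.
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        System.out.println(baseName("report.docx"));
        System.out.println(baseName("report"));
    }
}
```

Both calls in main print "report"; the second one would throw StringIndexOutOfBoundsException with a plain substring(0, lastIndexOf(".")).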

/**
     * Set the image paths (src).
     * When the document is converted to HTML, each image address must be replaced: the initial temp-file path becomes the image's location on the server.
     * @param document the document content converted to HTML
     * @param token access token for the third-party platform (passed in by wordAnalysis)
     * @author Wangkuo
     * @date 2023/7/25 21:40
     */
    private void setImagePath(org.jsoup.nodes.Document document, String token) throws IOException {
        Elements imgs = document.select("img");
        for (Element img : imgs) {
            // Read the address in the src attribute
            String src = img.attr("src");
            // Locate the file at that address
            File file = new File(src);
            // Convert the File to a MultipartFile; try-with-resources closes the stream even if reading fails
            MultipartFile multipartFile;
            try (FileInputStream input = new FileInputStream(file)) {
                multipartFile = new MockMultipartFile("file", file.getName(), "text/plain", IOUtils.toByteArray(input));
            }
            // These are parameters the third-party API requires; they are fixed values here because they do not affect the result I need
            FormVo formVo = new FormVo();
            formVo.setAttributeName("File");
            formVo.setModelName("Document");
            formVo.setApplicationId("-1");
            String uploadImgUrl = "<third-party URL for uploading the image as a file>";
            String uploadImage = DmeTestRequestUtil.getDmeResultUploadFile(uploadImgUrl, multipartFile, formVo, token);
            JSONObject uploadImgJs = JSONObject.parseObject(uploadImage);
            List data = uploadImgJs.getObject("data", List.class);
            // After the upload the third-party API returns a file id, which can be used to preview and download the file
            String id = (String) data.get(0);
            // Build the preview path from the returned id and write it back into the HTML document
            String imgPath = "/api/dme-library/LibraryFolder/preview?fileId=" + id;
            img.attr("src", imgPath);
            // Delete the temp file right away; deleteOnExit() would keep it until the JVM shuts down
            file.delete();
        }
    }

/**
     * Build the request parameters.
     * The third-party API prescribes this parameter format.
     * @param name the document name
     * @return com.alibaba.fastjson.JSONObject
     * @author Wangkuo
     * @date 2023/7/27 11:23
     */
    private JSONObject getDocParam(String name) {
        Map<String, Object> mapStr = new HashMap<>();
        Map<String, Object> paramMap = new HashMap<>();
        paramMap.put("name", name);
        mapStr.put("params", paramMap);
        JSONObject jsonObject = new JSONObject(mapStr);
        return jsonObject;
    }

/**
     * Build the heading tree from the HTML
     * @param htmlDoc the HTML document produced from the Word file
     * @return java.util.List<DocumentContentVO>
     * @author Wangkuo
     * @date 2023/7/27 11:26
     */
    private List<DocumentContentVO> exactContentFromHtml(org.jsoup.nodes.Document htmlDoc) throws Exception {
        Elements eleList = htmlDoc.getElementsByTag("h1");
        if (eleList == null || eleList.size() == 0) {
            throw new Exception("The uploaded file contains no level-1 heading; please check it.");
        }

        Element hElement = htmlDoc.selectFirst("h1"); // start at the first Heading 1 and walk downward
        List<DocumentContentVO> allTreeList = new ArrayList<>();
        List<DocumentContentVO> list2 = new ArrayList<>();
        List<DocumentContentVO> list3 = new ArrayList<>();
        List<DocumentContentVO> list4 = new ArrayList<>();
        DocumentContentVO b1Map = new DocumentContentVO();
        DocumentContentVO b2Map = new DocumentContentVO();
        DocumentContentVO b3Map = new DocumentContentVO();
        DocumentContentVO b4Map = new DocumentContentVO();
        DocumentContentVO bMap = b1Map; // the node that currently receives body content
        // Add the first heading up front
        int i = 1;
        b1Map.setTitle(hElement.toString());
        b1Map.setIndex(i);
        allTreeList.add(b1Map);
        while (hElement.nextElementSibling() != null) { // while there is a next sibling element
            i++;
            hElement = hElement.nextElementSibling();
            String nodeName = hElement.nodeName();
            if (Objects.equals(nodeName, "h1")) {
                b1Map = new DocumentContentVO();
                bMap = b1Map;
                b1Map.setTitle(hElement.toString());
                b1Map.setIndex(i);
                allTreeList.add(b1Map);
                list2 = new ArrayList<>(); // a new h1 starts a fresh list of h2 children
            } else if (Objects.equals(nodeName, "h2")) {
                b2Map = new DocumentContentVO();
                bMap = b2Map;
                list3 = new ArrayList<>(); // a new h2 starts a fresh list of h3 children
                b2Map.setTitle(hElement.toString());
                b2Map.setIndex(i);
                list2.add(b2Map);
                b1Map.setChildList(list2);
            } else if (Objects.equals(nodeName, "h3")) {
                b3Map = new DocumentContentVO();
                bMap = b3Map;
                // A new h3 needs its own h4 list; without this reset every h3 would share one list of h4 children
                list4 = new ArrayList<>();
                b3Map.setTitle(hElement.toString());
                b3Map.setIndex(i);
                list3.add(b3Map);
                b2Map.setChildList(list3);
            } else if (Objects.equals(nodeName, "h4")) {
                b4Map = new DocumentContentVO();
                bMap = b4Map;
                b4Map.setTitle(hElement.toString());
                b4Map.setIndex(i);
                list4.add(b4Map);
                b3Map.setChildList(list4);
            } else {
                // Any non-heading element (p, table, ...) is appended to the content of the nearest heading above it
                bMap.setContent(bMap.getContent() == null ? hElement.toString() : bMap.getContent() + hElement.toString());
            }
        }
        return allTreeList;
    }
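
The method above tracks four heading levels with one list and one variable per level. The same ordering rule for h and p tags can also be written with a stack, which handles h1 through h6 uniformly. This is a sketch of that alternative technique, not the author's implementation: the class HeadingStackSketch, its Section type, and the {tagName, outerHtml} pair representation are assumptions, chosen so the example runs without jsoup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stack-based variant of the heading nesting logic. */
final class HeadingStackSketch {

    static final class Section {
        final int level;                          // 1 for h1, 2 for h2, ...
        final String title;
        final StringBuilder content = new StringBuilder();
        final List<Section> children = new ArrayList<>();

        Section(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Returns 1..6 for h1..h6 and 0 for any other tag name. */
    static int headingLevel(String tag) {
        if (tag.length() == 2 && tag.charAt(0) == 'h'
                && tag.charAt(1) >= '1' && tag.charAt(1) <= '6') {
            return tag.charAt(1) - '0';
        }
        return 0;
    }

    /** Each element is a {tagName, outerHtml} pair, in document order. */
    static List<Section> build(List<String[]> elements) {
        List<Section> roots = new ArrayList<>();
        Deque<Section> open = new ArrayDeque<>();  // innermost open heading on top
        for (String[] el : elements) {
            int level = headingLevel(el[0]);
            if (level == 0) {
                // Body content belongs to the nearest heading above it;
                // content before the first heading is skipped.
                if (!open.isEmpty()) {
                    open.peek().content.append(el[1]);
                }
                continue;
            }
            // Close every open heading at the same or a deeper level.
            while (!open.isEmpty() && open.peek().level >= level) {
                open.pop();
            }
            Section section = new Section(level, el[1]);
            if (open.isEmpty()) {
                roots.add(section);
            } else {
                open.peek().children.add(section);
            }
            open.push(section);
        }
        return roots;
    }
}
```

With jsoup, the pairs could be filled from each sibling's tagName() and outerHtml(). One design consequence worth knowing: a heading that skips a level (an h3 directly under an h1) is attached to the nearest shallower heading instead of being dropped.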

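/**
     * Save the tree parsed from the HTML, together with its document id, by recursion
     *
     * @param treeList the nodes to save at this level
     * @param id the document id
     * @param parentId the id of the parent title ("0" for top-level titles)
     * @param token access token for the third-party platform
     * @return java.lang.String
     * @author Wangkuo
     * @date 2023/7/25 16:48
     */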
    private String dmeSave(List<DocumentContentVO> treeList, String id, String parentId,String token) {
        String dmeResult = null;
        for (DocumentContentVO documentContentVO : treeList) {
            if (documentContentVO != null) {
                String title = documentContentVO.getTitle();
                int sort = documentContentVO.getIndex();
                String content = documentContentVO.getContent();
                String url = "<third-party URL for creating the corresponding record>";
                JSONObject jsonObjectParam = this.paramJoint1(title, id, parentId, sort, content);
                dmeResult = DmeTestRequestUtil.getDmeResult(url, jsonObjectParam, token);
                List data = JSONObject.parseObject(dmeResult).getObject("data", List.class);
                if (data != null && !data.isEmpty()) {
                    JSONObject jsonObject = (JSONObject) data.get(0);
                    String parentIdNext = jsonObject.getString("id");
                    if (documentContentVO.getChildList() != null && documentContentVO.getChildList().size() > 0) {
                        dmeSave(documentContentVO.getChildList(), id, parentIdNext,token);
                    }
                }
            }
        }
        return dmeResult;
    }

/**
     * Build the request parameters; as before, the third-party API prescribes the parameter format
     * @param title
     * @param id
     * @param parentId
     * @param sort
     * @param content
     * @return com.alibaba.fastjson.JSONObject
     * @author Wangkuo
     * @date 2023/7/27 11:30
     */
    private JSONObject paramJoint1(String title, String id, String parentId, int sort, String content) {
        Map<String, Object> mapStr = new HashMap<>();
        Map<String, Object> paramsMap = new HashMap<>();
        if (id != null) {
            paramsMap.put("title", title);
            paramsMap.put("sort", sort);
            paramsMap.put("parentId", parentId);
            paramsMap.put("content", content);
            paramsMap.put("documentId", id);
        } else {
            paramsMap.put("title", title);
        }
        mapStr.put("params", paramsMap);
        return JSONObject.parseObject(JSON.toJSONString(mapStr));
    }

(3). Query the title and content tree of a document by documentId

/**
 * Query the tree structure of a Word document by documentId
 * @param reqJSON {"id":"525663360008593408"}
 * @return java.util.List<TitleTreeVO>
 * @author Wangkuo
 * @date 2023/7/27 11:32
 */
public List<TitleTreeVO> getTreeCon(JSONObject reqJSON) {
    String id = reqJSON.getString("id");
    List<TitleTreeVO> allTitleByDocId = this.getAllTitleByDocId(id);
    TitleTreeVO titleTreeVO = new TitleTreeVO();
    titleTreeVO.setId("0");
    this.getChild(titleTreeVO,allTitleByDocId);
    return titleTreeVO.getChildList();
}

/**
     * Fetch all titles of a document by its id (the returned collection has no parent-child nesting yet)
     * @param docId
     * @return java.util.List<TitleTreeVO>
     * @author Wangkuo
     * @date 2023/7/27 11:34
     */
    private List<TitleTreeVO> getAllTitleByDocId(String docId) {
        String url = "<third-party URL for querying the title table>";
        // Build the request parameters
        JSONObject docIdParam = getDocIdParam(docId);
        String token = DmeTestRequestUtil.getToken();
        String dmeResult = DmeTestRequestUtil.getDmeResult(url, docIdParam, token);
        JSONObject jsonObject = JSONObject.parseObject(dmeResult);
        List data = jsonObject.getObject("data", List.class);
        List<TitleTreeVO> titleList = new ArrayList<>();
        if (data != null && !data.isEmpty()) {
            for (Object title : data) {
                JSONObject titleJson = (JSONObject)title;
                TitleTreeVO titleTreeVO = new TitleTreeVO();
                titleTreeVO.setContent(titleJson.getString("content"));
                titleTreeVO.setTitle(titleJson.getString("title"));
                titleTreeVO.setId(titleJson.getString("id"));
                titleTreeVO.setIndex(Integer.parseInt(titleJson.getString("sort")));
                titleTreeVO.setDocumentId(titleJson.getString("documentId"));
                titleTreeVO.setParentId(titleJson.getString("parentId"));
                titleList.add(titleTreeVO);
            }
        }
        return titleList;
    }

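/**
     * Collect the child titles and content of every level by recursion
     * @param parentTitleTreeVO the node whose children are being collected
     * @param titleListOld the flat list of all titles in the document
     * @return TitleTreeVO
     * @author Wangkuo
     * @date 2023/7/27 11:42
     */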
    private TitleTreeVO getChild(TitleTreeVO parentTitleTreeVO,List<TitleTreeVO> titleListOld) {
        List<TitleTreeVO> titleList = new ArrayList<>();
        if (titleListOld != null && titleListOld.size()>0) {
            // Objects.equals avoids a NullPointerException when a stored row has no parentId
            List<TitleTreeVO> titleCollect = titleListOld.stream().filter(e -> Objects.equals(e.getParentId(), parentTitleTreeVO.getId())).collect(Collectors.toList());
            if(titleCollect.size()>0){
                for (TitleTreeVO title : titleCollect) {
                    TitleTreeVO titleTreeVO = new TitleTreeVO();
                    titleTreeVO.setIndex(title.getIndex());
                    titleTreeVO.setTitle(title.getTitle());
                    titleTreeVO.setId(title.getId());
                    titleTreeVO.setContent(title.getContent());
                    titleTreeVO.setDocumentId(title.getDocumentId());
                    titleTreeVO.setParentId(title.getParentId());
                    titleList.add(titleTreeVO);
                    this.getChild(titleTreeVO,titleListOld);
                }
            }
        }
        List<TitleTreeVO> titleSortList = titleList.stream().sorted(Comparator.comparing(TitleTreeVO::getIndex)).collect(Collectors.toList());
        parentTitleTreeVO.setChildList(titleSortList);
        return parentTitleTreeVO;
    }
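
The recursive getChild filters the whole title list once per node. For large documents, grouping the flat list by parentId first gives the same tree in a single pass. The following is a minimal sketch of that alternative, assuming a simplified Title type; TreeAssemblySketch, Title, and assemble are hypothetical names, and unlike getChild the sketch links the input objects directly instead of copying them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical single-pass assembly of a tree from rows that carry a parentId. */
final class TreeAssemblySketch {

    /** Simplified stand-in for TitleTreeVO with only the fields the assembly needs. */
    static final class Title {
        final String id;
        final String parentId;
        final int index;
        List<Title> childList = new ArrayList<>();

        Title(String id, String parentId, int index) {
            this.id = id;
            this.parentId = parentId;
            this.index = index;
        }
    }

    /** Groups rows by parentId (sorted by index), then hands each node its group. */
    static List<Title> assemble(List<Title> flat, String rootParentId) {
        Map<String, List<Title>> byParent = flat.stream()
                .filter(t -> t.parentId != null)  // groupingBy rejects null keys
                .sorted(Comparator.comparingInt((Title t) -> t.index))
                .collect(Collectors.groupingBy(t -> t.parentId));
        for (Title t : flat) {
            t.childList = byParent.getOrDefault(t.id, new ArrayList<>());
        }
        return byParent.getOrDefault(rootParentId, new ArrayList<>());
    }

    public static void main(String[] args) {
        List<Title> flat = new ArrayList<>();
        flat.add(new Title("1", "0", 2));
        flat.add(new Title("2", "0", 1));
        flat.add(new Title("3", "1", 1));
        List<Title> roots = assemble(flat, "0");
        System.out.println(roots.get(0).id + ", " + roots.get(1).id);
    }
}
```

main prints "2, 1": the top-level titles come back ordered by index, and title "3" ends up in the childList of title "1". Because the stream is sequential and sorted before grouping, each group keeps the index order.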