Implementing Document Extraction, Transformation, and Loading with Spring AI

Updated: 2025-04-07 14:05:49  Author: brother_four

In modern data processing, the ETL (Extract, Transform, Load) pipeline is a key concept. It lets us extract data from different sources, apply the necessary transformations, and load the result into a target storage system. This article shows how to build a simple ETL pipeline with Spring AI and Apache Tika, in particular how to use the spring-ai-tika-document-reader dependency to process and transform document data.
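The three stages map naturally onto Java's functional interfaces: a Supplier extracts, a Function transforms, and a Consumer loads. Spring AI's DocumentReader, DocumentTransformer, and DocumentWriter interfaces follow the same shape. The sketch below uses only the JDK to show the flow; the Doc record and the class name are hypothetical stand-ins for illustration, not Spring AI types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Plain-JDK sketch of an ETL pipeline (hypothetical names, not Spring AI classes). */
class EtlPipelineSketch {

    /** Hypothetical document: text content plus free-form metadata. */
    record Doc(String content, Map<String, Object> metadata) {}

    /** Runs extract, then transform, then load. */
    static void run(Supplier<List<Doc>> reader,
                    Function<List<Doc>, List<Doc>> transformer,
                    Consumer<List<Doc>> writer) {
        writer.accept(transformer.apply(reader.get()));
    }

    public static void main(String[] args) {
        // Stands in for the target storage system
        List<Doc> store = new ArrayList<>();

        // Extract: a fixed in-memory source instead of a real file
        Supplier<List<Doc>> reader =
                () -> List.of(new Doc("  Hello ETL  ", Map.of("source", "demo")));

        // Transform: trim surrounding whitespace, keep the metadata unchanged
        Function<List<Doc>, List<Doc>> trim = docs -> docs.stream()
                .map(d -> new Doc(d.content().trim(), d.metadata()))
                .toList();

        // Load: append to the in-memory store
        Consumer<List<Doc>> writer = store::addAll;

        run(reader, trim, writer);
        System.out.println(store.get(0).content()); // prints "Hello ETL"
    }
}
```

Keeping each stage behind a small interface means a reader, transformer, or writer can be swapped without touching the other two.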

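1. Framework overview

1.1 Introduction to Spring AI

Spring AI is a framework built on the Spring ecosystem that aims to simplify the integration of AI and machine learning models. It provides a rich set of tools and libraries that help developers build intelligent applications quickly. In addition to supporting common AI tasks, it offers integrations with a variety of data sources, which streamlines data processing.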

1.2 Introduction to Apache Tika

Apache Tika is a content analysis toolkit that extracts text and metadata from many document formats (PDF, Word, Excel, and others). Its simple API makes it easy to extract document content and convert it into structured data.

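1.3 The spring-ai-tika-document-reader dependency

spring-ai-tika-document-reader is an extension library provided by Spring AI. It wraps Apache Tika so that handling documents in a Spring application becomes simpler. With this dependency, document content can be extracted and converted into a format that Spring AI can process.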

2. Transforming documents

2.1 Adding the dependency

First, add the spring-ai-tika-document-reader dependency to pom.xml:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
    <version>1.0.0-M5</version>
</dependency>

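Next, read the file. The three endpoints below read from an uploaded file, a local file, and a URL respectively.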

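    // Imports needed by the three endpoints below (the methods live inside a Spring MVC controller class)
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.List;

    import org.springframework.ai.document.Document;
    import org.springframework.ai.reader.tika.TikaDocumentReader;
    import org.springframework.core.io.FileSystemResource;
    import org.springframework.core.io.InputStreamResource;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.UrlResource;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.multipart.MultipartFile;

    /**
     * Reads a file from an input stream, e.g. when the backend receives an upload from the frontend.
     * @param file the uploaded file
     * @return the extracted text
     * @throws IOException if the upload stream cannot be opened
     */
    @PostMapping("etl/read/multipart-file")
    public String readMultipartFile(@RequestParam MultipartFile file) throws IOException {
        // Wrap the upload's input stream as a Spring Resource
        Resource resource = new InputStreamResource(file.getInputStream());
        List<Document> documents = new TikaDocumentReader(resource)
                .get();
        return documents.get(0).getContent();
    }

    /**
     * Reads a file from the local file system.
     * @param filePath path of the local file, e.g. C:\\Users\\augjm\\Desktop\\note.txt
     * @return the extracted text
     */
    @GetMapping("etl/read/local-file")
    public String readFile(@RequestParam String filePath) {
        // Use the path from the request instead of a hard-coded one
        Resource resource = new FileSystemResource(filePath);
        List<Document> documents = new TikaDocumentReader(resource)
                .get();
        return documents.get(0).getContent();
    }

    /**
     * Reads a file from a network resource.
     * @param filePath URL of the remote file
     * @return the extracted text
     * @throws MalformedURLException if filePath is not a valid URL
     */
    @GetMapping("etl/read/url-file")
    public String readUrl(@RequestParam String filePath) throws MalformedURLException {
        // Resolve the URL as a Spring Resource
        Resource resource = new UrlResource(filePath);
        List<Document> documents = new TikaDocumentReader(resource)
                .get();
        return documents.get(0).getContent();
    }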

2.2 Transforming documents

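The Document object is the core object of the ETL pipeline. It holds a document's content and its metadata.

Content transformers:

TokenTextSplitter: splits content into smaller chunks, which speeds up responses and saves tokens during RAG.
ContentFormatTransformer: renders the metadata as key-value strings.

Metadata transformers:

SummaryMetadataEnricher: uses a large language model to summarize the document and adds a summary field to the metadata (stored under the key section_summary).
KeywordMetadataEnricher: uses a large language model to extract the document's keywords and adds a keywords field to the metadata (stored under the key excerpt_keywords).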
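To make the idea of a metadata enricher concrete, here is a minimal plain-JDK sketch. It is not the Spring AI implementation: a crude word-frequency count stands in for the large language model call, and the class name, method name, and the "keywords" key are hypothetical choices for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch of a keyword enricher; word frequency stands in for an LLM call. */
class MetadataEnricherSketch {

    /** Returns a copy of the metadata with a "keywords" entry added. */
    static Map<String, Object> enrich(String content, Map<String, Object> metadata, int maxKeywords) {
        // Count words of four or more characters, remembering first-seen order
        Map<String, Long> counts = Arrays.stream(content.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(w -> w.length() >= 4)
                .collect(Collectors.groupingBy(w -> w, LinkedHashMap::new, Collectors.counting()));

        // Keep the most frequent words; the sort is stable, so ties keep first-seen order
        List<String> keywords = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(maxKeywords)
                .map(Map.Entry::getKey)
                .toList();

        // Leave the original metadata untouched and add the new entry to a copy
        Map<String, Object> enriched = new HashMap<>(metadata);
        enriched.put("keywords", String.join(", ", keywords));
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, Object> enriched = enrich(
                "Tika reads files. Tika extracts text from files.",
                Map.of("source", "note.txt"),
                2);
        System.out.println(enriched.get("keywords")); // prints "tika, files"
    }
}
```

The point of the sketch is the shape of the operation: the content is left as it is, and only the metadata gains an extra entry.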

    // Additional imports: lombok.SneakyThrows, org.springframework.ai.transformer.splitter.TokenTextSplitter

    /**
     * Splits the text content into smaller chunks.
     * @param file the uploaded file
     * @return the text of each chunk
     */
    @SneakyThrows
    @PostMapping("etl/transform/split")
    public List<String> split(@RequestParam MultipartFile file) {
        // Read the file from the upload's input stream
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(new InputStreamResource(file.getInputStream()));
        // Split the text content into smaller chunks
        List<Document> splitDocuments = new TokenTextSplitter()
                .apply(tikaDocumentReader.get());
        return splitDocuments.stream().map(Document::getContent).toList();
    }

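In this example, TokenTextSplitter.apply splits the content of each Document into smaller chunks and returns a new list of Document objects. The split endpoint then returns the text of each chunk.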
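The chunking idea itself can be sketched without any framework. The version below is a simplified stand-in, not TokenTextSplitter's algorithm: it counts whitespace-separated words rather than model tokens, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of chunking by word count (a stand-in for token-based splitting). */
class ChunkingSketch {

    /** Splits text into chunks of at most maxWords whitespace-separated words. */
    static List<String> split(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to split
        }
        String[] words = trimmed.split("\\s+");
        // Walk the words in steps of maxWords; the last chunk may be shorter
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.asList(words).subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("one two three four five", 2));
        // prints "[one two, three four, five]"
    }
}
```

A real splitter would also try to break at sentence boundaries so that chunks stay readable; this sketch only shows the size limit.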

2.3 Storing documents

After the steps above, the document has been split into chunks, which can now be stored in a vector database.

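    /**
     * Embeds a file and stores it.
     *
     * @param file the file to embed
     * @return whether the operation succeeded
     * @throws IOException if the upload stream cannot be opened
     */
    @PostMapping("save/file/vectorStore")
    public Boolean saveFileVectorStore(@RequestParam MultipartFile file) throws IOException {
        // Read the file from the upload's input stream
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(new InputStreamResource(file.getInputStream()));
        // Split the text content into smaller chunks
        List<Document> splitDocuments = new TokenTextSplitter()
                .apply(tikaDocumentReader.get());
        // Store the chunks in the vector database; this step calls the embeddingModel automatically,
        // turning the text into vectors before saving. elasticVectorStore is assumed to be a
        // vector store field injected elsewhere in the controller.
        elasticVectorStore.add(splitDocuments);
        return true;
    }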
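What "embed on add" means can be illustrated with a small in-memory sketch. This is not how elasticVectorStore is implemented: the class name is hypothetical, and a toy letter-count function stands in for a real embedding model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** In-memory sketch of a store that embeds text when it is added. */
class VectorStoreSketch {

    /** A stored chunk together with its embedding vector. */
    record Entry(String text, float[] vector) {}

    private final Function<String, float[]> embedder; // stands in for an embedding model
    private final List<Entry> entries = new ArrayList<>();

    VectorStoreSketch(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    /** Embeds each chunk, then stores the text and its vector together. */
    void add(List<String> chunks) {
        for (String chunk : chunks) {
            entries.add(new Entry(chunk, embedder.apply(chunk)));
        }
    }

    /** Returns an unmodifiable snapshot of what has been stored. */
    List<Entry> entries() {
        return List.copyOf(entries);
    }

    /** Toy embedding: a 26-dimensional vector of counts for the letters a to z. */
    static float[] letterCounts(String text) {
        float[] vector = new float[26];
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                vector[c - 'a']++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        VectorStoreSketch store = new VectorStoreSketch(VectorStoreSketch::letterCounts);
        store.add(List.of("abc", "hello"));
        System.out.println(store.entries().size()); // prints "2"
    }
}
```

The caller only passes text; producing the vector is the store's responsibility, which is why the controller method above never touches the embedding model directly.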
