
Tips for Processing Datasets with Python

Updated: 2024-12-27 08:44:59 Author: engchina

Starting from loading data, this article walks step by step through formatting the data, saving it, and finally loading the processed result.

1. Import the required libraries
2. Load the pretraining dataset
3. View the first 5 samples of the dataset
4. Load the company fine-tuning dataset
5. Format the data
6. Format the data with a template
7. Generate the fine-tuning dataset
8. Save the processed data
9. Load the processed data
Summary

1. Import the required libraries

First, we import the Python libraries that will help us process the data:

import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

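Explanation:

jsonlines: reads and writes files in the JSON Lines format.

itertools: provides efficient tools for building and slicing iterators.

pandas: handles tabular data, such as Excel or CSV files.

pprint: pretty-prints data so that nested structures are easier to read.

datasets: the Hugging Face library for loading and processing datasets.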

2. Load the pretraining dataset

Next, we load a dataset of the kind used for pretraining. Here we use allenai/c4, a large English text corpus built from web crawl data.

pretrained_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

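Explanation:

load_dataset: loads a dataset.

"allenai/c4": the name of the dataset.

"en": the configuration name, which selects the English subset only.

split="train": loads only the training split.

streaming=True: reads the data lazily as it is iterated instead of downloading everything first, which suits very large datasets.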

3. View the first 5 samples of the dataset

The following code prints the first 5 samples of the dataset:

n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

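Explanation:

n = 5: the number of samples we want to view.

itertools.islice: takes the first 5 samples from the dataset without reading the rest.

for i in top_n: iterates over those 5 samples and prints each one.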
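To see why itertools.islice pairs well with streaming, the following minimal sketch replaces the real dataset with a hypothetical generator, fake_stream (it is not part of the datasets library). The generator records which items it actually produced, which shows that taking five samples pulls only five items from the source.

```python
import itertools

produced = []  # indices of the samples the stream actually generated


def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields samples lazily, forever."""
    i = 0
    while True:
        produced.append(i)
        yield {"text": f"sample {i}"}
        i += 1


n = 5
first_n = list(itertools.islice(fake_stream(), n))

for sample in first_n:
    print(sample)

# Only n items were pulled from the endless stream.
print(len(produced))
```

Because the stream is never turned into a full list, the same pattern works no matter how large the underlying data is.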

4. Load the company fine-tuning dataset

Suppose we have a file named lamini_docs.jsonl that stores questions and answers. We can load it with the following code:

filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

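Explanation:

pd.read_json: with lines=True, reads a JSON Lines file and converts it into a table (DataFrame).

instruction_dataset_df: as the last expression in a notebook cell, this displays the table; in a plain script, wrap it in print().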
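For readers unfamiliar with the format, JSON Lines stores one JSON object per line. The sketch below parses such content with only the standard library. The two records are made-up placeholders, not the contents of lamini_docs.jsonl, and read_jsonl is a hypothetical helper name.

```python
import io
import json

# Hypothetical file content: one JSON object per line.
sample_jsonl = (
    '{"question": "Example question?", "answer": "Example answer."}\n'
    '{"question": "Second question?", "answer": "Second answer."}\n'
)


def read_jsonl(handle):
    """Parse each non-empty line of a file-like object as one JSON record."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(io.StringIO(sample_jsonl))
print(len(records))
print(records[0]["question"])
```

Each record becomes one row of the table when the same content is read with pd.read_json(filename, lines=True).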

5. Format the data

We can concatenate a question and its answer into a single string for later processing:

examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text

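Explanation:

to_dict(): converts the table into a dictionary of the form {column: {row index: value}}.

examples["question"][0]: the text of the first question.

examples["answer"][0]: the text of the first answer.

text: the first question and answer joined into one string, with no separator between them.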
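With its default arguments, DataFrame.to_dict() returns a dictionary keyed by column name, where each value maps a row index to a cell. The sketch below builds that shape by hand with placeholder data, so it runs without pandas, and shows that plain concatenation leaves no boundary between question and answer.

```python
# Shape of DataFrame.to_dict() with default arguments: {column: {row_index: value}}.
# The contents are made-up placeholders.
examples_sketch = {
    "question": {0: "Example question?", 1: "Second question?"},
    "answer": {0: "Example answer.", 1: "Second answer."},
}

# Plain concatenation, as in the step above: the two strings run together.
joined = examples_sketch["question"][0] + examples_sketch["answer"][0]
print(joined)

# An explicit separator keeps the boundary visible.
joined_with_newline = (
    examples_sketch["question"][0] + "\n" + examples_sketch["answer"][0]
)
print(joined_with_newline)
```

The missing boundary is one reason a template is a better way to combine the two fields.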

6. Format the data with a template

We can use a template to give every question and answer the same clearly labeled layout:

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

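Explanation:

prompt_template_qa: defines a template with two parts, "Question" and "Answer".

format: inserts the question and answer into the template's {question} and {answer} placeholders.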
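As a small self-contained illustration, the sketch below wraps the template in a helper function. The name format_qa and the sample strings are hypothetical.

```python
PROMPT_TEMPLATE_QA = """### Question:
{question}

### Answer:
{answer}"""


def format_qa(question, answer):
    """Fill the question/answer template with one pair."""
    return PROMPT_TEMPLATE_QA.format(question=question, answer=answer)


formatted = format_qa("Example question?", "Example answer.")
print(formatted)
```

Keeping the template in one place means every example gets exactly the same section markers.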

7. Generate the fine-tuning dataset

We can format every question-answer pair and collect the results in lists:

# Question-only template. The original listing used prompt_template_q
# without defining it; this definition is an assumption that mirrors
# prompt_template_qa with the answer left blank.
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

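Explanation:

num_examples: the number of questions.

finetuning_dataset_text_only: stores each pair as a single formatted string under the key "text".

finetuning_dataset_question_answer: stores the formatted question prompt and the raw answer as separate fields.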
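The same loop can be sketched as a function over (question, answer) pairs. Everything here is illustrative: build_finetuning_records, the sample pairs, and the question-only template PROMPT_TEMPLATE_Q, which is assumed to be the QA template with the answer left blank.

```python
PROMPT_TEMPLATE_QA = """### Question:
{question}

### Answer:
{answer}"""

# Assumed question-only variant: same layout, answer left for the model to fill in.
PROMPT_TEMPLATE_Q = """### Question:
{question}

### Answer:"""


def build_finetuning_records(pairs):
    """Return (text_only, question_answer) record lists for the given pairs."""
    text_only = []
    question_answer = []
    for question, answer in pairs:
        text_only.append(
            {"text": PROMPT_TEMPLATE_QA.format(question=question, answer=answer)}
        )
        question_answer.append(
            {"question": PROMPT_TEMPLATE_Q.format(question=question), "answer": answer}
        )
    return text_only, question_answer


# Made-up placeholder pairs.
pairs = [
    ("Example question?", "Example answer."),
    ("Second question?", "Second answer."),
]
text_only, question_answer = build_finetuning_records(pairs)
print(text_only[0]["text"])
print(question_answer[0])
```

Iterating over pairs directly avoids index bookkeeping and keeps each question aligned with its answer.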

8. Save the processed data

We can save the processed data to a new file:

with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

Explanation:

jsonlines.open: opens a file for writing ('w' mode).

writer.write_all: writes every record in the list to the file, one JSON object per line.
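To make the file format concrete, here is a standard-library sketch of the same idea: write each record as one JSON line, then read the file back to confirm nothing was lost. It uses a temporary directory and placeholder records, and the helper names are hypothetical; it is not how jsonlines is implemented.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up placeholder records in the question/answer layout.
sample_records = [
    {"question": "### Question:\nExample question?\n\n### Answer:", "answer": "Example answer."},
    {"question": "### Question:\nSecond question?\n\n### Answer:", "answer": "Second answer."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed_sketch.jsonl")
    write_jsonl(out_path, sample_records)
    reloaded = load_jsonl(out_path)

print(reloaded == sample_records)
```

Newlines inside a prompt are escaped by json.dumps, so each record still occupies exactly one line of the file.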

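9. Load the processed data

Finally, we load the dataset for fine-tuning. Note that the code below loads the hosted dataset lamini/lamini_docs from the Hugging Face Hub rather than the local file saved in step 8. To load the local file instead, call load_dataset("json", data_files="lamini_docs_processed.jsonl").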

finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

Explanation:

load_dataset: loads the dataset with the given name from the Hugging Face Hub.

print(finetuning_dataset): prints a summary of the loaded dataset.

Summary

In this article, we learned how to load, process, and save datasets with Python. We started with simple data loading, then moved on to formatting the data and saving it, and finally loaded the processed dataset.
