
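Exploring the Python datatable Library for Large Datasets and Multi-Core Data Processing

Updated: 2024-01-30 11:01:51  Author: 程序員小寒

This article introduces the Python datatable library and shows how to use it for large datasets and multi-core data processing. It is intended as a practical reference for readers who need one.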

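The Python datatable library

This article covers datatable, a high-performance Python library.

datatable is a Python library for data processing and statistical analysis. It is similar to pandas but focuses on large datasets and multi-core data processing. It was originally developed by H2O.ai, with the design goal of handling very large datasets efficiently, in particular datasets too large to fit in the memory of a single machine.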

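Core features

Performance: datatable is optimized for performance, especially on large datasets. Its underlying code is written in C++ and uses multithreading to speed up data processing tasks.
Memory efficiency: by using memory-mapped files and other techniques, datatable can work with datasets larger than physical memory.
Flexible data processing: it supports a range of operations, including filtering, sorting, grouping, and joining.
Compatibility with pandas: a datatable Frame can be converted to a pandas DataFrame, so the pandas API can be used for further analysis and processing.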

Installation

The library can be installed directly with pip.

pip install datatable

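Reading data

The dataset used here is a loan dataset. The file contains 2.26 million rows and 145 columns.

Let's load the data into a Frame object. The Frame is the basic unit of analysis in datatable. It is the same concept as a pandas DataFrame or a SQL table: data arranged in a two-dimensional array of rows and columns.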

import pandas as pd
import datatable as dt

%%time
# %%time is a Jupyter cell magic, so this must start a new cell, separate from the imports.
datatable_df = dt.fread("test/accepted_2007_to_2018Q4.csv")

In the author's run, the read took 34.7 seconds.

The fread() function in datatable can read data from several kinds of sources, including files and URLs.

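Now let's measure how long pandas takes to read the same file.

%%time
pandas_df = pd.read_csv("test/accepted_2007_to_2018Q4.csv")

In the author's run, this took 2 minutes 30 seconds.

In this comparison, datatable read the file roughly four times faster than pandas, a clear advantage when loading large datasets.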
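
The `%%time` magic only works inside Jupyter. In a plain Python script, a similar wall-clock measurement can be sketched with `time.perf_counter` from the standard library. The helper name `timed` is hypothetical, and the summation below is only a stand-in workload for a real call such as `dt.fread(...)` or `pd.read_csv(...)`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with e.g. timed(dt.fread, "path/to/file.csv").
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f} s")
```

A single run like this is sensitive to disk caching and background load, so repeating the measurement a few times gives a fairer comparison.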

Frame conversion

An existing Frame can also be converted to a NumPy array or a pandas DataFrame, as shown below:

numpy_df = datatable_df.to_numpy()    # NumPy array
pandas_df = datatable_df.to_pandas()  # pandas DataFrame

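Frame properties

Let's look at some basic properties of a datatable Frame. They are similar to the corresponding pandas properties.

print(datatable_df.shape)       # (number of rows, number of columns)
print(datatable_df.names[:5])   # names of the first 5 columns
print(datatable_df.stypes[:5])  # storage types of the first 5 columns

The head() method returns the first n rows.

datatable_df.head(10)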

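Summary statistics

Computing summary statistics in pandas can be a memory-intensive process; according to the author, this is no longer a concern with datatable.

datatable can compute per-column summary statistics. For example, the mean of every column:

datatable_df.mean()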

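Data processing

The following code selects all rows and the funded_amnt column from the dataset.

datatable_df[:, 'funded_amnt']

Sort by the funded_amnt_inv column. sort() returns a new, sorted Frame.

datatable_df.sort('funded_amnt_inv')

Delete the column named member_id.

del datatable_df[:, 'member_id']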

Like pandas, datatable also supports groupby.

Let's see how to get the mean of the funded_amnt column grouped by the grade column.

datatable_df[:, dt.mean(dt.f.funded_amnt), dt.by(dt.f.grade)]

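The syntax for filtering rows is very similar to that of groupby. Let's keep only the rows where loan_amnt is greater than funded_amnt and return their loan_amnt values.

datatable_df[dt.f.loan_amnt > dt.f.funded_amnt, "loan_amnt"]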


