Pandas merge合并操作的實現(xiàn)
Pandas 提供的 merge() 函數(shù)能夠進(jìn)行高效的合并操作,這與 SQL 關(guān)系型數(shù)據(jù)庫的 join用法非常相似。從字面意思上不難理解,merge 翻譯為“合并”,指的是將兩個 DataFrame 數(shù)據(jù)表按照指定的規(guī)則進(jìn)行連接,最后拼接成一個新的 DataFrame 數(shù)據(jù)表。
merge() 函數(shù)的法格式如下:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True)
參數(shù)說明,如下表所示:
參數(shù)名稱 | 說明 |
---|---|
left/right | 兩個不同的 DataFrame 對象。 |
on | 指定用于連接的鍵(即列標(biāo)簽的名字),該鍵必須同時存在于左右兩個 DataFrame 中,如果沒有指定,并且其他參數(shù)也未指定, 那么將會以兩個 DataFrame 的列名交集做為連接鍵。 |
left_on | 指定左側(cè) DataFrame 中作連接鍵的列名。該參數(shù)在左、右列標(biāo)簽名不相同,但表達(dá)的含義相同時非常有用。 |
right_on | 指定左側(cè) DataFrame 中作連接鍵的列名。 |
left_index | 布爾參數(shù),默認(rèn)為 False。如果為 True 則使用左側(cè) DataFrame 的行索引作為連接鍵,若 DataFrame 具有多層 索引(MultiIndex),則層的數(shù)量必須與連接鍵的數(shù)量相等。 |
right_index | 布爾參數(shù),默認(rèn)為 False。如果為 True 則使用左側(cè) DataFrame 的行索引作為連接鍵。 |
how | 要執(zhí)行的合并類型,從 {‘left’, ‘right’, ‘outer’, ‘inner’} 中取值,默認(rèn)為“inner”內(nèi)連接。 |
sort | 布爾值參數(shù),默認(rèn)為True,它會將合并后的數(shù)據(jù)進(jìn)行排序;若設(shè)置為 False,則按照 how 給定的參數(shù)值進(jìn)行排序。 |
suffixes | 字符串組成的元組。當(dāng)左右 DataFrame 存在相同列名時,通過該參數(shù)可以在相同的列名后附加后綴名,默認(rèn)為(‘_x’,‘_y’)。 |
copy | 默認(rèn)為 True,表示對數(shù)據(jù)進(jìn)行復(fù)制。 |
注意:Pandas 庫的 merge() 支持各種內(nèi)外連接,與其相似的還有 join() 函數(shù)(默認(rèn)為左連接)。
準(zhǔn)備兩個不同的DataFrame,這里從文件中讀?。?/p>
文件empdata.csv:
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10
文件deptdata.csv:
DEPTNO,DNAME,LOC
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
下面創(chuàng)建兩個不同的 DataFrame,然后對它們進(jìn)行合并操作:
import pandas as pd import numpy as np df_emp = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') print("員工表中的df數(shù)據(jù):\n",df_emp) print("部門表中的df數(shù)據(jù):\n",df_dept)
輸出如下:
員工表中的df數(shù)據(jù):
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
部門表中的df數(shù)據(jù):
DEPTNO DNAME LOC
0 10 ACCOUNTING NEW YORK
1 20 RESEARCH DALLAS
2 30 SALES CHICAGO
3 40 OPERATIONS BOSTON
1) 在單個鍵上進(jìn)行合并操作
通過 on 參數(shù)指定一個連接鍵,然后對上述 DataFrame 進(jìn)行合并操作:
import pandas as pd import numpy as np df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 通過on指定合并的主鍵 print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO"))
輸出結(jié)果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO[14 rows x 10 columns]
2) 在多個鍵上進(jìn)行合并操作
下面示例,指定多個鍵來合并上述兩個 DataFrame 對象:(這里使用自定義數(shù)據(jù)了)
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4], 'Name': ['Smith', 'Maiki', 'Hunter', 'Hilen'], 'subject_id':['sub1','sub2','sub4','sub6']}) right = pd.DataFrame({ 'id':[1,2,3,4], 'Name': ['Bill', 'Lucy', 'Jack', 'Mike'], 'subject_id':['sub2','sub4','sub3','sub6']}) print(pd.merge(left,right,on=['id','subject_id']))
輸出結(jié)果:
id Name_x subject_id Name_y
0 4 Hilen sub6 Mike
使用how參數(shù)合并
通過how
參數(shù)可以確定 DataFrame 中要包含哪些鍵,如果在左表、右表都不存的鍵,那么合并后該鍵對應(yīng)的值為 NaN。為了便于大家學(xué)習(xí),我們將 how 參數(shù)和與其等價的 SQL 語句做了總結(jié):
Merge方法 | 等效 SQL | 描述 |
---|---|---|
left | LEFT OUTER JOIN | 使用左側(cè)對象的key |
right | RIGHT OUTER JOIN | 使用右側(cè)對象的key |
outer | FULL OUTER JOIN | 使用左右兩側(cè)所有key的并集 |
inner | INNER JOIN | 使用左右兩側(cè)key的交集 |
1) left join
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 為了演示,這里將雇員7369的部門刪除了,運(yùn)行如下: print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="left"))
輸出結(jié)果:能發(fā)現(xiàn)7369的DEPTNO,DNAME,LOC都為NaN
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
4 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
5 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
6 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
9 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
10 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
11 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
12 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
13 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK[14 rows x 10 columns]
2) right join
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部門40因為沒有員工,所以對應(yīng)員工的信息全部為NaN print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="right"))
輸出結(jié)果:(部門40因為沒有員工,所以對應(yīng)員工的信息全部為NaN)
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782.0 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839.0 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934.0 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369.0 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566.0 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788.0 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876.0 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902.0 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521.0 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698.0 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900.0 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO
14 NaN NaN NaN NaN ... NaN 40 OPERATIONS BOSTON[15 rows x 10 columns]
3) outer join(并集)
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部門40因為沒有員工,所以對應(yīng)員工的信息全部為NaN,又因為7369沒對應(yīng)的部門,所以部門信息全部為Nan print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="outer"))
輸出結(jié)果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369.0 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521.0 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
4 7698.0 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
5 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
6 7900.0 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
7 7566.0 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
8 7788.0 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
9 7876.0 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
10 7902.0 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
11 7782.0 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
12 7839.0 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
13 7934.0 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK
14 NaN NaN NaN NaN ... NaN 40.0 OPERATIONS BOSTON[15 rows x 10 columns]
4) inner join(交集)
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部門40因為沒有員工,,又因為7369沒對應(yīng)的部門,所以部門40和員工7369沒顯示 print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="inner"))
輸出結(jié)果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
1 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
2 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
3 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
4 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
5 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
6 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
9 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
10 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
11 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
12 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK
注意:當(dāng) a 與 b 進(jìn)行內(nèi)連操作時 a.join(b) 不等于 b.join(a)。
5)使用join函數(shù)
如果左右兩個表的主鍵名相同,可以使用join函數(shù)。
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') print(df_emp_left.join(df_dept_right,how="outer",rsuffix='1'))
運(yùn)行結(jié)果:
EMPNO ENAME JOB MGR ... DEPTNO DEPTNO1 DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN 10.0 ACCOUNTING NEW YORK
1 7499 ALLEN SALESMAN 7698.0 ... 30.0 20.0 RESEARCH DALLAS
2 7521 WARD SALESMAN 7698.0 ... 30.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... 20.0 40.0 OPERATIONS BOSTON
4 7654 MARTIN SALESMAN 7698.0 ... 30.0 NaN NaN NaN
5 7698 BLAKE MANAGER 7839.0 ... 30.0 NaN NaN NaN
6 7782 CLARK MANAGER 7839.0 ... 10.0 NaN NaN NaN
7 7788 SCOTT ANALYST 7566.0 ... 20.0 NaN NaN NaN
8 7839 KING PRESIDENT NaN ... 10.0 NaN NaN NaN
9 7844 TURNER SALESMAN 7698.0 ... 30.0 NaN NaN NaN
10 7876 ADAMS CLERK 7788.0 ... 20.0 NaN NaN NaN
11 7900 JAMES CLERK 7698.0 ... 30.0 NaN NaN NaN
12 7902 FORD ANALYST 7566.0 ... 20.0 NaN NaN NaN
13 7934 MILLER CLERK 7782.0 ... 10.0 NaN NaN NaN
到此這篇關(guān)于Pandas merge合并操作的實現(xiàn)的文章就介紹到這了,更多相關(guān)Pandas merge合并內(nèi)容請搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持腳本之家!
- pandas數(shù)據(jù)合并與重塑之merge詳解
- Pandas數(shù)據(jù)集的合并與連接merge()方法
- Python?pandas數(shù)據(jù)合并merge函數(shù)用法詳解
- Pandas merge合并兩個DataFram的實現(xiàn)
- Python?Pandas數(shù)據(jù)合并pd.merge用法詳解
- pandas中DataFrame數(shù)據(jù)合并連接(merge、join、concat)
- Pandas 連接合并函數(shù)merge()詳解
- 在Pandas中DataFrame數(shù)據(jù)合并,連接(concat,merge,join)的實例
相關(guān)文章
Python如何用NumPy讀取和保存點云數(shù)據(jù)
這篇文章主要介紹了Python如何用NumPy讀取和保存點云數(shù)據(jù),文章圍繞主題展開詳細(xì)的內(nèi)容介紹,具有一定的參考價值,需要的小伙伴可以參考一下2022-08-08詳細(xì)聊一聊為什么Python沒有main函數(shù)
相信很多初學(xué)python的人看代碼的時候都會先找一下main()方法,從main往下看,但事實上python中是沒有你理解中的“main()”方法的,下面這篇文章主要給大家介紹了關(guān)于為什么Python沒有main函數(shù)的相關(guān)資料,需要的朋友可以參考下2023-03-03如何在mac環(huán)境中用python處理protobuf
這篇文章主要介紹了如何在mac環(huán)境中用python處理protobuf,文中通過示例代碼介紹的非常詳細(xì),對大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價值,需要的朋友可以參考下2019-12-12Python使用Yagmail庫實現(xiàn)自動化郵件營銷
在數(shù)字營銷領(lǐng)域,自動化郵件營銷是一種高效、低成本的方式,能夠幫助企業(yè)與客戶保持溝通,提升品牌忠誠度,而Yagmail是一個簡潔且功能強(qiáng)大的Python庫,可以大大簡化郵件發(fā)送的過程,本文將詳細(xì)介紹如何使用Yagmail庫來實現(xiàn)自動化郵件營銷,需要的朋友可以參考下2024-12-12