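A Perk for Singles? Scraping Matchmaking Profiles from a Dating Site with Python
Target URL: https://www.csflhjw.com/zhenghun/34.html?page=1
I. Opening the page
Right-click the page and choose Inspect. The highlighted area shows the matchmaking profile of a Ms. Wen. Because her details are already present in the HTML shown there, we can conclude that the page is loaded synchronously rather than filled in by a later AJAX request.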
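As a rough illustration of that check, the sketch below tests whether text visible in the browser already appears in the raw HTML. The function name and both sample snippets are hypothetical stand-ins, not responses from the real site.

```python
def loaded_synchronously(page_source: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML."""
    return visible_text in page_source


# Hypothetical snippets standing in for real response bodies.
static_page = '<div class="e"><h2>Ms. Wen</h2></div>'
ajax_page = '<div id="app"></div><script src="/bundle.js"></script>'

print(loaded_synchronously(static_page, "Ms. Wen"))  # True
print(loaded_synchronously(ajax_page, "Ms. Wen"))    # False
```

If the check fails for a real page, the data most likely arrives through a separate request, and the Network tab is the place to look for it.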
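Click the Elements tab and locate the image. The highlighted area contains the URL of this woman's profile page and the address of her photo.
The profile URL is incomplete (it is a relative path), so the code will have to join it with the site's domain later. Next, let's see how the URL changes when we turn the page.
Click page 2:
https://www.csflhjw.com/zhenghun/34.html?page=2
Click page 3:
https://www.csflhjw.com/zhenghun/34.html?page=3
Only the page number at the end changes.
So we can use a for loop with string formatting to generate the URL of each page. There are 10 pages in total.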
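A minimal sketch of that loop is shown below. The URL pattern and the page count of 10 come from the steps above; the names `LIST_URL` and `page_urls` are only illustrative.

```python
LIST_URL = "https://www.csflhjw.com/zhenghun/34.html?page={}"


def page_urls(first=1, last=10):
    """Build the listing URL for every page from first to last, inclusive."""
    return [LIST_URL.format(number) for number in range(first, last + 1)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://www.csflhjw.com/zhenghun/34.html?page=1
print(urls[-1])    # https://www.csflhjw.com/zhenghun/34.html?page=10
```

Note that `range(1, 11)` stops before 11, which is why the upper bound is one more than the last page.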
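II. Code walkthrough
1. Get the URLs of all the profiles on a listing page. I won't go into detail about the XPath expressions here.
2. Build the full URL of each profile by prepending the site's domain.
3. Open one profile URL and check it the same way as before. It is also loaded synchronously.
4. Extract each field from the profile page's HTML with XPath. Print each one as you go and filter out whatever you don't need.
5. Finally, save the results to a file.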
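Steps 2, 4 and 5 can be sketched with the standard library alone, as below. This is only an illustration under stated assumptions: the sample href, the field texts, the default value and the helper names are all hypothetical, and `urljoin` is used here in place of the plain string concatenation in the full script.

```python
import csv
import io
import re
from urllib.parse import urljoin

SITE = "https://www.csflhjw.com"


def absolute_url(href):
    """Step 2: turn a relative profile link into a full URL."""
    return urljoin(SITE, href)


def field_value(text, default="no preference"):
    """Step 4: keep what follows the first colon (ASCII or full-width)."""
    parts = re.split(r"[::]", text, maxsplit=1)
    value = parts[1].strip() if len(parts) == 2 else ""
    return value or default


def rows_to_csv(rows, header):
    """Step 5: write the header once, then one line per profile."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inputs, not taken from the real site.
print(absolute_url("/zhenghun/info/1234.html"))
print(field_value("Education:Bachelor"))   # Bachelor
print(field_value("Preferred age:"))       # no preference
print(rows_to_csv([["Ms. Wen", "Bachelor"]], ["name", "education"]))
```

Writing to an in-memory buffer keeps the sketch free of side effects. A real run would pass an open file object to `csv.writer` instead.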
Printed output: as the script runs, it prints the name of each profile once that profile has been saved.
III. Complete code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import os
import re

import requests
from lxml import etree

BASE_URL = 'https://www.csflhjw.com'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/87.0.4280.88 Safari/537.36'
}
OUT_DIR = './profile_data'
OUT_FILE = os.path.join(OUT_DIR, 'profiles.csv')
CSV_HEADER = ['Name', 'Education', 'Occupation', 'Marital status', 'Children', 'Home ownership',
              'Workplace', 'Preferred age', 'Preferred city', 'Partner requirements',
              'Personal statement', 'Photo URL']

# XPath prefixes shared by several fields on a profile page
INFO = '//div[@class="team-info"]/div[@class="team-e"]'
PREF = '/html/body/div[4]/div/div[1]/div[2]/div[2]/div[2]'


def get_html(url):
    """Request a page and parse the response into an lxml tree."""
    response = requests.get(url, headers=HEADERS).content.decode()
    return etree.HTML(response)


def after_colon(text, default=''):
    """Return the text after the first colon (ASCII or full-width), or default if it is empty."""
    parts = re.split('[::]', text, maxsplit=1)
    value = parts[1].strip() if len(parts) == 2 else ''
    return value or default


def parse_profile(html_str):
    """Extract the fields of one profile page, in CSV_HEADER order."""
    img_url = BASE_URL + '/' + html_str.xpath(
        '/html/body/div[4]/div/div[1]/div[2]/div[1]/div[1]/img/@src')[0]
    name = html_str.xpath(INFO + '/h2/text()')[0]
    xueli = after_colon(html_str.xpath(INFO + '/p[1]/text()')[0])           # education
    job = after_colon(html_str.xpath(INFO + '/p[2]/text()')[0])
    marital_status = after_colon(html_str.xpath(INFO + '/p[3]/text()')[0])
    is_child = after_colon(html_str.xpath(INFO + '/p[4]/text()')[0])
    home = after_colon(html_str.xpath(INFO + '/p[5]/text()')[0])
    workplace = after_colon(html_str.xpath(INFO + '/p[6]/text()')[0])
    # Partner preferences: fall back to a default when the field is empty
    zeo_age = after_colon(html_str.xpath(PREF + '/p[1]/span[1]/text()')[0], 'No requirement')
    zeo_address = after_colon(html_str.xpath(PREF + '/p[1]/span[2]/text()')[0], 'No requirement')
    requ = after_colon(html_str.xpath(PREF + '/p[2]/span/text()')[0], 'No requirement')
    # Personal statement: strip spaces and non-breaking spaces, or use 'None' if missing
    monologue = html_str.xpath('//div[@class="hunyin-1-3"]/p/text()')
    monologue = monologue[0].replace(' ', '').replace('\xa0', '') if monologue else 'None'
    return [name, xueli, job, marital_status, is_child, home, workplace,
            zeo_age, zeo_address, requ, monologue, img_url]


def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    # Open the file once and write the header once, before the loops, so that
    # no rows are lost by reopening the file in 'w' mode.
    # utf-8-sig replaces the gbk-with-utf-8-fallback scheme, which could mix
    # two encodings in a single file.
    with open(OUT_FILE, 'w', newline='', encoding='utf-8-sig') as file_csv:
        csv_writer = csv.writer(file_csv, delimiter=',')
        csv_writer.writerow(CSV_HEADER)
        for i in range(1, 11):
            start_url = BASE_URL + '/zhenghun/34.html?page={}'.format(i)
            # Parse the listing page and collect the relative profile links
            info_urls = get_html(start_url).xpath('//div[@class="e"]/div[@class="e-img"]/a/@href')
            for info_url in info_urls:
                # Build the full profile URL, request it, and extract the fields
                row = parse_profile(get_html(BASE_URL + info_url))
                csv_writer.writerow(row)
                print('*** Saved profile: {}'.format(row[0]))


if __name__ == '__main__':
    main()