
A simple way to parse HTML tables in Python

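 Updated: 2015-06-15 14:48:50   Author: 小卒過河
This article introduces a simple way to parse HTML tables in Python, using the libxml2dom module to work with the elements of an HTML page. Readers who need it can use it as a reference.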

This example shows a simple way to parse HTML tables in Python. It is shared here for reference; the details are as follows:

The code depends on libxml2dom, so make sure it is installed first! Import the code into your script and call the parse_tables() function.

1. source = a string containing the source code. You can pass in just the table or the entire page.

2. headers = a list of ints OR a list of strings.
If the headers are ints, the table is treated as having no header row: list the 0-based indexes of the cells in each row from which you want to extract data.
If the headers are strings, the table is treated as having header cells (th tags): the function pulls the information from the columns whose header text matches.

3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the table you want to parse is the third one, pass in 2 here.

The function returns a list of lists. Each inner list contains the information parsed from one row.
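
To make the two header modes concrete, here is a minimal sketch that applies the same selection rules to a table that has already been reduced to a list of rows. The helper name select_columns is hypothetical and is not part of the article's code; it only illustrates how int headers and string headers are interpreted.

```python
def select_columns(rows, headers):
    """Pick cells from rows, where rows is a list of lists of cell text.

    Hypothetical helper for illustration only; it is not the article's code.
    """
    if not headers:
        return None
    if all(isinstance(h, int) for h in headers):
        # Int mode: every row is data; indexes a short row lacks are skipped.
        return [[row[i] for i in headers if 0 <= i < len(row)] for row in rows]
    if all(isinstance(h, str) for h in headers):
        # String mode: the first row holds the header names and is not returned.
        header_row, data_rows = rows[0], rows[1:]
        wanted = [i for i, name in enumerate(header_row) if name in headers]
        return [[row[i] for i in wanted] for row in data_rows]
    # Mixed or unsupported header types.
    return None


rows = [["Name", "Age", "City"], ["Ann", "31", "Oslo"], ["Bob", "27", "Rome"]]
print(select_columns(rows, ["City", "Name"]))  # columns come back in table order
print(select_columns(rows, [0, 2]))            # header row is kept in int mode
```

In string mode the columns are returned in the order they appear in the table, not in the order given in headers, which matches how the code below builds its header index.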

The code is as follows:

#The goal of table parser is to get specific information from specific
#columns in a table.
#Input: source code from a typical website
#Arguments: a list of headers the user wants to return
#Output: A list of lists of the data in each row
import libxml2dom
def parse_tables(source, headers, table_index):
  """parse_tables(string source, list headers, table_index)
    headers may be a list of strings if the table has headers defined or
    headers may be a list of ints if no headers defined this will get data
    from the rows index.
    This method returns a list of lists
    """
  #Determine from the first element whether the headers list holds
  #strings or ints; an empty list cannot be classified
  if not headers:
    return None
  print('Printing headers: %s' % (headers,))
  #route to the correct function
  #if the header type is int
  if isinstance(headers[0], int):
    #run no_header function
    return no_header(source, headers, table_index)
  #if the header type is string
  elif isinstance(headers[0], str):
    #run the header_given function
    return header_given(source, headers, table_index)
  else:
    #return None if the headers are neither ints nor strings
    return None
#This function takes in the source code of the whole page, a string list of
#headers, and the index number of the table on the page. It returns a list of
#lists with the scraped information
def header_given(source, headers, table_index):
  #initiate a list to hold the return list
  return_list = []
  #initiate a list to hold the index numbers of the data in the rows
  header_index = []
  #get a document object out of the source code
  doc = libxml2dom.parseString(source,html=1)
  #get the tables from the document
  tables = doc.getElementsByTagName('table')
  try:
    #try to get focus on the desired table
    main_table = tables[table_index]
  except IndexError:
    #if the table doesn't exist then return an error
    return ['The table index was not found']
  #get a list of headers in the table
  table_headers = main_table.getElementsByTagName('th')
  #need a sentry value for the header loop
  loop_sentry = 0
  #loop through each header looking for matches
  for header in table_headers:
    #if the header is in the desired headers list 
    if header.textContent in headers:
      #add it to the header_index
      header_index.append(loop_sentry)
    #add one to the loop_sentry
    loop_sentry+=1
  #get the rows from the table
  rows = main_table.getElementsByTagName('tr')
  #sentry value detecting if the first row is being viewed
  row_sentry = 0
  #loop through the rows in the table, skipping the first row
  for row in rows:
    #if row_sentry is 0 this is our first row
    if row_sentry == 0:
      #make the row_sentry not 0
      row_sentry = 1337
      continue
    #get all cells from the current row
    cells = row.getElementsByTagName('td')
    #initiate a list to append into the return_list
    cell_list = []
    #iterate through all of the header indexes
    for i in header_index:
      #append the cells text content to the cell_list
      cell_list.append(cells[i].textContent)
    #append the cell_list to the return_list
    return_list.append(cell_list)
  #return the return_list
  return return_list
#This function takes in the source code of the whole page, an int list of
#headers indicating the index number of each needed item, and the index number
#of the table on the page. It returns a list of lists with the scraped info
def no_header(source, headers, table_index):
  #initiate a list to hold the return list
  return_list = []
  #get a document object out of the source code
  doc = libxml2dom.parseString(source, html=1)
  #get the tables from document
  tables = doc.getElementsByTagName('table')
  try:
    #try to get focus on the desired table
    main_table = tables[table_index]
  except IndexError:
    #if the table doesn't exist then return an error
    return ['The table index was not found']
  #get all of the rows out of the main_table
  rows = main_table.getElementsByTagName('tr')
  #loop through each row
  for row in rows:
    #get all cells from the current row
    cells = row.getElementsByTagName('td')
    #initiate a list to append into the return_list
    cell_list = []
    #loop through the list of desired headers
    for i in headers:
      try:
        #try to add text from the cell into the cell_list
        cell_list.append(cells[i].textContent)
      except IndexError:
        #the row has no cell at this index, so just continue
        continue
    #append the data scraped into the return_list    
    return_list.append(cell_list)
  #return the return list
  return return_list
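
libxml2dom is an older library and may not be available in every environment. As a rough sketch of the same row-and-cell extraction using only the Python 3 standard library, the following uses html.parser instead of libxml2dom. The names TableCollector and table_rows are assumptions made for this example, and the sketch assumes tables are not nested and that td, th, and tr tags are explicitly closed.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each table as a list of rows; each row is a list of cell text.

    Illustrative stand-in for the libxml2dom calls, not the article's method.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>, in document order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_rows(source, table_index):
    """Return the rows of the table at table_index, or None if it is missing."""
    parser = TableCollector()
    parser.feed(source)
    parser.close()
    try:
        return parser.tables[table_index]
    except IndexError:
        return None


html = ("<table><tr><th>Name</th><th>Age</th></tr>"
        "<tr><td>Ann</td><td>31</td></tr></table>")
print(table_rows(html, 0))
```

Unlike the code above, this sketch returns header cells and data cells together in one list per row, so picking columns by name or index is left to the caller.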

I hope this article helps with your Python programming.
