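Using Django with GAE Python: Fetching Full Page Content from Multiple Sites in the Background

Updated: 2016-02-17 12:59:35   Submitted by: mdxy-dxy

This article describes how to use Django on Google App Engine (Python) to fetch the full text of pages from several sites in the background.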

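I have long wanted to build a platform that filters out high-quality articles and blogs for me. I named it Moven and split its implementation into three stages:
1. Downloader: download a given URL and pass the content to the Analyser. This is the simplest place to start.
2. Analyser: filter and simplify the received content with regular expressions, XPath, or BeautifulSoup/lxml. This part is not too hard either.
3. Smart Crawler: collect links to high-quality articles. This is the hardest part: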

The crawler itself can be built quickly on top of the Scrapy framework,
but deciding whether the article behind a link is high quality requires a fairly complex algorithm.
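The first two stages can be sketched as a small pipeline. This is only an illustration in modern Python 3 using the standard library's `html.parser`, not the BeautifulSoup-based code the app actually uses; the names `DivTextExtractor`, `analyse`, and `run_pipeline` are hypothetical, and the downloader is passed in as a plain callable so the sketch runs without network access.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while we are inside a matching <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1  # count nested <div>s so we know when the match ends

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def analyse(html, target_class="post-content"):
    """Analyser stage: reduce a full HTML page to the text of the post body."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    return " ".join(parser.chunks)


def run_pipeline(url, download, target_class="post-content"):
    """Downloader stage followed by Analyser stage; `download` maps a URL to HTML."""
    return analyse(download(url), target_class)
```

Keeping the downloader as an argument means the Analyser can be exercised against saved HTML before any real site is contacted.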

For now I am starting with the Downloader and the Analyser. I recently set up l2z story, and there are also Z Life, Z Life@Sina, and her blog. As an exercise in building the Downloader and Analyser, I wrote this app to watch those four sites and sync their content to this site:

http://l2zstory.appspot.com

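App features
Apart from the black navigation bar at the top and the About This Site section on the far right, all content on the site is pulled automatically from other sites.
In principle, any blog or website address can be added. Since this is L2Z Story, only four sites are included.
The key property: as long as the site owners keep updating, this site keeps itself alive. That is the power of laziness.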

It is worth mentioning that the Content menu is generated on the client with JavaScript, which saves server resources.

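The app fetches whole HTML pages, so for sites whose feeds do not publish full text, it can retrieve the text the feed leaves out.
Loading takes a long time because the program visits each page that lacks full-text output and fetches the complete article list, author information, update times, and full article text. Please be patient when opening it. The next step is to add data storage, which will make it faster.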

Technical preparation

Front end:

1. CSS: following the principle of simplicity first, Twitter's bootstrap.css meets most of my needs. I especially like its Grid System.
2. JavaScript: jQuery, of course. I have loved it since I used it in my first small project. The dynamic table of contents is generated quickly with jQuery.
bootstrap-dropdown.js is also used, to go with bootstrap.css.

Server:

The app has two versions:
One runs on my own Apache server. My connection is ADSL, so the IP address keeps changing, and this version is basically only for testing on my local network. It is pure Django.
The other runs on Google App Engine at http://l2zstory.appspot.com. It took me a lot of effort to get Django set up on GAE.

For details, see: Using Django with Google App Engine GAE: l2Z Story Setup-Step 1 http://blog.sina.com.cn/s/blog_6266e57b01011mjk.html

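Back end:

The main language is Python. No explanation needed: I have not left it since I first met it.

The main modules used are:

1. BeautifulSoup.py for parsing HTML. No explanation needed.
2. feedparser.py for parsing feed XML. Many people online say GAE does not support feedparser. Here is the answer: it works. It took me a long time to figure out what was going on. In short: it can be used, but feedparser.py must be placed in the same directory as app.yaml; otherwise you get the "cannot import feedparser" failure that people report.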

Database:
Google Datastore: in the next step, the program will wake up every 30 minutes, check each site for updates, fetch any updated articles, and store them in Google's Datastore.
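The planned update check could look roughly like the sketch below. It is an assumption about how the next step might work, not code from the app: a plain dictionary stands in for the Datastore, the function name `sync_site` is hypothetical, and the 30-minute scheduling is left to whatever scheduler the platform provides.

```python
def sync_site(entries, store):
    """Save entries whose link is not yet in `store`; return the newly saved links.

    `entries` is a list of dicts with at least a 'link' key (for example, values
    built from feed entries). `store` is any dict-like object keyed by link.
    """
    new_links = []
    for entry in entries:
        link = entry["link"]
        if link not in store:
            store[link] = entry      # persist only posts we have not seen before
            new_links.append(link)
    return new_links


# Two consecutive wake-ups: the second one stores only the post that is new.
store = {}
first_run = sync_site([{"link": "a"}, {"link": "b"}], store)
second_run = sync_site([{"link": "b"}, {"link": "c"}], store)
```

Keying the store by the post link keeps each wake-up cheap, because already-stored articles are skipped without fetching their pages again.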

App configuration

Following Google's rules, the configuration file app.yaml is as follows.
It mainly defines some static directories, where the CSS and JavaScript live:

The code is as follows:

application: l2zstory
version: 1
runtime: python
api_version: 1

handlers:
- url: /images
  static_dir: l2zstory/templates/template2/images
- url: /css
  static_dir: l2zstory/templates/template2/css
- url: /js
  static_dir: l2zstory/templates/template2/js
- url: /.*
  script: main.py

URL configuration

This uses Django's regular-expression URL patterns.

The code is as follows:

from django.conf.urls.defaults import *

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

urlpatterns = patterns('',
    # Example:
    # (r'^l2zstory/', include('l2zstory.foo.urls')),

    # Uncomment the admin/doc line below and add 'django.contrib.admindocs'
    # to INSTALLED_APPS to enable admin documentation:
    # (r'^admin/doc/', include('django.contrib.admindocs.urls')),

    # Uncomment the next line to enable the admin:
    # (r'^admin/(.*)', admin.site.root),
    (r'^$','l2zstory.stories.views.L2ZStory'),
    (r'^YukiLife/','l2zstory.stories.views.YukiLife'),
    (r'^ZLife_Sina/','l2zstory.stories.views.ZLife_Sina'),
    (r'^ZLife/','l2zstory.stories.views.ZLife')
)
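To see how these patterns pick a view, here is a rough stand-alone imitation using only the `re` module. It is not Django's resolver; the `URL_PATTERNS` list and the `resolve` function are hypothetical names that mirror the four entries above.

```python
import re

# (pattern, view name) pairs mirroring the urlpatterns above.
URL_PATTERNS = [
    (r'^$', 'L2ZStory'),
    (r'^YukiLife/', 'YukiLife'),
    (r'^ZLife_Sina/', 'ZLife_Sina'),
    (r'^ZLife/', 'ZLife'),
]


def resolve(path):
    """Return the first view name whose pattern matches `path`, or None."""
    # Django matches patterns against the path without its leading slash.
    if path.startswith('/'):
        path = path[1:]
    for pattern, view_name in URL_PATTERNS:
        if re.search(pattern, path):
            return view_name
    return None
```

Because the patterns have no trailing `$`, a path such as `/ZLife/anything` would also reach the `ZLife` view in this sketch, while `/ZLife_Sina/` does not collide with `^ZLife/` since the character after `ZLife` is `_` rather than `/`.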

View details

Anyone familiar with Django will have spotted the view names in the URL configuration. I only show the L2ZStory view here, because the other views share much the same structure.

The code is as follows:

#from BeautifulSoup import BeautifulSoup
from django.shortcuts import render_to_response
from PyUtils import getAboutPage
from PyUtils import getPostInfos

def L2ZStory(request):
    url="feed://l2zstory.wordpress.com/feed/"
    about_url="http://l2zstory.wordpress.com/about/"
    blog_type="wordpress"
    htmlpages={}
    aboutContent=getAboutPage(about_url,blog_type)
    # Fall back to a default blurb when the About page cannot be fetched
    if aboutContent=="Not Found":
        aboutContent="We use this to tell those past stories..."
    htmlpages['about']={}
    htmlpages['about']['content']=aboutContent
    htmlpages['about']['title']="About This Story"
    htmlpages['about']['url']=about_url
    PostInfos=getPostInfos(url,blog_type,order_desc=True)
    return render_to_response('l2zstory.html',
                              {'PostInfos':PostInfos,
                               'htmlpages':htmlpages})

The view mainly builds a dictionary of dictionaries, htmlpages, and a list of dictionaries, PostInfos.
htmlpages stores site pages such as About and Contact Us.
PostInfos stores each article's content, author, publication time, and so on.
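For orientation, the two structures have roughly the following shape. The `about` values come from the view shown earlier; the single `PostInfos` entry is a made-up placeholder to illustrate the keys, not real data from any of the sites.

```python
# Dictionary of dictionaries: one entry per static page (About, Contact Us, ...).
htmlpages = {
    'about': {
        'title': 'About This Story',
        'url': 'http://l2zstory.wordpress.com/about/',
        'content': 'We use this to tell those past stories...',
    },
}

# List of dictionaries: one entry per post, in display order.
PostInfos = [
    {
        'title': 'A sample post',                       # placeholder value
        'author': 'someone',                            # placeholder value
        'date': 'Mon, 02 Jan 2012 00:00:00 +0000',      # placeholder value
        'link': 'http://example.invalid/sample-post/',  # placeholder value
        'description': '<p>Full text or feed summary goes here.</p>',
    },
]
```

The template can then reach a value with dotted lookups such as `htmlpages.about.title`, and loop over `PostInfos` to render each post.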

The most important piece here is PyUtils. It is the core of the app.

PyUtils details

I have highlighted the details I consider most important and added comments.

The code is as follows:

import feedparser
import urllib2
import re
from BeautifulSoup import BeautifulSoup

header={
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
}

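# This header is used to fool the site's back end. Sites like Sina are very unfriendly to apps like ours. I wish they would learn a thing or two from WordPress, which is blocked by the firewall.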
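The browser-like header can be tried out in isolation. The sketch below uses Python 3's `urllib.request` rather than the `urllib2` module used by the app, and the helper names `build_request` and `fetch` are hypothetical.

```python
import urllib.request

# Same User-Agent string as the article's header dictionary.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) '
                  'Gecko/20100101 Firefox/8.0.1',
}


def build_request(url):
    """Return a Request object that carries the browser-like User-Agent."""
    return urllib.request.Request(url, data=None, headers=HEADERS)


def fetch(url, timeout=10):
    """Download the page body as bytes. This performs a real network call."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Building the request separately from sending it makes it possible to inspect the headers without touching the network, for example with `build_request(url).get_header('User-agent')`.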

The code is as follows:

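timeoutMsg="""
The Robot cannot connect to the desired page due to either of these reasons:
1. Great Fire Wall
2. The Blog Site has blocked connections made by Robots.
"""

def getPageContent(url,blog_type):
    try:
        req=urllib2.Request(url,None,header)
        response=urllib2.urlopen(req)
        html=response.read()
        html=BeautifulSoup(html).prettify()
        soup=BeautifulSoup(html)
        Content=""
        # Apply a different filter for each type of site
        if blog_type=="wordpress":
            try:
                # Drop the WordPress sharing widget before collecting the post body
                for Sharesection in soup.findAll('div',{'class':'sharedaddy sd-like-enabled sd-sharing-enabled'}):
                    Sharesection.extract()
                for item in soup.findAll('div',{'class':'post-content'}):
                    Content+=unicode(item)
            except Exception:
                Content="No Post Content Found"
        elif blog_type=="sina":
            try:
                for item in soup.findAll('div',{'class':'articalContent '}):
                    Content+=unicode(item)
            except Exception:
                Content="No Post Content Found"
    except Exception:
        # Any network or parsing failure falls back to the timeout message
        Content=timeoutMsg
    return removeStyle(Content)

# NOTE: the original listing lost the "def" line and the definitions of patn and
# replacepatn. The signature is inferred from the calls to removeStyle; the pattern
# below (strip inline style attributes) is an assumption, not the author's regex.
def removeStyle(Content):
    patn=r'\sstyle="[^"]*"'
    replacepatn=""
    # Use a regular expression to strip formatting from the fetched content,
    # so that the remaining text is cleaner
    Content=re.sub(patn,replacepatn,Content)
    return Content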

def getPostInfos(url,blog_type,order_desc=False):
    feeds=feedparser.parse(url)
    PostInfos=[]
    # order_desc=True reverses the order in which the feed lists its entries
    if order_desc:
        items=feeds.entries[::-1]
    else:
        items=feeds.entries
    Cnt=0
    for item in items:
        PostInfo={}
        PostInfo['title']=item.title
        PostInfo['author']=item.author
        PostInfo['date']=item.date
        PostInfo['link']=item.link

        if blog_type=="wordpress":
            Cnt+=1
            # Fetch the full page only for the first 8 posts; use the feed summary for the rest
            if Cnt<=8:
                PostInfo['description']=getPageContent(item.link,blog_type)
            else:
                PostInfo['description']=removeStyle(item.description)
        elif blog_type=="sina":
            PostInfo['description']=removeStyle(item.description)

        PostInfos.append(PostInfo)

    return PostInfos
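The rule that only the first eight posts get a full-page fetch can be isolated and tested without any network access. The following is a sketch in Python 3; `build_descriptions` and its `fetch_full_text` parameter are hypothetical names, with the fetcher injected so a stub can replace the real page download.

```python
def build_descriptions(entries, fetch_full_text, limit=8):
    """Return one description per entry.

    The first `limit` entries get their full page text via `fetch_full_text(link)`;
    later entries fall back to the summary already present in the feed.
    """
    descriptions = []
    for index, entry in enumerate(entries):
        if index < limit:
            descriptions.append(fetch_full_text(entry['link']))
        else:
            descriptions.append(entry['description'])
    return descriptions


# Ten fake feed entries and a stub fetcher that makes no network calls.
entries = [{'link': 'post-%d' % i, 'description': 'summary %d' % i} for i in range(10)]
result = build_descriptions(entries, lambda link: 'FULL:' + link)
```

Capping the number of full-page fetches bounds the page load time, since each fetch is a separate HTTP request made while the visitor waits.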

Template overview

In the spirit of simplicity first, all sites share a single template. The template accepts only two variables: htmlpages and PostInfos, described earlier.
The important fragment is:

The code is as follows:

<div class="page-header">
    <a href="{{htmlpages.about.url}}" name="{{htmlpages.about.title}}"><h3>{{htmlpages.about.title}}</h3></a>
</div>
<p>
    {{htmlpages.about.content}}
</p>
{%for item in PostInfos%}
<div class="page-header">
    <a href="{{item.link}}" name="{{item.title}}"><h3>{{item.title}}</h3></a>
</div>
<p><i>author: {{item.author}} &nbsp;&nbsp; date: {{item.date}}</i></p>
<p>{{item.description}}</p>
{%endfor%}
</div>