Python Regular Expressions: The re Module Explained, with a Worked Example

Updated: 2022-09-30 11:12:58 Author: hhh江月

Python's re module matches and processes strings using regular expressions. This article introduces the module and walks through a worked example, with sample code throughout, for readers who need a reference.

I. Introduction to the re module

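Python's re module (re stands for Regular Expression) provides regular-expression matching operations similar to those in Perl. Because it ships with Python, it is always available. It does not cover every complex matching scenario, but in the vast majority of cases it is enough to analyse complex strings and extract the relevant information.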

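II. Basic concepts of regular expressions

A regular expression is a pattern that describes the format of the strings to match; it is used to find, within a text, every run of characters that fits that format.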

1. Regular expression syntax

1) Special characters

\, ., ^, $, {}, [], (), |, and so on

To match any of these characters literally, escape it with \.
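As a minimal sketch of the escaping rule, the following compares an unescaped and an escaped dot. The sample strings are invented for illustration; re.escape() is the standard-library helper that escapes a literal string for you.

```python
import re

text = "Total: 3.50 (approx.)"

# Unescaped, "." matches any character, so the pattern "3.50" also matches "3x50".
loose = re.compile("3.50")
# Escaped, "\." matches only a literal dot.
strict = re.compile(r"3\.50")

loose_hits_other = loose.search("3x50") is not None    # True
strict_hits_other = strict.search("3x50") is not None  # False
strict_hits_text = strict.search(text) is not None     # True

# re.escape() escapes the special characters in a literal string.
escaped = re.escape("(approx.)")
found_literal = re.search(escaped, text).group(0)

print(loose_hits_other, strict_hits_other, strict_hits_text, found_literal)
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.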

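2) Character classes

One or more characters enclosed in [] form a character class. Without a quantifier, a character class matches exactly one of the characters it contains.

A character class can also specify ranges.

For example:

1> [a-zA-Z0-9] matches any single character from a to z, from A to Z, or from 0 to 9;

2> a ^ immediately after the opening bracket negates the class, which then matches any one character not listed (again, exactly one unless a quantifier follows);

3> inside a character class, most special characters other than \ lose their special meaning and stand for themselves;

4> ^ negates only when it comes first; anywhere else in the class it is a literal ^.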
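The four points above can be checked with a short sketch. The sample string is made up for illustration.

```python
import re

s = "a1-B2^c3"

# A plain character class matches one character at a time.
alnum = re.findall(r"[a-zA-Z0-9]", s)

# A leading ^ negates the class: everything that is NOT a letter or digit.
non_alnum = re.findall(r"[^a-zA-Z0-9]", s)

# A ^ that is not in first position is just a literal caret.
a_or_caret = re.findall(r"[a^]", s)

print(alnum)       # ['a', '1', 'B', '2', 'c', '3']
print(non_alnum)   # ['-', '^']
print(a_or_caret)  # ['a', '^']
```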

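3) Shorthand notation

. ------ matches any single character except a newline

\d ------ matches a Unicode digit
\D ------ matches a Unicode non-digit
\s ------ matches Unicode whitespace
\S ------ matches Unicode non-whitespace
\w ------ matches a Unicode word character
\W ------ matches a Unicode non-word character
? ------ matches the preceding item 0 or 1 times
* ------ matches the preceding item 0 or more times
+ (plus sign) ------ matches the preceding item 1 or more times
{m} ------ matches the preceding expression exactly m times
{m,} ------ matches the preceding expression at least m times
{,n} ------ matches the preceding expression at most n times
{m,n} ------ matches the preceding expression at least m and at most n times
() ------ captures the text matched inside the parentheses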
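A few of the shorthands and quantifiers from this list, sketched against an invented sample string:

```python
import re

s = "tel: 010-12345678, zip: 100084"

all_numbers = re.findall(r"\d+", s)      # one or more digits
three_digits = re.findall(r"\d{3}", s)   # exactly three digits per match
six_or_more = re.findall(r"\d{6,}", s)   # at least six digits
optional_u = re.findall(r"colou?r", "color colour")  # "u" zero or one time

# Parentheses capture the pieces of the phone number separately.
area, number = re.search(r"(\d{3})-(\d{8})", s).groups()

print(all_numbers)   # ['010', '12345678', '100084']
print(three_digits)  # ['010', '123', '456', '100', '084']
print(six_or_more)   # ['12345678', '100084']
print(optional_u)    # ['color', 'colour']
print(area, number)  # 010 12345678
```

Note that the counts inside braces are written without spaces, as in {6,}; with a space inside, Python treats the braces as literal text.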

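2. The regular expression module in Python

Python handles regular expressions through the re module. Its syntax follows the basics listed above; pay particular attention to the shorthand notation in 3), because those constructs come up constantly when analysing data collected by a crawler.

3. Selected functions of the re module

1) re.compile()

First, look at the built-in help for re.compile() in an interactive session: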

>>> import re
>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

>>>

That is, re.compile() compiles a regular expression pattern and returns a pattern object.

Calling re.compile(r, f), where r is the pattern and f the optional flags, produces a pattern object whose methods you then call. The advantage is that the compiled object can be reused many times.
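A small sketch of compiling once and reusing the pattern object; the input strings are invented for illustration.

```python
import re

# Compile once...
word = re.compile(r"\w+")

# ...and reuse the same pattern object on different inputs.
first = word.findall("hello regex world")
second = word.findall("compile once, reuse")

# The second argument carries flags, e.g. case-insensitive matching.
py = re.compile(r"python", re.IGNORECASE)
any_case = py.findall("Python PYTHON python")

print(first)     # ['hello', 'regex', 'world']
print(second)    # ['compile', 'once', 'reuse']
print(any_case)  # ['Python', 'PYTHON', 'python']
```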

2) re.findall()

As before, start with the help text:

>>> help(re.findall)
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

In other words:

re.findall(pattern, string, flags=0), or rx.findall(s, start, end) on a compiled pattern rx

returns a list. If the pattern contains no groups, the list holds every full match. If it contains exactly one group, the list holds the text matched by that group. If it contains more than one group, each element is a tuple of the texts matched by the groups; in both grouped cases the text matched by the whole pattern is not returned.
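The three return shapes can be seen side by side in this sketch (sample string invented for illustration). The last two calls use the compiled-pattern form, whose optional arguments are start and end positions rather than flags.

```python
import re

s = "x=1, y=22, z=333"

no_group = re.findall(r"\w=\d+", s)        # full matches
one_group = re.findall(r"\w=(\d+)", s)     # only the group's text
two_groups = re.findall(r"(\w)=(\d+)", s)  # tuples, one item per group

rx = re.compile(r"\d+")
from_5 = rx.findall(s, 5)       # search s[5:] onward
from_5_to_9 = rx.findall(s, 5, 9)  # search only s[5:9]

print(no_group)     # ['x=1', 'y=22', 'z=333']
print(one_group)    # ['1', '22', '333']
print(two_groups)   # [('x', '1'), ('y', '22'), ('z', '333')]
print(from_5)       # ['22', '333']
print(from_5_to_9)  # ['22']
```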

3) re.finditer()

>>> help(re.finditer)
Help on function finditer in module re:

finditer(pattern, string, flags=0)
    Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.

    Empty matches are included in the result.

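re.finditer(pattern, string, flags=0), or rx.finditer(s, start, end) on a compiled pattern rx

returns an iterator.

Each step of the iteration yields a match object. Call its group() method to see what a given group matched; group 0 is the text matched by the whole pattern.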
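A brief sketch of iterating over match objects; unlike findall, each match also exposes the whole matched text and its position. The sample string is invented.

```python
import re

s = "x=1, y=22"

rows = []
for m in re.finditer(r"(\w)=(\d+)", s):
    # group(0) is the whole match; group(1) and group(2) are the captures.
    rows.append((m.group(0), m.group(1), m.group(2), m.start()))

print(rows)  # [('x=1', 'x', '1', 0), ('y=22', 'y', '22', 5)]
```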

4) re.search()

>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.

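re.search(pattern, string, flags=0), or rx.search(s, start, end) on a compiled pattern rx

returns a match object, or None if nothing matches.

search stops at the first match and does not continue scanning.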
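A minimal sketch showing that search returns only the first match, and None when there is none (inputs invented for illustration):

```python
import re

m = re.search(r"\d+", "abc 123 def 456")
first_text = m.group(0)  # only the first run of digits
first_span = m.span()    # (start, end) indices of that match

missing = re.search(r"\d+", "no digits here")

print(first_text)  # 123
print(first_span)  # (4, 7)
print(missing)     # None
```

Because the result may be None, it is usually worth testing it before calling group().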

5) re.match()

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

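re.match(pattern, string, flags=0), or rx.match(s, start, end) on a compiled pattern rx

returns a match object if the pattern matches at the start of the string, and None otherwise.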
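The difference from search is easiest to see on a string where the digits are not at the start. A short sketch with invented inputs:

```python
import re

at_start = re.match(r"\d+", "123abc")        # digits at position 0: matches
not_at_start = re.match(r"\d+", "abc123")    # digits later on: no match
found_anywhere = re.search(r"\d+", "abc123")  # search scans the whole string

print(at_start.group(0))        # 123
print(not_at_start)             # None
print(found_anywhere.group(0))  # 123
```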

6) re.sub()

>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.

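rx.sub(x, s, m), the compiled-pattern form of re.sub(r, x, s, m)

returns a string in which each match is replaced by x; if m is given, at most m replacements are made. Within x, \i or \g<id> refers to captured text, where id is a group name or number.

The x in re.sub(r, x, s, m) can also be a function. It receives each match object, so the captured text can be processed by the function before being substituted for the matched text.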
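The three replacement styles described here (numbered references, \g<id> references, and a function) can be sketched as follows. The sample data, the group names y, m, d, and the helper double are all invented for illustration; named groups use the (?P<name>...) syntax.

```python
import re

s = "2022-09-30 and 2021-01-05"

# Numbered back-references: reorder year-month-day into day/month/year.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", s)

# Named groups referenced with \g<name>; count=1 limits it to one replacement.
named = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
               r"\g<d>.\g<m>.\g<y>", s, count=1)


def double(match):
    """Hypothetical helper: replace each number with twice its value."""
    return str(int(match.group(0)) * 2)


doubled = re.sub(r"\d+", double, "a1 b20 c300")

print(swapped)  # 30/09/2022 and 05/01/2021
print(named)    # 30.09.2022 and 2021-01-05
print(doubled)  # a2 b40 c600
```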

7) re.subn()

>>> help(re.subn)
Help on function subn in module re:

subn(pattern, repl, string, count=0, flags=0)
    Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used.

rx.subn(x, s, m)

Identical to sub(), except that it returns a 2-tuple: the resulting string and the number of substitutions made.
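A one-call sketch of subn, collapsing runs of whitespace in an invented string and reporting how many runs were replaced:

```python
import re

new_s, n = re.subn(r"\s+", " ", "a   b \t c\n d")

print(repr(new_s))  # 'a b c d'
print(n)            # 3
```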

8) re.split()

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.

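rx.split(s, m), the compiled-pattern form of re.split(r, s, m)

splits the string at each match of the pattern and returns a list of the pieces.

If the pattern contains groups, the text matched by the groups is also inserted into the list between each pair of pieces.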
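A short sketch of split with and without a capturing group, plus the maxsplit limit from the help text (inputs invented for illustration):

```python
import re

plain = re.split(r",\s*", "a, b,c,  d")
with_group = re.split(r"(,)\s*", "a, b,c")  # captured separators are kept
limited = re.split(r",\s*", "a, b,c,  d", maxsplit=2)

print(plain)       # ['a', 'b', 'c', 'd']
print(with_group)  # ['a', ',', 'b', ',', 'c']
print(limited)     # ['a', 'b', 'c,  d']
```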

III. A worked regular expression example

Let's put regular expressions to work in a small crawler:

It crawls the Douban Movie Top 250 list and fetches the rating of each film.

import re
import requests

if __name__ == '__main__':
    # Test entry point (main).
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63'
    }
    j = 1  # running rank across all pages
    # The list is paginated 25 films per page: start = 0, 25, ..., 225.
    for i in range(0, 226, 25):
        url = f'https://movie.douban.com/top250?start={i}&filter='
        response = requests.get(url=url, headers=headers)
        # Group 1 captures the film's URL, group 2 its title (the alt text).
        result = re.findall(r'<a href="(\S+)">\s+'
                            r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                            r'</a>', response.text)
        for movie in result:
            url_0 = movie[0]
            response_0 = requests.get(url=url_0, headers=headers)
            # Capture the rating from the film's own page.
            score = re.findall(r'<strong class="ll rating_num" property="v:average">(\S+)'
                               r'</strong>\s+'
                               r'<span property="v:best" content="10.0"></span>',
                               response_0.text)[0]
            print(j, end='  ')
            j += 1
            print(movie[1], end='  ')
            print(movie[0], end='  ')
            print(f'Rating: {score}')

Here the regular expressions extract each film's title and URL; the crawler then follows the URL to the film's own page and extracts its rating.
The key regular expressions are:

1. Extracting the film title and URL:

result = re.findall(r'<a href="(\S+)">\s+'
                    r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                    r'</a>', response.text)

2. Extracting the film's rating:

score = re.findall(r'<strong class="ll rating_num" property="v:average">(\S+)'
                   r'</strong>\s+'
                   r'<span property="v:best" content="10.0"></span>',
                   response_0.text)[0]
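The title-and-URL pattern from item 1 can be exercised offline, without touching the network. The sketch below runs it against a hand-written HTML snippet; the snippet, the example.com URLs, the titles, and the names LISTING and RELAXED are all invented for illustration and are not taken from the real page. It also shows one property of the pattern worth knowing: because the title is captured with \S+, an alt text containing a space does not match. The variant using [^"]+ is an alternative suggested here, not part of the original crawler.

```python
import re

# The listing pattern exactly as used in the crawler.
LISTING = re.compile(r'<a href="(\S+)">\s+'
                     r'<img width="100" alt="(\S+)" src="\S+" class="">\s+'
                     r'</a>')

# Hypothetical variant: capture everything up to the closing quote instead.
RELAXED = re.compile(r'<a href="(\S+)">\s+'
                     r'<img width="100" alt="([^"]+)" src="\S+" class="">\s+'
                     r'</a>')

# Invented sample markup, shaped like what the pattern expects.
sample = '''
<a href="https://example.com/subject/1/">
    <img width="100" alt="TitleOne" src="https://example.com/1.jpg" class="">
</a>
<a href="https://example.com/subject/2/">
    <img width="100" alt="Title Two" src="https://example.com/2.jpg" class="">
</a>
'''

matches = LISTING.findall(sample)
relaxed_matches = RELAXED.findall(sample)

print(matches)          # only the entry whose title has no space
print(relaxed_matches)  # both entries
```

Testing a pattern on a saved snippet like this is a quick way to find entries it silently skips before running the full crawl.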

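One last remark: the shortcoming of this crawler is that it does not seem to retrieve all 250 films, only 248. This is probably just a problem with the endpoint, and the impact is small. Note, though, that alt="(\S+)" cannot match a title containing whitespace, which could also account for missing entries.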

As shown in the figure below: