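Common encoding-related failures when reading files in Python, and how to fix them

Updated: 2023-10-13 11:15:52  Author: JasonLiu1919

This article summarizes failures caused by encoding problems when reading files in Python, together with ways to resolve them. A file encoding error here means that the encoding used to read the file does not match the bytes the file actually contains.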

Background

Reading files with Python is a routine part of daily work, yet it frequently fails with errors such as:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6342: invalid continuation byte

Problem 1

Analysis and troubleshooting

This experiment uses pandas to read a text file and display the first 5 rows:

import pandas as pd
raw_file = "/share/jiepeng.liu/public_data/ner/weiboNER/weiboNER.conll.dev"
df = pd.read_csv(raw_file, sep='\t')
print(df.head())

Reading the file raises an error:

    df = pd.read_csv(raw_file, sep='\t')
  File "/opt/conda/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
    return mapping[engine](f, **self.options)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6342: invalid continuation byte

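This error is raised when the file contains bytes that cannot be decoded with the chosen codec. A short script can tell the two possible causes apart: either the encoding chosen for the whole file is wrong, or only a few isolated byte sequences are undecodable: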

raw_file = "/share/jiepeng.liu/public_data/ner/weiboNER/weiboNER.conll.dev"
def check_pd_read_utf8():
    # Read the file in binary mode and try to decode it line by line.
    print("file_name=", raw_file)
    with open(raw_file, "rb") as f:
        for line_num, line in enumerate(f, start=1):
            try:
                # Only decode here; printing every decoded line would bury the failures.
                line.decode('utf8')
            except UnicodeDecodeError:
                print("line num={}, text={}".format(line_num, str(line)))

check_pd_read_utf8()

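There are several possibilities:

If the printed lines consist entirely of hex escapes, the decoding charset itself was probably chosen incorrectly. For Python 2.7, a workaround that circulates online is: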

#coding=utf8
import sys
reload(sys)
sys.setdefaultencoding("UTF-8")

However, this approach is no longer available in Python 3: reload is not a builtin there, and sys.setdefaultencoding does not exist.
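In Python 3 the usual substitute is to state the encoding explicitly wherever bytes are turned into text. A minimal sketch, assuming a GBK-encoded file (the temporary file and its content are made up for illustration):

```python
import os
import tempfile

# Create a small GBK-encoded file to stand in for real data (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("编码测试\n".encode("gbk"))
    path = tmp.name

# Pass the encoding explicitly instead of relying on a process-wide default.
with open(path, encoding="gbk") as f:
    text = f.read()
print(text)

os.remove(path)
```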

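If the charset is wrong, the file's actual encoding can be detected programmatically; the script for this is given in the appendix. Alternatively, open the file in Notepad++ and check the lower-right corner of the window, which shows the encoding of the file.
In other cases the file as a whole is fine: it opens in Notepad++ without any visible problem, yet decoding it in Python still fails. The experiment above is such a case.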

Running check_pd_read_utf8 produces the following output:

line num=996, text=b'\xed\xa0\xbd\tO\n'
line num=997, text=b'\xed\xb0\xad\tO\n'
line num=998, text=b'\xed\xa0\xbd\tO\n'
line num=999, text=b'\xed\xb0\xad\tO\n'
line num=1000, text=b'\xed\xa0\xbd\tO\n'
line num=1001, text=b'\xed\xb0\xad\tO\n'
line num=1875, text=b'\xed\xa0\xbc\tO\n'
line num=1876, text=b'\xed\xbd\x9d\tO\n'
line num=1877, text=b'\xed\xa0\xbc\tO\n'
line num=1878, text=b'\xed\xbd\x9b\tO\n'
line num=1879, text=b'\xed\xa0\xbc\tO\n'
line num=1880, text=b'\xed\xbd\xb1\tO\n'
line num=1881, text=b'\xed\xa0\xbc\tO\n'
line num=1882, text=b'\xed\xbd\xa3\tO\n'
line num=1883, text=b'\xed\xa0\xbc\tO\n'
line num=1884, text=b'\xed\xbd\x99\tO\n'
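The failing bytes above can be examined directly. As a small sketch (the variable names are illustrative, and the byte strings are copied from lines 996 and 997 of the output), Python's surrogatepass error handler can decode such sequences. This suggests that each one is a lone UTF-16 surrogate and that two adjacent ones together form a single character outside the Basic Multilingual Plane:

```python
# Byte sequences copied from lines 996 and 997 of the output above.
high_bytes = b'\xed\xa0\xbd'
low_bytes = b'\xed\xb0\xad'

# 'surrogatepass' lets the UTF-8 codec decode surrogate code points,
# which strict UTF-8 rejects.
high = high_bytes.decode('utf-8', errors='surrogatepass')
low = low_bytes.decode('utf-8', errors='surrogatepass')
print(hex(ord(high)), hex(ord(low)))

# Joining the pair and round-tripping through UTF-16 merges the two
# surrogates into the single code point they represent.
pair = high + low
merged = pair.encode('utf-16', errors='surrogatepass').decode('utf-16')
print(len(merged), hex(ord(merged)))
```

Because this file appears to hold one character per line, the two halves of such a character end up on separate lines, and neither line is valid UTF-8 on its own.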

Next, inspect those lines in the raw file (the original screenshot is not preserved here).

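Solution

Specific lines do indeed contain bytes that are not valid in the chosen encoding, which is why decoding as 'utf-8' fails. In every reported line, the offending sequence starts with 0xED followed by a byte in the range 0xA0 to 0xBF. That is the UTF-8-style encoding of a UTF-16 surrogate code point (U+D800 to U+DFFF), which strict UTF-8 forbids. There are two ways to handle this:

Delete the offending lines from the raw data.
When reading the file with pandas, set encoding_errors='ignore' (available since pandas 1.3) so that the undecodable bytes are silently dropped. The affected rows are still read, only without those bytes.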

import pandas as pd
raw_file = "/share/jiepeng.liu/public_data/ner/weiboNER/weiboNER.conll.dev"
df = pd.read_csv(raw_file, sep='\t', encoding_errors='ignore')
print(df.head())
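To see what the error handler does to the data, the same handler names can be tried on a single line with bytes.decode. This is only an illustration on one byte string copied from the output above. It assumes, as the pandas documentation describes, that encoding_errors follows Python's standard codec error handlers:

```python
# One of the failing lines reported by check_pd_read_utf8.
bad_line = b'\xed\xa0\xbd\tO\n'

# 'ignore' silently drops the undecodable bytes; the tab-separated
# structure of the line survives, but the token itself is lost.
ignored = bad_line.decode('utf-8', errors='ignore')
print(repr(ignored))

# 'replace' substitutes U+FFFD for the undecodable bytes, which keeps a
# visible marker where data was lost.
replaced = bad_line.decode('utf-8', errors='replace')
print(repr(replaced))
```

With 'ignore' the row should therefore still appear in the DataFrame, just with an empty first field. encoding_errors='replace' is an alternative when a visible placeholder is preferable.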

Problem 2

Analysis and troubleshooting

Reading a file with pandas.read_csv fails with:

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 39252

This error means that a particular row contains a problematic character that prevents the pandas CSV parser from reading the whole file.

The "pandas.errors.ParserError: Error tokenizing data. C error: EOF" error occurs because, by default, pandas treats everything between two double quotes in a CSV file as a single string field and ignores any delimiters between them. Consequently, if the file contains an odd number of double quotes, the last quote opens a field that never finds its closing quote before the file ends, and the parser reports that the end of file (EOF) occurred inside a string.

The next step is to count the double quotes in the raw file (the original screenshot of the count is not preserved here).
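One way to do such a count is sketched below. The function name and the sample lines are hypothetical; on a real file the lines would come from iterating over the opened file. It reports every line whose number of double quotes is odd, which are the candidates for the unbalanced quote:

```python
def find_odd_quote_lines(lines, quotechar='"'):
    """Return (line_number, line) pairs whose quote count is odd."""
    suspects = []
    for line_num, line in enumerate(lines, start=1):
        if line.count(quotechar) % 2 == 1:
            suspects.append((line_num, line))
    return suspects


# Made-up sample data: line 2 contains a single unbalanced double quote.
sample = ['a\tO\n', '"\tO\n', '"quoted"\tO\n']
total_quotes = sum(line.count('"') for line in sample)
print("total double quotes:", total_quotes)
suspects = find_odd_quote_lines(sample)
print(suspects)
```

A quoted field that legitimately spans several lines also produces odd per-line counts, so the reported lines are candidates to inspect rather than confirmed errors.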

Solution

Delete the row containing the unpaired double quote so that the total number of double quotes is even.
Pass quoting=csv.QUOTE_NONE when reading the file.

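The more elegant fix is to set the quoting parameter, which changes this default behavior of pandas when reading CSV files. Two parameters of read_csv are relevant: quotechar, the quote character, and quoting, the quoting behavior. Their descriptions below are taken from the official pandas documentation.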

quotechar : str (length 1), optional

The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

quoting : int or csv.QUOTE_* instance, default 0

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

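The quotechar parameter names the character treated as the quote character during parsing. It defaults to the double quote but can be any single character. As described above, everything between two quote characters, including line breaks and delimiters, is parsed as a single field. The quoting parameter controls how quote characters are handled. It takes one of four values, csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE, and defaults to csv.QUOTE_MINIMAL. The four values mean the following:

csv.QUOTE_MINIMAL: when reading, content between quote characters is parsed as a single field and the quote characters themselves are stripped, since they only mark the field boundary; when writing, only fields that contain special characters such as the delimiter or the quote character are quoted.
csv.QUOTE_ALL: when writing, quote every field.
csv.QUOTE_NONNUMERIC: when writing, quote every non-numeric field.
csv.QUOTE_NONE: when reading, quote characters get no special treatment and are read as ordinary characters; when writing, no field is quoted.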

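So, to resolve the exception in this experiment, set quoting to 3, or import Python's built-in csv module and set it to csv.QUOTE_NONE. pandas then treats the quote character as an ordinary character and no longer searches for a closing quote all the way to the end of the file.

Because the offending row is garbled, its number of delimiters is probably wrong too, so its field count may differ from that of the preceding rows and trigger another error. Setting error_bad_lines=False makes pandas drop such bad lines so that the rest of the file is read normally. Since pandas 1.3 this parameter is deprecated in favor of on_bad_lines, and on_bad_lines='skip' gives the same behavior; error_bad_lines was removed entirely in pandas 2.0.

Given what quotechar does, another option is to set quotechar to some other single character so that pandas treats the double quote as ordinary text. The risk is that the new quote character triggers the same kind of exception elsewhere, so this is not recommended.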
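The effect of the quoting setting can be reproduced on a tiny in-memory example. The sketch below uses Python's built-in csv module rather than pandas, so it is an analogy and not the pandas parser itself, and the sample data is made up. With the default quoting, the unbalanced quote swallows the rest of the input into one field; with csv.QUOTE_NONE every line is split normally:

```python
import csv
import io

# Made-up tab-separated data; the second line has an unbalanced double quote.
data = 'a\tb\n"c\td\ne\tf\n'

# Default quoting (QUOTE_MINIMAL): the quote opens a field that never closes.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))
print(default_rows)

# QUOTE_NONE: the quote is an ordinary character, so all three lines parse.
plain_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                             quoting=csv.QUOTE_NONE))
print(plain_rows)
```

In pandas, the corresponding call would look like pd.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE, on_bad_lines='skip'), where path is a placeholder for the file being read and on_bad_lines requires pandas 1.3 or later.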

Appendix

The following code checks a file's encoding:

import chardet

file_name = "/share/jiepeng.liu/public_data/ner/weiboNER/weiboNER.conll.train"

# Detect the file encoding with chardet
def check_file_encoding_type_chardet(file):
    # Read in binary mode so that chardet sees the raw bytes
    with open(file, 'rb') as f:
        # This feeds the whole file to chardet; if the file contains abnormally
        # encoded characters (such as the data in Problem 1), it may return None
        encoding = chardet.detect(f.read())['encoding']
        # Faster alternative: only inspect the first 1024 bytes
        #encoding = chardet.detect(f.read()[0:1024])['encoding']
    print("chardet check file encoding type=", encoding)

check_file_encoding_type_chardet(file_name)

# Detect the file encoding with magic
def check_file_encoding_type_magic():
    # pip install python-magic
    import magic
    with open(file_name, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)
    print("magic check file encoding type=", encoding)

check_file_encoding_type_magic()

# Find the lines that fail to decode as UTF-8
def check_pd_read_utf8():
    # Read the file in binary mode and try to decode it line by line
    print("file_name=", file_name)
    with open(file_name, "rb") as f:
        for line_num, line in enumerate(f, start=1):
            try:
                # Only decode here; printing every decoded line would bury the failures
                line.decode('utf8')
            except UnicodeDecodeError:
                print("line num={}, text={}".format(line_num, str(line)))

check_pd_read_utf8()