Pitfalls and fixes when parsing XML with the lxml xpath module in Python

Updated: 2024-05-24 14:59:37 Author: weixin_45906169

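This article covers the pitfalls I ran into when parsing XML with the lxml xpath module in Python and how I resolved them. I hope it is a useful reference; if you spot mistakes or cases I missed, corrections are welcome.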

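Project scenario

I was parsing electronic medical record CDA documents. CDA documents are XML, and the attribute values of some nodes needed to be modified.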

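Problem description

While parsing XML with Python I searched Baidu extensively, but the results were unsatisfying: the examples were either not detailed enough or were pitfalls themselves. To summarize, I ran into two main problems:

1. Calling etree.fromstring(new_doc_content) raises an error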

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.


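2. xpath cannot find values and returns [] or {}

Cause analysis

1. The data comes out of a database query as a Python str that still carries its XML encoding declaration. lxml refuses such input, so etree.fromstring(new_doc_content) has to be given a byte string.

2. The CDA document contains an encoding declaration and a namespace, so ordinary xpath syntax returns no data, or returns some text while other nodes and attribute values stay out of reach. In XML that uses namespaces, the xpath expression has to carry the namespace as well. The problem lay exactly there. Some of the material I found online wrote the namespace like this: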

ns = {"d" : "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)

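This is what led me astray. I debugged with this form over and over and never got any data. Much of the material from other sources used a similar form, and I overlooked the places where they differed.

For example, this form:

url = root.xpath("//d:loc", namespaces={'d' : 'http://www.sitemaps.org/schemas/sitemap/0.9'})

At first glance the only difference is whether the namespaces value is defined beforehand, so I did not look any further.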

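Later I wrapped the parsed root with foo_tree = etree.ElementTree(xml) and modified the attributes by iterating over foo_tree.getroot(). That solved the immediate problem, but I still wanted to locate nodes with xpath: I had used it for web scraping and knew how convenient it is, so I came back to fix the xpath issue.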
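The traversal workaround might look roughly like the sketch below. The miniature document, the target tag, and the replacement value are made up for illustration and are not from the real project; only the namespace matches the CDA files. etree.QName is used to compare local names, because namespaced tags are stored in the form {urn:hl7-org:v3}effectiveTime.

```python
from lxml import etree

# Hypothetical miniature document; only the namespace matches the real CDA files.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<ClinicalDocument xmlns="urn:hl7-org:v3">'
    '<effectiveTime value="20220407090145"/>'
    '</ClinicalDocument>'
)

root = etree.fromstring(sample.encode('utf-8'))
foo_tree = etree.ElementTree(root)

# Walk every element; namespaced tags look like '{urn:hl7-org:v3}effectiveTime'.
for element in foo_tree.getroot().iter():
    if not isinstance(element.tag, str):
        continue  # skip comments and processing instructions
    if etree.QName(element).localname == 'effectiveTime':
        element.set('value', '20240101000000')  # illustrative new value

updated = root[0].get('value')
print(updated)
```

This works without any namespaces dictionary, but every lookup becomes a full walk of the tree, which is why xpath remains the more convenient tool once the namespace handling is sorted out.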

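Looking again, I noticed a difference in how the namespaces dictionary was written: one version used double quotes and the other used single quotes. So I tried it and changed the double quotes to single quotes.

And just like that, it worked: the nodes were found.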
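One caveat is worth checking here: Python itself treats 'd' and "d" as the same string, so the quote style alone should not change what xpath returns. The sketch below uses a made-up sitemap fragment (the URL inside loc is a placeholder) to compare the two dictionaries and to show what does make a difference, namely whether the element name in the expression carries the prefix.

```python
from lxml import etree

# Hypothetical sitemap fragment used only for this demonstration.
sitemap = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://example.com/</loc></url>'
    b'</urlset>'
)
root = etree.fromstring(sitemap)

ns_double = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}
ns_single = {'d': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# The two dictionaries compare equal: quote style is not part of the string value.
same_dict = ns_double == ns_single

# Without a prefix the element is not found, because it lives in a namespace.
without_prefix = root.xpath('//loc')

# With the prefix, either dictionary finds it.
with_double = [e.text for e in root.xpath('//d:loc', namespaces=ns_double)]
with_single = [e.text for e in root.xpath('//d:loc', namespaces=ns_single)]

print(same_dict, without_prefix, with_double, with_single)
```

If a lookup starts working after an edit like this, it is more likely that something else changed at the same time, for example the expression itself or the namespace URI, which has to match the document's xmlns value character for character.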

Solution

1. Convert the str to a byte string

etree.fromstring(new_doc_content.encode('utf-8'))

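2. Pass xpath a namespaces dictionary that maps a prefix to the document's namespace URI, and put that prefix on every element name in the expression. I arrived at the working call by switching the dictionary from double quotes to single quotes, but Python treats both quote styles as the same string, so the quotes themselves should not be what matters.

url = root.xpath("//d:loc", namespaces={'d' : 'http://www.sitemaps.org/schemas/sitemap/0.9'})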
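Going back to the first fix, the error and its remedy can be reproduced in isolation. This is a minimal sketch; the short document assigned to new_doc_content is a hypothetical stand-in for the text read from the database.

```python
from lxml import etree

# Hypothetical stand-in for the text column read from the database.
new_doc_content = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<ClinicalDocument xmlns="urn:hl7-org:v3"><title>test</title></ClinicalDocument>'
)

# A str that carries an encoding declaration is rejected.
try:
    etree.fromstring(new_doc_content)
    str_error = None
except ValueError as exc:
    str_error = str(exc)

# Encoding to bytes first lets lxml read the declaration itself.
root = etree.fromstring(new_doc_content.encode('utf-8'))
root_tag = root.tag

print(str_error)
print(root_tag)
```

Encoding with 'utf-8' matches the encoding named in the declaration; if a document declared a different encoding, the bytes would need to be produced with that encoding instead.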

Sample XML

<?xml version="1.0" encoding="UTF-8"?> 
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:mif="urn:hl7-org:v3/mif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 ..\sdschemas\SDA.xsd"> 
<realmCode code="CN"/> 
<typeId root="2.16.840.1.113883.1.3" extension="POCD_MT000040"/> 
<templateId root="2.16.156.10011.2.1.1.33"/> 
<id root="2.16.156.10011.1.1" extension="545ED988-5235-45F1-BBFD-9326D74FAA43"/> 
<code code="00000" codeSystem="545ED988-5235-45F1-BBFD-9326D74FAA43" codeSystemName="衛生信息共享文檔規范編碼體系"/> 
<title>測試</title> 
<effectiveTime value="20220407090145"/> 
<confidentialityCode code="N" codeSystem="2.16.840.1.113883.5.25" codeSystemName="Confidentiality" displayName="正常訪問保密級別"/> 
<languageCode code="zh-CN"/> 
<setId/> 
<versionNumber/> 
<recordTarget typeCode="RCT" contextControlCode="OP"> 
<patientRole classCode="PAT"> 
<id root="2.16.156.10011.1.11" extension="00000000"/> 
<id root="2.16.156.10011.1.12" extension="00000000"/> 
<id root="2.16.156.10011.1.24" extension="-"/> 
<patient classCode="PSN" determinerCode="INSTANCE"> 
<name>XXX</name> 
<administrativeGenderCode code="1" displayName="男性" codeSystem="2.16.156.10011.2.3.3.4" codeSystemName="生理性別代碼表(GB/T 2261.1)"/> 
<age value="0" unit="歲"/> 
</patient> 
</patientRole> 
</recordTarget> 
</ClinicalDocument>

Sample Python

from lxml import etree

# new_doc_content holds the CDA XML as a str (for example, the sample document above)
xml = etree.fromstring(new_doc_content.encode('utf-8'))
# The sample's default namespace is urn:hl7-org:v3, so every xpath step needs the prefix
ns = {'x': 'urn:hl7-org:v3'}
effective_time = xml.xpath("//x:effectiveTime[@*]", namespaces=ns)
extension = xml.xpath('//x:recordTarget//x:patientRole/x:id[@extension]', namespaces=ns)
print(effective_time)
print(extension)
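Since the original goal was to modify attribute values, the located nodes can then be updated and the document serialized again. The following is a minimal sketch under assumptions: the trimmed-down document, the choice of which id to change, and the new extension value are all illustrative.

```python
from lxml import etree

# Hypothetical trimmed-down CDA document; the real one comes from the database.
new_doc_content = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<ClinicalDocument xmlns="urn:hl7-org:v3">'
    '<recordTarget><patientRole>'
    '<id root="2.16.156.10011.1.11" extension="00000000"/>'
    '<id root="2.16.156.10011.1.24" extension="-"/>'
    '</patientRole></recordTarget>'
    '</ClinicalDocument>'
)
ns = {'x': 'urn:hl7-org:v3'}

xml = etree.fromstring(new_doc_content.encode('utf-8'))

# Select only the id whose root attribute matches, then overwrite its extension.
targets = xml.xpath('//x:patientRole/x:id[@root="2.16.156.10011.1.11"]', namespaces=ns)
for node in targets:
    node.set('extension', '12345678')  # illustrative new value

# Attribute values can also be read directly with an @attribute step.
extensions = xml.xpath('//x:patientRole/x:id/@extension', namespaces=ns)

# Serialize back to bytes, keeping the XML declaration.
output = etree.tostring(xml, encoding='UTF-8', xml_declaration=True)
print(extensions)
```

Attribute names such as root and extension carry no prefix in the expression, because unprefixed attributes do not belong to the default namespace; only the element names need the x: prefix.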