Python lxml: Cleaning XML by Removing Nodes and Attributes

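Introduction:

Projects often involve data migration. How do we make sure the new data matches the old? Sometimes this calls for content testing, which means comparing XML content. Because the way the data is processed has changed, the expected differences have to be ignored, so specific XPaths need special handling. Of the options I looked into, I found lxml to be the most efficient, and it makes XML handling very convenient. This article works through an example to solve a data-cleaning problem met at work.

Outline:

XML namespace overview
Operating on XML with lxml
Cleaning XML with lxml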

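XML namespace

For background, see a reference on XML namespaces. Namespaces mainly exist to resolve naming conflicts. Tags in complex XML usually carry a prefix, and the prefix stands for a namespace.

For example, this is an ordinary XML document that uses a default namespace: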

<schema xmlns="http://www.w3.org/2001/XMLSchema" 
        targetNamespace="urn:B" 
        xmlns:B="urn:B" 
        elementFormDefault="qualified">
    <element name="foo">
        <complexType>
            <element name="bar" type="B:myType"/>
        </complexType>
    </element>
    
    <complexType name="myType">
        <choice>
            <element name="baz" type="string"/>
            <element name="bas" type="string"/>
        </choice>
    </complexType>
</schema>

The same document can also be written with a prefix:
xmlns:myprefix="http://www.w3.org/2001/XMLSchema"

<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" 
        targetNamespace="urn:B" 
        xmlns:B="urn:B" 
        elementFormDefault="qualified">

    <myprefix:element name="foo">
        <myprefix:complexType>
            <myprefix:element name="bar" type="B:myType"/>
        </myprefix:complexType>
    </myprefix:element>

    <myprefix:complexType name="myType">
        <myprefix:choice>
            <myprefix:element name="baz" type="string"/>
            <myprefix:element name="bas" type="string"/>
        </myprefix:choice>
    </myprefix:complexType>

</myprefix:schema>
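
To make the point concrete, here is a minimal sketch showing that lxml treats the two spellings as the same element names, because a prefix is only shorthand for the namespace URI. The shortened documents below are illustrative stand-ins for the full examples above.

```python
import lxml.etree as XE

# Shortened stand-ins for the two documents above (assumed for illustration).
default_form = (
    '<schema xmlns="http://www.w3.org/2001/XMLSchema">'
    '<element name="foo"/></schema>'
)
prefixed_form = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema">'
    '<myprefix:element name="foo"/></myprefix:schema>'
)

a = XE.fromstring(default_form)
b = XE.fromstring(prefixed_form)

# lxml stores tags in "{namespace-uri}localname" form, so the prefix disappears.
print(a.tag)                   # {http://www.w3.org/2001/XMLSchema}schema
print(a.tag == b.tag)          # True
print(a.prefix, b.prefix)      # None myprefix
print(XE.QName(a).localname)   # schema
```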

Operating on XML with lxml

Parse the XML:

import lxml.etree as XE
root = XE.fromstring(xml_content)

Get the namespaces

namespaces=root.nsmap
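
One thing to watch for: when the root declares a default namespace, root.nsmap contains a None key, and lxml's xpath() method does not accept an empty prefix. The sketch below remaps that key to a prefix of our own choosing. The name "xs" is an arbitrary assumption, not something taken from the document.

```python
import lxml.etree as XE

root = XE.fromstring(
    '<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns:B="urn:B">'
    '<element name="foo"/></schema>'
)

# The default namespace shows up under the key None.
print(root.nsmap)

# Give the default namespace a usable prefix ("xs" is a made-up choice).
ns = {(prefix or "xs"): uri for prefix, uri in root.nsmap.items()}

elements = root.xpath("//xs:element", namespaces=ns)
print(len(elements))  # 1
```

Note that nsmap on the root only lists namespaces declared on the root itself, so declarations that first appear deeper in the tree are not included.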

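Locate nodes
Note: findall needs a relative XPath and does not accept an absolute one. Tags that carry a prefix must be given together with their namespace mapping.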

nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
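
As a small illustration of the relative-path rule, the sketch below uses a made-up document with the prefix "p". It also shows that the separate xpath() method, which implements full XPath 1.0, does accept absolute paths if those are needed.

```python
import lxml.etree as XE

# Tiny made-up document; the prefix "p" and URI "urn:demo" are assumptions.
root = XE.fromstring('<p:a xmlns:p="urn:demo"><p:b><p:c/></p:b><p:c/></p:a>')
ns = root.nsmap

direct = root.findall("p:c", namespaces=ns)        # direct children only
anywhere = root.findall(".//p:c", namespaces=ns)   # any depth below root
print(len(direct), len(anywhere))  # 1 2

# findall rejects a path that starts with "/".
try:
    root.findall("/p:a/p:c", namespaces=ns)
    rejected = False
except SyntaxError as exc:
    rejected = True
    print("findall rejected the absolute path:", exc)

# xpath() handles absolute paths.
absolute = root.xpath("/p:a/p:c", namespaces=ns)
print(len(absolute))  # 1
```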

Remove a node

node.getparent().remove(node)
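
In lxml, the text that follows an element (its tail) belongs to that element, so remove() drops it too. That is harmless when the tail is only whitespace, as in this article's example. If the tail carries real content, a helper along these lines can keep it. The function name remove_keep_tail is hypothetical.

```python
import lxml.etree as XE

def remove_keep_tail(node):
    """Remove node from its parent but keep the text that follows it."""
    parent = node.getparent()
    if parent is None:
        return  # the root element has no parent to remove it from
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # Attach the tail to the preceding sibling.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # No preceding sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)

plain = XE.fromstring("<a>head<b/>tail<c/></a>")
plain.remove(plain.find("b"))
print(XE.tostring(plain, encoding="unicode"))  # <a>head<c/></a>

kept = XE.fromstring("<a>head<b/>tail<c/></a>")
remove_keep_tail(kept.find("b"))
print(XE.tostring(kept, encoding="unicode"))   # <a>headtail<c/></a>
```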

Remove an attribute

node.attrib.pop(attri_name_list[0])
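
A short sketch of the attribute side, using a made-up element: passing a default to pop() avoids a KeyError when the attribute is absent, and a namespaced attribute has to be addressed by its "{uri}name" key.

```python
import lxml.etree as XE

# Made-up element with one namespaced attribute (x:id).
node = XE.fromstring('<item xmlns:x="urn:x" name="foo" x:id="1" type="t"/>')

node.attrib.pop("name", None)      # removes the attribute
node.attrib.pop("missing", None)   # absent attribute: no error thanks to the default
del node.attrib["{urn:x}id"]       # namespaced attributes use the {uri}name key

print(dict(node.attrib))  # {'type': 't'}
```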

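Cleaning XML with lxml

Requirements:
Take the XML above as the example.

Remove the node myprefix:schema/myprefix:complexType/myprefix:choice
Remove the attribute
myprefix:schema/myprefix:element/[@name]

Approach:

Put the XPaths to process in a list, or read them from a file.
Many nodes may match a single XPath, so process them in a loop.
Handle both nodes and attributes.
When many XML files are processed, each one may need different XPaths, so skip any XPath that does not apply to the current XML.
To get clean, plain XML content, strip the namespace declarations as well.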
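
The first step of the approach, reading the XPaths from a file, could be sketched as follows. The file format (one XPath per line, with "#" starting a comment) and the function name load_ignore_rules are assumptions for illustration. Each rule is split into the root tag it applies to, a relative path, and an optional attribute name.

```python
import io
import re

def load_ignore_rules(lines):
    """Turn 'prefix:root/rest/of/path' lines into (root_name, relative_path, attribute)."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        root_step, _, rest = line.partition("/")
        root_name = root_step.split(":")[-1]      # local name of the root tag
        match = re.search(r"\[@([^\]]+)\]", rest)  # attribute rule, e.g. [@name]
        attribute = match.group(1) if match else None
        rules.append((root_name, ".//" + rest, attribute))
    return rules

# io.StringIO stands in for an open rule file.
sample = io.StringIO(
    "# hypothetical rule file\n"
    "myprefix:schema/myprefix:complexType/myprefix:choice\n"
    "\n"
    "myprefix:schema/myprefix:element/[@name]\n"
)
rules = load_ignore_rules(sample)
for rule in rules:
    print(rule)
```

A rule whose root_name differs from the local name of the document's root can then be skipped without touching the tree. Prefixing the rest of the path with ".//" means it matches at any depth below the root.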

Complete code:

import re
import lxml.etree as XE

def ignore_xpath_handled_by_lxml(xml_content):
       
    ignore_xpath_set = set()
    ignore_xpath_set.add("myprefix:schema/myprefix:complexType/myprefix:choice")
    ignore_xpath_set.add("myprefix:schema/myprefix:element/[@name]")
   
    root = XE.fromstring(xml_content) 

    # Local name of the root tag, without the "{namespace}" part
    root_tag_name = XE.QName(root).localname

    for xpath_ignore in ignore_xpath_set:

        xpath_ignore_tag = xpath_ignore.split("/")[0].split(":")[1]
        
        # relative path: drop the root step and search below the root
        index = xpath_ignore.find("/")
        rele_xpath_ignore = ".//" + xpath_ignore[index+1:]

        # handle the xpath only when its root step matches this document's root tag
        if xpath_ignore_tag == root_tag_name:
            try:
                attri_name_list = re.findall(r".*\[@(.*)\].*", xpath_ignore)
                nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
                if len(nodes) > 0:
                    for node in nodes:
                        if len(attri_name_list) > 0:
                            node.attrib.pop(attri_name_list[0])
                        else:
                            node.getparent().remove(node)
            except Exception as e:
                print("Error: {}".format(e))               
        else:
            continue
        
    # Replace the root start tag, dropping its namespace declarations
    # (and the other attributes that follow them on the root element).
    root_tag = "myprefix" + ":" + root_tag_name
    ignore_result = XE.tostring(root, pretty_print=True, encoding="unicode")
    namespace_pattern = re.compile('<' + re.escape(root_tag) + r'\s+xmlns[^>]*>')
    content_without_namespace = re.sub(namespace_pattern, '<' + root_tag + '>', ignore_result, count=1)
    return content_without_namespace

if __name__ == "__main__":
  
    xml_string = '''
                <myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" 
                                 targetNamespace="urn:B" 
                                 xmlns:B="urn:B" 
                                 elementFormDefault="qualified">

                    <myprefix:element name="foo">
                        <myprefix:complexType>
                            <myprefix:element name="bar" type="B:myType"/>
                        </myprefix:complexType>
                    </myprefix:element>

                    <myprefix:complexType name="myType">
                        <myprefix:choice>
                            <myprefix:element name="baz" type="string"/>
                            <myprefix:element name="bas" type="string"/>
                        </myprefix:choice>
                    </myprefix:complexType>

                </myprefix:schema>
                '''
    new_xml_string = ignore_xpath_handled_by_lxml(xml_string)
    print(new_xml_string)

Output:

<myprefix:schema>

     <myprefix:element>
            <myprefix:complexType>
                  <myprefix:element type="B:myType"/>
             </myprefix:complexType>
     </myprefix:element>

     <myprefix:complexType name="myType">
         </myprefix:complexType>

</myprefix:schema>
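
The result above keeps the myprefix: prefixes but no longer declares them, so it is useful for text comparison rather than for re-parsing. If a well-formed, namespace-free document is needed instead, one alternative is to rename the tags on the tree itself. This is a sketch of a different technique from the regular-expression approach in the article, and the function name strip_namespaces is hypothetical.

```python
import lxml.etree as XE

def strip_namespaces(root):
    """Rewrite every tag and attribute name to its local name, in place."""
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # comments and processing instructions have no tag name
        el.tag = XE.QName(el).localname
        for name in list(el.attrib):
            local = XE.QName(name).localname
            if local != name:
                el.attrib[local] = el.attrib.pop(name)
    # Drop xmlns declarations that no element or attribute uses any more.
    XE.cleanup_namespaces(root)
    return root

doc = XE.fromstring(
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" '
    'xmlns:B="urn:B" targetNamespace="urn:B">'
    '<myprefix:element name="bar" type="B:myType"/>'
    '</myprefix:schema>'
)
cleaned = XE.tostring(strip_namespaces(doc), encoding="unicode")
print(cleaned)
```

Unlike the regular expression, this keeps ordinary root attributes such as targetNamespace. Prefixes that appear inside attribute values, for example type="B:myType", are plain text to lxml and stay as they are.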