XML Parsing with Python 3.8 ElementTree: A Guide to Registering Namespaces and File Handling

Here I will describe: registering XML namespaces, editing a pre-existing XML file, XPath search within Python, updating elements, deleting XML elements and more...

The official Python (or python3.8 or python3) XML docs are not too sweet, especially if a person has never worked with XPath or a real-world (relevant) XML file (not just an example file for academics).

Read this if you want to avoid silly gotchas that would put your workday into the NOGAF section of history marked with tears.

The concepts are described top to bottom, in sequence, to cover every (practical) aspect of XML handling with Python (or python3.8) using the native Python "ElementTree" module (I'm not a fan of dependencies; lxml is largely compatible with the stuff mentioned here, albeit not recommended).

You can just read the Python XML stuff you need to get shiz done.

No.1 Basics of Efficiently Reading an XML file with ElementTree

This article assumes you have read the docs, but for the sake of S*O & w0r*d count here's how to use the native Python ElementTree module to read a file:

      from xml.etree import ElementTree as ET

      # Read XML file
      filepath = "sitemap.xml"  # example path, point this at your own file
      tree = ET.parse(filepath)  # ET.parse(filepath, parser) also takes an optional custom parser, covered later
      root = tree.getroot()
    

The above code snippet reads the file you're dealing with & hands you its root element as the root object (which enables convenient methods).
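
To get a feel for what the root object holds, here is a minimal sketch that parses a tiny sitemap from a string instead of a file. The XML content and the example.com URL are made up for illustration; only the sitemap namespace URL comes from this post.

```python
from xml.etree import ElementTree as ET

# Made-up sitemap content, used here instead of a real file
xml_text = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url id="home"><loc>https://example.com/</loc></url>'
    '</urlset>'
)

root = ET.fromstring(xml_text)

# ElementTree stores every tag as {namespace-uri}localname
root_tag = root.tag
child_tags = [child.tag for child in root]
print(root_tag)
print(child_tags)
```

Notice that the namespace URL is glued onto every tag name. That detail matters for everything that follows.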

Alert ! Before doing that ↓↓↓↓

No.2 Register Namespace (It's a necessity)

A namespace is used by a program to tell apart elements that share a name but belong to different vocabularies.

Python's (the snake's) ElementTree keeps track of the namespace of every element in the file, and when it writes the file back, any namespace that has no registered prefix gets an auto-generated one such as ns0.

Hence the need for registering namespaces. Here's what the unintended consequences look like ↓↓↓↓:

In the above img one can see that the root of the XML is urlset and that it has an attribute "xmlns" (which is, obviously, "XML namespace") whose value is a URL.

Simply reading and writing such a file can produce auto-generated prefixes like ns0: in the output file.

To avoid that, take a look at the following code:

      # Register Namespace
      ET.register_namespace('', "http://www.sitemaps.org/schemas/sitemap/0.9")
    

This sets the default XML namespace to the URL. The first argument to this function is the namespace prefix, which is an empty string here because we're setting the default (prefix-less) namespace.
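
Here is a small before/after sketch of that effect. It serializes a made-up sitemap string with ET.tostring() rather than writing a file, purely to keep the example self-contained.

```python
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
xml_text = f'<urlset xmlns="{NS}"><url><loc>https://example.com/</loc></url></urlset>'
root = ET.fromstring(xml_text)

# Without registration, ElementTree invents a prefix such as ns0
before = ET.tostring(root, encoding="unicode")

# Register the URI as the default (prefix-less) namespace, then serialize again
ET.register_namespace("", NS)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

The registration is global, so it only has to happen once, before anything is written.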

In XML a "prefix" is declared like this ---> xmlns:prefixname. In the img, an attribute with a prefix is "xmlns:image", whose value is yet another URL. To set a prefix do this:

      # Register a Prefix
      ET.register_namespace('image', "value")
    

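If that doesn't put the prefix on the root element (ElementTree only writes declarations for namespaces that are actually used in the tree), you must set it yourself using the "set" method after loading the file with the code covered in the Basics section. Hence the code looks like this now: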

      # Register Namespace
      ET.register_namespace('', "http://www.sitemaps.org/schemas/sitemap/0.9")

      # Registering the prefix alone didn't write it to the root element
      ET.register_namespace('image', "value")

      # Read XML file
      tree = ET.parse(filepath)  # plus the optional parser, if you use one
      root = tree.getroot()

      # So declare the prefix on the root element by hand
      root.set("xmlns:image", "value")
    

This resolves the following:

  1. ValueError: cannot use non-qualified names with default_namespace option
  2. How to preserve namespaces while writing to an XML file in Python 3
  3. How to remove the ns0: prefix
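
To see why the manual set() call can be needed, here is a rough sketch where a prefix is declared in the source but never used by any element. IMAGE_NS is a hypothetical URI standing in for the "value" placeholder above, and the sitemap content is made up.

```python
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://example.com/image-ns"  # hypothetical URI, stands in for "value"

xml_text = (
    f'<urlset xmlns="{NS}" xmlns:image="{IMAGE_NS}">'
    '<url><loc>https://example.com/</loc></url></urlset>'
)

ET.register_namespace("", NS)
ET.register_namespace("image", IMAGE_NS)
root = ET.fromstring(xml_text)

# No element uses the image namespace, so its declaration is dropped on output
without_set = ET.tostring(root, encoding="unicode")

# Declaring it by hand on the root element brings it back
root.set("xmlns:image", IMAGE_NS)
with_set = ET.tostring(root, encoding="unicode")

print(without_set)
print(with_set)
```

One caveat: if some element in the tree does use the image namespace, ElementTree writes the declaration itself, and a manual set() on top of that would declare the same attribute twice. So only reach for set() when the prefix really is missing from the output.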

Finding Elements with XPath in python with ElementTree

If you're parsing XML I assume you must have used bs4 to parse HTML files (HTML being a close relative of XML). You know that there are dozens of methods available that can find elements in an HTML tree by id, class, or any other attribute, or by element name, parent and children.

Regardless of your experience with bs4, ElementTree supports a subset of XPath that can get you any element and start the process of doing stuff to it (CRUD operations). Here's a sample XPath query to get a unique element with ElementTree:

      .//{Your-Root-Namespace}The-Element-Name[@id='your-element-id']
    

There are TWO MAJOR GOTCHAS here (that you will avoid if you're the lucky reader of this post):

  1. Mentioning the namespace in the query. The namespace must be included in the query. Simply looking for element X (e.g. "url") is not sufficient. This can potentially waste a shiz ton of time.
  2. The quotes around the id value (right after the =) are NOT OPTIONAL. ElementTree accepts either single or double quotes there, but single quotes are the easy choice because they sit inside a double-quoted Python string (format strings included) without any escaping.

    Without the namespace, find() always returns None (and findall() an empty list); without the quotes, ElementTree rejects the query as an invalid predicate.
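
Both gotchas can be seen in a short sketch like the one below. The sitemap content and ids are made up, and the prefix "sm" in the last query is an arbitrary name picked for the example.

```python
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
xml_text = (
    f'<urlset xmlns="{NS}">'
    '<url id="home"><loc>https://example.com/</loc></url>'
    '<url id="about"><loc>https://example.com/about</loc></url>'
    '</urlset>'
)
root = ET.fromstring(xml_text)

# Namespace missing from the query: nothing matches, find() returns None
missing_ns = root.find(".//url[@id='about']")

# Namespace written as {uri}tag, attribute value in single quotes
found = root.find(f".//{{{NS}}}url[@id='about']")

# Same query with a prefix-to-URI mapping instead of the long {uri} form
found_via_map = root.find(".//sm:url[@id='about']", {"sm": NS})

print(missing_ns)
print(found.get("id"), found_via_map.get("id"))
```

The mapping variant is handy when the namespace URL is long, since the query itself stays readable.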

Updating An Element

After finding an element, it can be treated like an ordinary Python list of its children.

So to update children you can do:

      element[child_index].text = "random"  # child_index: position of the child to change
      element[child_index].set("id", "change child's attrib")
    

The set() method can also be used on the main element to change its attributes.

Elements can be added anywhere in the tree:

      root.insert(index_where_to_add, element_to_add)
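
Putting the update and insert calls together, a minimal sketch could look like this. The element content, dates and the "changefreq" child are invented for the example, and namespaces are left out to keep it short.

```python
from xml.etree import ElementTree as ET

url = ET.fromstring(
    '<url id="home"><loc>https://example.com/</loc><lastmod>2020-01-01</lastmod></url>'
)

# Children are reached by index, like list items
url[1].text = "2020-06-01"
url[0].set("id", "main-loc")

# set() on the element itself changes its own attributes
url.set("id", "landing")

# Insert a new child at position 1 (between loc and lastmod)
changefreq = ET.Element("changefreq")
changefreq.text = "weekly"
url.insert(1, changefreq)

child_order = [child.tag for child in url]
print(ET.tostring(url, encoding="unicode"))
```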
    

Deleting An Element

Since any element in the ElementTree behaves like a list, to remove a child call remove() on its direct parent:

      root.remove(element)
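
One thing that can bite here: remove() only works on direct children of the element it is called on. The sketch below (made-up content, no namespaces) removes one direct child and one nested element by going through its parent.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    '<urlset>'
    '<url id="a"><loc>https://example.com/a</loc></url>'
    '<url id="b"><loc>https://example.com/b</loc></url>'
    '</urlset>'
)

# Direct child of root: remove it from root
target = root.find("url[@id='a']")
root.remove(target)

# Nested element: find its parent first and remove it from there
parent = root.find("url[@id='b']")
parent.remove(parent.find("loc"))

remaining_ids = [child.get("id") for child in root]
print(ET.tostring(root, encoding="unicode"))
```

Calling root.remove() on an element that is not a direct child of root raises a ValueError, so keep a handle on the parent.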
    

Creating & Adding Elements to ElementTree

This part is quite tedious, but ElementTree has a function called SubElement. Essentially, a root element is needed to start with, which can be created by ET.Element("name of element"), e.g. ET.Element("url").

Then call SubElement() to add as many elements to the root element as you need, like this:

      # name is a string you define yourself, e.g. the page's id
      url = ET.Element("url", attrib=dict(id=name))
      loc = ET.SubElement(url, 'loc')
      loc.text = f"batman{name}"
      last = ET.SubElement(url, "lastmod")
      last.text = "lslslsl"
    

The output is a "url" element with "loc" & "lastmod" XML blocks as children.
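
For a complete picture, here is a sketch that builds such a block, attaches it to a parent and prints the result. The name value, URL and date are hypothetical, and the parent "urlset" is created from scratch here rather than loaded from a file.

```python
from xml.etree import ElementTree as ET

name = "robin"  # hypothetical id value

url = ET.Element("url", attrib={"id": name})
loc = ET.SubElement(url, "loc")
loc.text = f"https://example.com/{name}"
last = ET.SubElement(url, "lastmod")
last.text = "2020-01-01"

# Attach the finished block to a parent element
urlset = ET.Element("urlset")
urlset.append(url)

xml_out = ET.tostring(urlset, encoding="unicode")
print(xml_out)
```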

Custom Parser To Preserve Comments

Here I try to answer: how to preserve comments in the XML file after editing it with ElementTree?

If you use Python 3.8, you're in luck, since in this version TreeBuilder takes an extra keyword argument, "insert_comments", that makes this possible without any other config. This is how to use it:

      # Custom Parser for preserving comments
      parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))

      # Now send the parser to ET.parse()
      tree = ET.parse(filepath, parser) # As described in the Basics Section
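
A quick way to check the difference is to parse the same made-up string with and without the custom parser. This sketch uses ET.fromstring() in place of a file and needs Python 3.8 or newer.

```python
from xml.etree import ElementTree as ET

xml_text = "<urlset><!-- keep me --><url><loc>https://example.com/</loc></url></urlset>"

# Default parser: comments are discarded
plain_root = ET.fromstring(xml_text)
plain_out = ET.tostring(plain_root, encoding="unicode")

# Parser whose TreeBuilder inserts comments into the tree (Python 3.8+)
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(xml_text, parser=parser)
kept_out = ET.tostring(kept_root, encoding="unicode")

print(plain_out)
print(kept_out)
```

As far as I can tell this covers comments inside the root element; comments sitting outside the root (e.g. above it) may still get lost, so check your output if you rely on those.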
    

Saving All changes made by algorithm to file i.e Writing To The XML File

The tree object, which is basically the result of the parse function loading an XML file into memory, has the following method on it: tree.write()

This method can write the modified tree into the original file or a completely different file. How to use it:

      tree.write(filepath, encoding="utf-8", xml_declaration=True, short_empty_elements=False)
    

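The xml_declaration arg writes the XML declaration (with the XML version) at the top of the file. filepath is the location of the file you want to write to. short_empty_elements=False writes empty elements as <tag></tag> instead of <tag />.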
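
Here is a round-trip sketch of the write step. It writes into a temporary directory as a stand-in for your real filepath, and the sitemap content is made up.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)

root = ET.fromstring(
    f'<urlset xmlns="{NS}"><url><loc>https://example.com/</loc></url></urlset>'
)
tree = ET.ElementTree(root)

# Write to a temporary file (stand-in for your real filepath), then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "sitemap.xml")
    tree.write(filepath, encoding="utf-8", xml_declaration=True, short_empty_elements=False)
    with open(filepath, encoding="utf-8") as fh:
        written = fh.read()

print(written)
```

The file starts with the XML declaration and keeps the plain xmlns attribute, with no ns0: prefix in sight.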

Conclusion

Before I leave you on your own, here are some ElementTree gotchas:

  1. Keep the existing namespaces of an XML file while overwriting it, without changing its schema (see registering namespaces)
  2. XPath query always returns None // not working
    1. You need to add the namespace to the query
    2. While searching by attrib, the value must be surrounded by quotes (single quotes are the easy choice)
  3. How to preserve comments of an XML input file (use custom ElementTree parser)

Voila! You just saved a ton of time googling shiz around. Now, with the XML 'how tos' covered, I leave writing the Python code to tailor your program in your hands.
