Here I will describe: registering xml namespaces, editing a pre-existing XML file, XPath search within python, Updating Elements, deleting xml elements and more...
The official python (or python3.8 or python3) XML docs are not too sweet especially if a person has never worked with XPath or A real world(relevant) XML file(not just an example file for acdemics).
Read this if you want avoid silly gotchas that would put your workday into NOGAF section of history marked with tears
The Concepts are described top to bottom to cover every(practical) aspect of XML handling - in sequence with Python (or python3.8) using native python "ElementTree" module(i'm not a fan of dependencies but lxml is compatible with stuff mentioned here, albeit not recommended)
You can Just read the python XML stuff you need to get shiz done
This article assumes you have read the docs but for the sake of S*O & w0r*d count here's how to use the native python ElementTree module to read a file
from xml.etree import ElementTree as ET # Read XML file tree = ET.parse(filepath, parser) # parser argument is optional but useful which shall be covered later root = tree.getroot()
The above code snippet reads the file you're dealing with & stores it into the root object(enabling convenient methods)
Alert ! Before doing that ↓↓↓↓
Namespace is used by a program to tell the difference between two elements with common name under different parents
The python(the snake) ELementTree needs a namespace associated with every element of the file the file otherwise a default namespace is added by the ElementTree (or etree if you're lxml fan)
Hence the need of registering namespaces. Here's what unintended consequences look like ↓↓↓↓:
In the above img one can see that the root of the xml is urlset and it has an attribute "xmlns" which is (obviously) "xml namespace" with the value being a url
Simply reading and writing to such a file can produce random namespace in the output file
To avoid that take a look at the following code
# Register Namespace ET.register_namespace('', "http://www.sitemaps.org/schemas/sitemap/0.9")
This sets the base xml namespace to the url. The first argument to this function is going to be the namespace prefix which is an empty string here because we're setting the base namespace.
In XML a "prefix" looks like this ---> :prefixname. The img a attribute with a prefix is "xmlns:image" whose value is yet another url. To set a prefix do this:
# Register a Prefix ET.register_namespace('image', "value")
If that doesn't set the prefix on the root element. You must do that using "set" method after loading the file with code covered in the Basics section. Hence the code looks like this now
# Register Namespace ET.register_namespace('', "http://www.sitemaps.org/schemas/sitemap/0.9") # The Following Bloddy mitch didn't work ET.register_namespace('image', "value") # Read XML file tree = ET.parse(filepath, parser) root = tree.getroot() # Set image because this -> register image mitch mitigate root.set("xmlns:prefixname", "value")
This resolves the following:
If you're parsing XML I assume you must have used bs4 to parse HTML files(which is essentially a special case of XML). You know that there are dozens of methods available that can find elements in HTML tree by id, class, or any other attribute or by element name parent and children.
Regardless of your XP with bs4, ElementTree supports XPath that can get you any element and start the process of doing stuff to it(CRUD operations). Here's a sample XPath Query to get a unique element with ElementTree.
.//{Your-Root-Namespace}The-Element's-name[@id='your-elenemets-id']
Without this the XPath query will always return None
After finding an element it can be treated like an ordinary python list
So to update children you can do
element[index-of-child-to-change].text = "random" element[index-of-child-to-change].set("id", "change child's attrib")
The set() method can be used on the main element to change it's attributes
ELements can be added anywhere in the tree
root.insert(index_where_to_add, element_to_add)
Since any element in the ELementTree is a list to remove an
root.remove(element)
This part is quite tedious but, ElementTree has a function called SubElement. Essentially a root element is need to start with which can be created by ET.Element("name of element. e.g 'url'")
Then call SubElement() and add as many elements to the root element like this:
url = ET.Element("url", attrib=dict(id=name)) loc = ET.SubElement(url, 'loc') loc.text=f"batman{name}" last = ET.SubElement(url, "lastmod") last.text="lslslsl"
The output is a "url" element with "lastmod" & "changefreq" xml blocks as children
Here I try to anwser how to preserve comments in the XML file? post editing with ElementTree.
If you use Python3.8, you're in luck, since in this version an extra keyworded argument in "insert_comments" makes that possible Without any other config.This is how to use it
# Custom Parser for preserving comments parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True)) # Now Send the parser to getroot() tree = ET.parse(filepath, parser) # As described in the Basics Section
The tree object which is basically the result of parser function loading a XML file into memory has the following method on it: tree.write()
This method can write the modified tree into the orignal file or a completey different file. How to use it:
tree.write(filepath, encoding="utf-8", xml_declaration=True, short_empty_elements=False)
xml_declaration arg is to preserve XML version found on the top. filename is the location of the file you want to write to.
Before I leave you on your own, here are some ELementTree gotchas
Voila! You just saved a ton of time googling shiz around. Now with the XML 'how tos' covered I leave the writing python code to tailor your program in your hands.