Python for Data Science – Importing XML to Pandas DataFrame
In my previous post, I showed how easy it is to import data from CSV, JSON, and Excel files using the Pandas package. Another popular format for exchanging data is XML. Unfortunately, the Pandas package does not have a function to import data from XML, so we need to use the standard XML package and do some extra work to convert the data to Pandas DataFrames.
Here’s a sample XML file (save it as test.xml):
<?xml version="1.0"?>
<data>
    <customer name="gokhan" >
        <email>gokhan@gmail.com</email>
        <phone>555-1234</phone>
    </customer>
    <customer name="mike" >
        <email>mike@gmail.com</email>
    </customer>
    <customer name="john" >
        <email>john@gmail.com</email>
        <phone>555-4567</phone>
    </customer>
    <customer name="david" >
        <phone>555-6472</phone>
        <address>
            <street>Fifth Avenue</street>
        </address>
    </customer>
</data>
We want to convert this to a DataFrame that contains the customer name, email, phone and street:
     name             email     phone        street
0  gokhan  gokhan@gmail.com  555-1234          None
1    mike    mike@gmail.com      None          None
2    john    john@gmail.com  555-4567          None
3   david              None  555-6472  Fifth Avenue
As you can see, we need to read an attribute of an XML tag (the customer name), the text of direct sub-elements (email, phone), and the text of a nested sub-element (address/street). So although we will use a very simple method, it shows how to parse even complex XML files using Python.
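To make those three kinds of lookups concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is one possible approach under stated assumptions, not necessarily the exact method this post goes on to use. The names XML_TEXT, COLUMNS and customer_rows are hypothetical, chosen only for this illustration, and the sample document is inlined so the snippet runs without test.xml (with the file on disk, ET.parse("test.xml").getroot() would give the same root element).

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document (assumption: same content as test.xml).
XML_TEXT = """<?xml version="1.0"?>
<data>
    <customer name="gokhan" >
        <email>gokhan@gmail.com</email>
        <phone>555-1234</phone>
    </customer>
    <customer name="mike" >
        <email>mike@gmail.com</email>
    </customer>
    <customer name="john" >
        <email>john@gmail.com</email>
        <phone>555-4567</phone>
    </customer>
    <customer name="david" >
        <phone>555-6472</phone>
        <address>
            <street>Fifth Avenue</street>
        </address>
    </customer>
</data>"""

# Hypothetical column order, matching the target table.
COLUMNS = ["name", "email", "phone", "street"]


def customer_rows(root):
    """Return one dict per <customer> element found under root."""
    rows = []
    for customer in root.iter("customer"):
        rows.append({
            # Attribute of the tag itself; None if the attribute is absent.
            "name": customer.get("name"),
            # Text of a direct child element; None if the child is absent.
            "email": customer.findtext("email"),
            "phone": customer.findtext("phone"),
            # Text of a nested child, addressed with a simple path.
            "street": customer.findtext("address/street"),
        })
    return rows


root = ET.fromstring(XML_TEXT)
rows = customer_rows(root)
for row in rows:
    print(row)
```

Each row is a plain dictionary with None for missing fields, so passing the list to pandas.DataFrame(rows, columns=COLUMNS) should produce the four-column table shown earlier.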