Python for Data Science – Importing XML to Pandas DataFrame

In my previous post, I showed how easy to import data from CSV, JSON, Excel files using Pandas package. Another popular format to exchange data is XML. Unfortunately Pandas package does not have a function to import data from XML so we need to use standard XML package and do some extra work to convert the data to Pandas DataFrames.

Here’s a sample XML file (save it as test.xml):

We want to convert his to a dataframe which contains customer name, email, phone and street:

As you can see, we need to read attribute of an XML tag (customer name), text value of sub elements (address/street), so although we will use a very simple method, it will show you how to parse even complex XML files using Python.

Python provides a comprehensive XML package which provides different APIs to parse XML. Unfortunately none of them are secure against erroneous or maliciously constructed data. Please do not forget to check XML vulnerabilities. Anyway, I prefer the ElementTree XML API which is a simple and lightweight XML processor.

To read and parse an XML file, all we need is to call the parse method:

The above code will return an ElementTree object, then we can use “iter()” method to generate an iterator (for specific XML elements) or “getroot()” to get the root element for this tree, and then iterate all elements.

It’s almost done. We get customer data (name, email, phone and street). To be able to add these data to a DataFrame, we need to define a DataFrame before we iterate elements, then for each customer, we build a Pandas.Series object (an array), and append this Series object to the DataFrame. While doing this, we should also be ready to handle “null” values because some XML paths might be missing on our XML file.

Here’s the full script:

And the output:

As you can see, I defined a function to handle missing paths, and also gave the same column names while appending the series to the DataFrame (otherwise they would be added as additional columns). Please do not hesitate to ask questions! See you on next blog post!

Please share
  • 2
  •  
  •  
  •  
  •  
  •  

AWS Big Data Specialist. Oracle Certified Professional (OCP) for EBS R12, Oracle 10g and 11g. Co-author of "Expert Oracle Enterprise Manager 12c" book published by Apress. Awarded as Oracle ACE (in 2011) and Oracle ACE Director (in 2016) for the continuous contributions to the Oracle users community. Founding member, and vice president of Turkish Oracle User Group (TROUG). Presented at various international conferences including Oracle Open World.

11 Comments

  1. Kyle

    Hey I love what you’ve done here but what I’ve seen is that almost all .xml to pandas code requires an almost PERFECT .XML file. Any luck with getting around a clean set>

  2. JISNA JOSE

    Sir I am Jisna
    gokhanatil sir u r absolutely wonderfull
    come to kottayam good fud and accomodation
    jai bolo pallikkathode ke…

  3. John

    Very helpful post! I’ve found that for a larger dataset, this will run faster if you create a list of dictionaries, rather than constructing a new Series for every row:


    def getDataFromXML(filePath, attribs):
    parsed_xml = et.parse(filePath)
    df_rows = []
    for node in parsed_xml.getroot():
    row_dict = {}
    for attrib in attribs:
    row_dict[attrib] = node.attrib.get(attrib)
    df_rows.append(row_dict)

    return pd.DataFrame(df_rows)

  4. Ashfaque Sayed

    Hi
    The URL link for “import data from CSV, JSON, Excel files using Pandas package” is broken. Can you please check it.
    regards,
    Ashfaque

  5. Lindsey de Beer

    Hi,
    We are working on a project with XML files and we want to convert this to a dataframe. However, if I try your code I get the output None.
    This is the code that I used:
    import xml.etree.cElementTree as et
    import pandas as pd

    def getvalueofnode(node):
    “”” return node text or None “””
    return node.text if node is not None else None

    def main():
    “”” main “””
    parsed_xml = et.parse(“bhic_a2a_mvs_a-201911.xml”)
    dfcols = [‘PersonName’, ‘SourceType’]
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
    name = node.attrib.get(‘PersonName’)
    source = node.find(‘SourceType’)

    df_xml = df_xml.append(
    pd.Series([name, getvalueofnode(source)], index=dfcols),
    ignore_index=True)

    print(df_xml)

    main()

    The output is a dataframe with None everywhere

    • Gokhan Atil

      Hi Lindsey, can you share 1 or 2 sample lines of your XML file? Please do not forget to mask the senstive data.

Leave a Reply to Kyle Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.