Parsing XML

Reading XML can be more complicated than writing it, if only because we have so many options for reading XML.

There are basically two approaches: using an XML parser or using XSLT. XSLT shines at transforming one type of XML markup into another type of markup, or into HTML, using standardized syntax and vocabulary. Generally, to process XML using XSLT you must run a program called an XSLT processor, such as Saxon or MSXSL. In contrast, parsers are good at picking out specific elements in XML files and are generally bound to programming languages, which means that you must write your parsing script in Java, Perl, PHP, etc. So, in general, you would choose XSLT if you were batch converting XML to HTML (or to another XML markup), and you would choose a parser if you were writing a script and wanted to access selected elements in your source XML.

Parsing XML: Four examples

Just as we used the XML::Writer module earlier to avoid having to write all the low-level XML writing routines ourselves, we will use the XML::Simple module to create our parsers. There are dozens of Perl modules for parsing XML, and believe it or not, XML::Simple is one of the most straightforward (there is another module called XML::Simpler, but it is not very mature at this time).

We will use the following XML file, called booklist.xml, in our examples:

<booklist>
   <book>
      <author>Book 1 author 1</author>
      <author>Book 1 author 2</author>
      <title>Book 1 title</title>
      <isbn>Book1ISBN</isbn>
   </book>
   <book>
      <author>Book 2 author 1</author>
      <author>Book 2 author 2</author>
      <title>Book 2 title</title>
      <isbn>Book2ISBN</isbn>
   </book>
   <book>
      <author>Book 3 author 1</author>
      <author>Book 3 author 2</author>
      <author>Book 3 author 3</author>
      <title>Book 3 title</title>
      <isbn>Book3ISBN</isbn>
   </book>
</booklist>

As you can see, this XML contains only elements, and no attributes. XML::Simple can handle attributes very well, as we will see at the end of this section.

Example 1: Converting from XML to a Perl hash record

This example is intended to illustrate how XML::Simple works internally rather than to solve a problem. Using XML::Simple and Perl's Data::Dumper module, the following script converts our sample XML file into a Perl hash record (which, as we know, is our name for a complex data structure in Perl, usually consisting of arrays of hashes):

#!/usr/bin/perl

# Script to illustrate how to parse a simple XML file
# and dump its contents in a Perl hash record.

use strict;
use XML::Simple;
use Data::Dumper;

my $booklist = XMLin('booklist.xml');

print Dumper($booklist);

Running this script outputs the following:

$VAR1 = {
          'book' => [
                    {
                      'isbn' => 'Book1ISBN',
                      'title' => 'Book 1 title',
                      'author' => [
                                  'Book 1 author 1',
                                  'Book 1 author 2'
                                ]
                    },
                    {
                      'isbn' => 'Book2ISBN',
                      'title' => 'Book 2 title',
                      'author' => [
                                  'Book 2 author 1',
                                  'Book 2 author 2'
                                ]
                    },
                    {
                      'isbn' => 'Book3ISBN',
                      'title' => 'Book 3 title',
                      'author' => [
                                  'Book 3 author 1',
                                  'Book 3 author 2',
                                  'Book 3 author 3'
                                ]
                    }
                  ]
        };

Using what we know about Perl hash records, we can use constructs like $booklist->{book}->[0]->{title} to access the title element of the first (i.e., index 0) book record the script encounters. Note that the book records are stored in an array, so they keep the order they had in the input file; it is the keys inside each hash (isbn, title, author) whose order Perl does not preserve, which is why they appear in the dump in a different order than in the XML. Even so, positional variables like $booklist->{book}->[0]->{title} aren't all that useful, because we rarely know in advance where a particular record sits in the file.
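
To make the access syntax concrete, here is a minimal sketch that does not parse anything. It builds by hand a structure shaped like the first two books of the dump above (an assumption made only to keep the example self-contained) and then reads values out of it with the "->" operator:

```perl
#!/usr/bin/perl

# Sketch: accessing values in a hash record shaped like the
# Data::Dumper output above. The structure is typed in by hand
# here; in a real script it would come from XMLin.

use strict;
use warnings;

my $booklist = {
    book => [
        {
            isbn   => 'Book1ISBN',
            title  => 'Book 1 title',
            author => ['Book 1 author 1', 'Book 1 author 2'],
        },
        {
            isbn   => 'Book2ISBN',
            title  => 'Book 2 title',
            author => ['Book 2 author 1', 'Book 2 author 2'],
        },
    ],
};

# Title of the first book (index 0).
print $booklist->{book}->[0]->{title} . "\n";

# Second author (index 1) of the second book (index 1).
print $booklist->{book}->[1]->{author}->[1] . "\n";

# Number of book records: dereference the array, then count it.
print scalar(@{$booklist->{book}}) . "\n";
```

Run as written, this should print "Book 1 title", "Book 2 author 2" and "2", each on its own line.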

Example 2: Picking out a particular book using a record key

Our sample XML contains a rudimentary record structure: each book has one title and one ISBN, and at least one author. Our file is so small that we probably would never need to write a script to find the title of a book with a given ISBN, but if our input XML contained thousands of records, we might want to write a simple parser to query the file. Here is a script that does this:

#!/usr/bin/perl

# Script to illustrate how to parse a simple XML file
# and pick out a particular element, in this case, the 
# title of the book with the ISBN 'Book2ISBN'.

use strict;
use XML::Simple;

# We use KeyAttr => {book => 'isbn'} to tell the parser to create
# a data structure that uses the isbn element as a lookup key.
my $booklist = XMLin('booklist.xml', KeyAttr => {book => 'isbn'});

print $booklist->{book}->{Book2ISBN}->{title} . "\n";

This is similar to the first script, but we have added a parameter to the XMLin function, "KeyAttr => {book => 'isbn'}". This tells our parser to create an internal Perl hash record using the values of the "isbn" elements as hash keys, so we can access particular XML records using regular Perl hash syntax. In the print statement above, we are telling our script to print out the title of the book element that has an "isbn" child element of a given value (in this case, "Book2ISBN").

To understand how this works, let's look at the Perl hash record that this parser creates, which is a bit different from the one created without the KeyAttr => {book => 'isbn'} parameter:

$VAR1 = {
          'book' => {
                    'Book3ISBN' => {
                                   'author' => [
                                               'Book 3 author 1',
                                               'Book 3 author 2',
                                               'Book 3 author 3'
                                             ],
                                   'title' => 'Book 3 title'
                                 },
                    'Book2ISBN' => {
                                   'author' => [
                                               'Book 2 author 1',
                                               'Book 2 author 2'
                                             ],
                                   'title' => 'Book 2 title'
                                 },
                    'Book1ISBN' => {
                                   'author' => [
                                               'Book 1 author 1',
                                               'Book 1 author 2'
                                             ],
                                   'title' => 'Book 1 title'
                                 }
                  }
        };

As you can see, the internal structure of the hash record has changed so that the values of the "isbn" elements are now the keys to each book record. Running the script supplied above outputs "Book 2 title", since that is the value of the "title" element in the record with the "isbn" value "Book2ISBN".
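
A lookup like this can fail when the ISBN is not in the file. The following is a sketch of one way to handle that, not the only one. It assumes a shortened copy of the sample XML passed to XMLin as a string (so the example needs no file), and title_for_isbn is a hypothetical helper name, not part of XML::Simple:

```perl
#!/usr/bin/perl

# Sketch: look up a title by ISBN and cope with an ISBN that
# is not present in the data.

use strict;
use warnings;
use XML::Simple;

# Shortened copy of the sample file, one author per book.
my $xml = <<'END_XML';
<booklist>
   <book>
      <author>Book 1 author 1</author>
      <title>Book 1 title</title>
      <isbn>Book1ISBN</isbn>
   </book>
   <book>
      <author>Book 2 author 1</author>
      <title>Book 2 title</title>
      <isbn>Book2ISBN</isbn>
   </book>
</booklist>
END_XML

my $booklist = XMLin($xml, KeyAttr => {book => 'isbn'});

# Hypothetical helper: returns the title, or undef if the ISBN is unknown.
sub title_for_isbn {
    my ($list, $isbn) = @_;
    # Checking with exists first avoids creating an empty
    # record for an ISBN that was never in the file.
    return unless exists $list->{book}->{$isbn};
    return $list->{book}->{$isbn}->{title};
}

foreach my $isbn ('Book2ISBN', 'NoSuchISBN') {
    my $title = title_for_isbn($booklist, $isbn);
    print "$isbn: ", (defined $title ? $title : 'not found'), "\n";
}
```

This should print "Book2ISBN: Book 2 title" followed by "NoSuchISBN: not found".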

Exercise

Change the script in Example 2 so that <title> is the 'key' in the XML records, and print out the ISBN of the book with the title 'Book 3 title'.

Example 3: Picking out all the values of a specific element

In this example, we are interested in printing out all the values of a given element (in this case, all the title elements in all records), not just the value of a particular element in a particular record. As we have learned already, whenever we use the phrase "all values" when writing Perl scripts, we need to iterate through an array using "foreach".

#!/usr/bin/perl

# Script to illustrate how to parse a simple XML file
# and pick out all the values for a specific element, in
# this case all the titles.

use strict;
use XML::Simple;

my $booklist = XMLin('booklist.xml');

foreach my $book (@{$booklist->{book}}) {
	print $book->{title} . "\n";
}

This is where we need to pause for a minute, since we need to get all the values we want to print out into an array. Luckily, because XML::Simple creates a hash reference from the XML file, the book records are already in an array; our job is to figure out how to access those values.

Let's refer again to our hash record printed out in the first example, above, so we can see how Perl stores the parsed XML internally. All of the "book" XML records are in an array that can be accessed with the syntax @{$booklist->{book}}. Like any other array, this one begins with an "@" sign. (Actually, $booklist->{book} is a reference to an array of hashes, and by adding the "@" we are dereferencing that array, or converting it back from a reference into a normal array. Don't think about it too hard; you'll get confused, trust us.) Using Perl's "->" operator, we can then access individual keys in the hashes that make up the array. If we use 'title' as the key, we get the value corresponding to that key in each record and can then print them out as normal strings:

Book 1 title
Book 2 title
Book 3 title
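
One caveat about the loop above: it relies on $booklist->{book} being an array reference. By default, XML::Simple only creates an array when an element occurs more than once, so a file with a single book (or a book with a single author) would give us a hash or a plain string instead. The sketch below uses XML::Simple's ForceArray option to ask for arrays regardless. The one-book XML string is made up for the illustration:

```perl
#!/usr/bin/perl

# Sketch: iterate over books and their authors even when there
# is only one of each, using the ForceArray option.

use strict;
use warnings;
use XML::Simple;

# Made-up input with a single book and a single author.
my $xml = <<'END_XML';
<booklist>
   <book>
      <author>Only author</author>
      <title>Only title</title>
      <isbn>OnlyISBN</isbn>
   </book>
</booklist>
END_XML

# ForceArray names the elements that should always be stored
# as array references, however many times they occur.
my $booklist = XMLin($xml, ForceArray => ['book', 'author']);

foreach my $book (@{$booklist->{book}}) {
    print $book->{title} . "\n";
    # Each book's authors are in their own nested array.
    foreach my $author (@{$book->{author}}) {
        print "   $author\n";
    }
}
```

This should print "Only title" and then the indented line "Only author". Without the ForceArray option, the outer foreach would fail under "use strict", because $booklist->{book} would be a hash reference rather than an array reference.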

Example 4: Parsing XML attributes

Let's add a "type" attribute to the "book" elements in our sample file:

<booklist>
   <book type="technical">
      <author>Book 1 author 1</author>
      <author>Book 1 author 2</author>
      <title>Book 1 title</title>
      <isbn>Book1ISBN</isbn>
   </book>
   <book type="fiction">
      <author>Book 2 author 1</author>
      <author>Book 2 author 2</author>
      <title>Book 2 title</title>
      <isbn>Book2ISBN</isbn>
   </book>
   <book type="technical">
      <author>Book 3 author 1</author>
      <author>Book 3 author 2</author>
      <author>Book 3 author 3</author>
      <title>Book 3 title</title>
      <isbn>Book3ISBN</isbn>
   </book>
</booklist>

XML::Simple handles attributes the same way it handles child elements: the attribute name is a key in the hash record. Here is a dump of the new sample file:

$VAR1 = {
          'book' => [
                    {
                      'isbn' => 'Book1ISBN',
                      'title' => 'Book 1 title',
                      'author' => [
                                  'Book 1 author 1',
                                  'Book 1 author 2'
                                ],
                      'type' => 'technical'
                    },
                    {
                      'isbn' => 'Book2ISBN',
                      'title' => 'Book 2 title',
                      'author' => [
                                  'Book 2 author 1',
                                  'Book 2 author 2'
                                ],
                      'type' => 'fiction'
                    },
                    {
                      'isbn' => 'Book3ISBN',
                      'title' => 'Book 3 title',
                      'author' => [
                                  'Book 3 author 1',
                                  'Book 3 author 2',
                                  'Book 3 author 3'
                                ],
                      'type' => 'technical'
                    }
                  ]
        };

The following script will print the titles of the books that have "technical" as the value of the "book" element's "type" attribute:

#!/usr/bin/perl

# Script to illustrate how to parse a simple XML file
# and print titles of type 'technical'.

use strict;
use XML::Simple;
use Data::Dumper;

my $booklist = XMLin('booklist.xml');
# print Dumper($booklist);

foreach my $book (@{$booklist->{book}}) {
	if ($book->{type} eq 'technical') {
		print $book->{title} . "\n";
	}
}

The script prints out the following:

Book 1 title
Book 3 title
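
The same loop can be extended to collect titles for every type at once rather than testing for a single value. This is a sketch under a few assumptions: the XML is a trimmed copy of the sample passed in as a string, the %titles_by_type hash is our own invention, and books without a "type" attribute are filed under the made-up label 'unknown':

```perl
#!/usr/bin/perl

# Sketch: group book titles by the value of the "type" attribute.

use strict;
use warnings;
use XML::Simple;

# Trimmed copy of the sample file (authors left out).
my $xml = <<'END_XML';
<booklist>
   <book type="technical">
      <title>Book 1 title</title>
      <isbn>Book1ISBN</isbn>
   </book>
   <book type="fiction">
      <title>Book 2 title</title>
      <isbn>Book2ISBN</isbn>
   </book>
   <book type="technical">
      <title>Book 3 title</title>
      <isbn>Book3ISBN</isbn>
   </book>
</booklist>
END_XML

my $booklist = XMLin($xml);

# Maps each type to an array of the titles that have that type.
my %titles_by_type;
foreach my $book (@{$booklist->{book}}) {
    # 'unknown' is an arbitrary label for books with no type attribute.
    my $type = defined $book->{type} ? $book->{type} : 'unknown';
    push @{$titles_by_type{$type}}, $book->{title};
}

# Sort the types so the output order is predictable.
foreach my $type (sort keys %titles_by_type) {
    print "$type: " . join(', ', @{$titles_by_type{$type}}) . "\n";
}
```

This should print "fiction: Book 2 title" and then "technical: Book 1 title, Book 3 title".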

It's that simple!