
Python XML parsing

What is XML?

XML stands for eXtensible Markup Language. You can learn more about it in the XML Tutorial on this site.

XML is designed to transmit and store data.

XML is a set of rules for defining semantic tags. These tags divide a document into parts and identify each part.

It is also a meta-markup language: it defines the syntax used to create other domain-specific, semantic, structured markup languages.


Parsing XML with Python

DOM and SAX are the two most common XML programming interfaces. They process XML files in different ways, so each suits different situations.

Python offers three ways to parse XML: SAX, DOM, and ElementTree:

1. SAX (Simple API for XML)

The Python standard library includes a SAX parser. SAX uses an event-driven model: as it parses the XML it fires events one by one and calls user-defined callback functions to handle the content.

2. DOM (Document Object Model)

The XML data is parsed into a tree held in memory, and the XML is manipulated by operating on that tree.

3. ElementTree (element tree)

ElementTree works like a lightweight DOM with a convenient, friendly API. It is easy to code against, fast, and uses little memory.

Note: Because DOM must map the whole XML document into an in-memory tree, it is slower and uses more memory. SAX reads the XML file as a stream, so it is faster and uses less memory, but it requires the user to implement callback functions (handlers).
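The worked examples in this section cover SAX and DOM only. As a rough sketch of the third option, the snippet below uses xml.etree.ElementTree from the standard library on a small inline document shaped like the movies.xml sample. The inline string and the variable names are illustrative assumptions, not part of the original tutorial.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for an XML file; a shortened, illustrative document.
XML_TEXT = """<collection shelf="New Arrivals">
  <movie title="Enemy Behind">
    <type>War, Thriller</type>
    <year>2003</year>
  </movie>
  <movie title="Ishtar">
    <type>Comedy</type>
  </movie>
</collection>"""

root = ET.fromstring(XML_TEXT)   # root is the <collection> element
shelf = root.get("shelf")        # attribute lookup; None if the attribute is absent

movies = []
for movie in root.findall("movie"):
    movies.append({
        "title": movie.get("title"),
        "type": movie.findtext("type"),
        # findtext returns the default when the child element is missing
        "year": movie.findtext("year", default="unknown"),
    })

print(shelf)
for entry in movies:
    print(entry["title"], "|", entry["type"], "|", entry["year"])
```

Like DOM, this loads the whole document into memory, but child lookups and missing-element defaults take a single call each.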

This section uses the following sample XML document, movies.xml:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Parsing XML with SAX in Python

SAX is an event-driven API.

Parsing an XML document with SAX involves two parts: the parser and the event handler.

The parser reads the XML document and sends events to the event handler, such as element-start and element-end events.

The event handler responds to those events and processes the XML data passed to it.

SAX is a good fit when:

  • 1. you need to process large files;
  • 2. you need only part of a file's contents, or just specific information from it;
  • 3. you want to build your own object model.

To parse XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.

ContentHandler class methods

characters(content) method

When it is called:

From the start of a line up to the first tag, if there are characters, content holds that string.

From one tag up to the next tag, if there are characters, content holds that string.

From a tag up to the line terminator, if there are characters, content holds that string.

The tag may be either a start tag or an end tag.
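A consequence of these rules is that characters() also fires for whitespace between tags, and a parser may deliver one run of text in more than one call. The sketch below records every call so the pieces can be inspected; the TextCollector class name is hypothetical.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records every characters() call."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Called for text between tags, including whitespace-only runs.
        self.chunks.append(content)


collector = TextCollector()
xml.sax.parseString(b"<movie>\n  <year>2003</year>\n</movie>", collector)

# How the text is split into chunks is up to the parser, so join before use.
all_text = "".join(collector.chunks)
print(repr(all_text))
print(all_text.strip())
```

Joining the chunks and stripping whitespace is a common way to get stable text regardless of how the parser splits it.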

startDocument() method

Called when the document starts.

endDocument() method

Called when the parser reaches the end of the document.

startElement(name, attrs) method

Called when an XML start tag is encountered. name is the tag name, and attrs is a dictionary-like object holding the tag's attributes.

endElement(name) method

Called when an XML end tag is encountered.
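To see the order in which these callbacks fire, a minimal sketch can log each event while parsing a one-line document. The EventLogger class and the inline document are assumptions made for illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records the order of SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # attrs offers a mapping-like interface; copy it into a plain dict
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


logger = EventLogger()
xml.sax.parseString(b'<movie title="Ishtar"><stars>2</stars></movie>', logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument, nests the start and end events of stars inside those of movie, and finishes with endDocument.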


make_parser method

The following function creates a new parser object and returns it.

xml.sax.make_parser( [parser_list] )

Parameter Description:

  • parser_list - optional parameter, a list of parser module names to try
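As a minimal sketch of how a parser object from make_parser is typically used, the snippet below attaches a handler and feeds the parser an in-memory byte stream in place of a file. The TagCounter class and the inline document are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class TagCounter(ContentHandler):
    """Hypothetical handler that counts start tags by name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


parser = xml.sax.make_parser()             # default parser
parser.setFeature(feature_namespaces, 0)   # plain, non-namespace callbacks
counter = TagCounter()
parser.setContentHandler(counter)

# parse() accepts a file name or a file-like object; BytesIO stands in for a file.
parser.parse(io.BytesIO(b"<collection><movie/><movie/></collection>"))
print(counter.counts)
```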

parse method

The following function creates a SAX parser and parses an XML document:

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

Parameter Description:

  • xmlfile - the XML file name
  • contenthandler - must be a ContentHandler object
  • errorhandler - if specified, must be a SAX ErrorHandler object
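A minimal sketch of calling xml.sax.parse with a file name follows. It writes a small document to a temporary directory first so that it runs on its own; the TitleCollector class, the collect_titles helper, and the file contents are assumptions for illustration.

```python
import os
import tempfile
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers the title attribute of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "movie":
            self.titles.append(attrs["title"])


def collect_titles(path):
    """Parse the XML file at path and return the movie titles in order."""
    handler = TitleCollector()
    xml.sax.parse(path, handler)   # errorhandler omitted: the default is used
    return handler.titles


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<collection><movie title="Trigun"/><movie title="Ishtar"/></collection>')
    titles = collect_titles(path)

print(titles)
```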

parseString method

The parseString function creates an XML parser and parses an XML string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Parameter Description:

  • xmlstring - the XML string
  • contenthandler - must be a ContentHandler object
  • errorhandler - if specified, must be a SAX ErrorHandler object
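When no errorhandler is passed, the default error handler raises an exception on a fatal error such as malformed XML. The sketch below relies on that behavior for a simple well-formedness check; the is_well_formed helper is hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses without a fatal SAX error."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


print(is_well_formed(b"<movie><year>2003</year></movie>"))
print(is_well_formed(b"<movie><year>2003</movie>"))
```

The second document closes movie while year is still open, so the parser reports a fatal error and the helper returns False.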

Python XML parsing example

#!/usr/bin/env python3

import xml.sax
import xml.sax.handler

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      super().__init__()
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Handle element-start events
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print("*****Movie*****")
         title = attributes["title"]
         print("Title:", title)

   # Handle element-end events
   def endElement(self, tag):
      if self.CurrentData == "type":
         print("Type:", self.type)
      elif self.CurrentData == "format":
         print("Format:", self.format)
      elif self.CurrentData == "year":
         print("Year:", self.year)
      elif self.CurrentData == "rating":
         print("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print("Stars:", self.stars)
      elif self.CurrentData == "description":
         print("Description:", self.description)
      self.CurrentData = ""

   # Handle character-data events
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if __name__ == "__main__":

   # Create an XMLReader
   parser = xml.sax.make_parser()
   # Turn off namespaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # Install our handler in place of the default ContentHandler
   Handler = MovieHandler()
   parser.setContentHandler(Handler)

   parser.parse("movies.xml")

Running the code above produces the following output:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
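For the "build your own object model" use case mentioned earlier, a handler can collect the data instead of printing it. The sketch below gathers each movie into a dictionary and joins the text chunks of every child element, since characters() may be called more than once per element. The MovieCollector class and the shortened inline document are assumptions for illustration.

```python
import xml.sax


class MovieCollector(xml.sax.ContentHandler):
    """Hypothetical handler that builds a list of dicts, one per <movie>."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None   # dict for the movie being read, if any
        self._buffer = []      # text chunks of the element being read

    def startElement(self, name, attrs):
        if name == "movie":
            self._current = {"title": attrs.get("title")}
        self._buffer = []

    def characters(self, content):
        self._buffer.append(content)

    def endElement(self, name):
        if name == "movie":
            self.movies.append(self._current)
            self._current = None
        elif self._current is not None:
            # Child of a movie: store its joined, stripped text under its tag name
            self._current[name] = "".join(self._buffer).strip()
        self._buffer = []


SAMPLE = b"""<collection shelf="New Arrivals">
<movie title="Trigun">
   <type>Anime, Action</type>
   <episodes>4</episodes>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
</movie>
</collection>"""

collector = MovieCollector()
xml.sax.parseString(SAMPLE, collector)
for movie in collector.movies:
    print(movie)
```

Because the handler keys each value by tag name, elements that appear in only some movies (such as episodes) need no special handling.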

For the complete SAX API documentation, refer to the Python SAX APIs.


Parsing XML with xml.dom

The Document Object Model (DOM) is a W3C-recommended standard programming interface for processing Extensible Markup Language documents.

When a DOM parser parses an XML document, it reads the entire document at once and stores all of its elements in a tree structure in memory. You can then use the functions DOM provides to read or modify the document's content and structure, and write the modified content back to an XML file.

Python uses xml.dom.minidom to parse XML documents, as in the following example:
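The read, modify, and write-back cycle can be sketched with xml.dom.minidom on a one-movie inline document. The inline string and the particular changes made are illustrative assumptions; toxml() is used here to serialize the modified tree to a string.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<collection shelf="New Arrivals">'
    '<movie title="Ishtar"><stars>2</stars></movie>'
    '</collection>'
)
collection = doc.documentElement

# Modify an attribute and a text node in the in-memory tree.
collection.setAttribute("shelf", "Classics")
stars = collection.getElementsByTagName("stars")[0]
stars.firstChild.data = "3"

# Add a new child element to the movie.
movie = collection.getElementsByTagName("movie")[0]
fmt = doc.createElement("format")
fmt.appendChild(doc.createTextNode("VHS"))
movie.appendChild(fmt)

# Serialize the modified tree back to XML text.
updated = collection.toxml()
print(updated)
```

Writing the result to disk is then a matter of saving the serialized string to a file.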

#!/usr/bin/env python3

import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
   print("*****Movie*****")
   if movie.hasAttribute("title"):
      print("Title: %s" % movie.getAttribute("title"))

   # Names end in _node to avoid shadowing the built-ins type() and format()
   type_node = movie.getElementsByTagName('type')[0]
   print("Type: %s" % type_node.childNodes[0].data)
   format_node = movie.getElementsByTagName('format')[0]
   print("Format: %s" % format_node.childNodes[0].data)
   rating_node = movie.getElementsByTagName('rating')[0]
   print("Rating: %s" % rating_node.childNodes[0].data)
   description_node = movie.getElementsByTagName('description')[0]
   print("Description: %s" % description_node.childNodes[0].data)

Running the program above produces the following output:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
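The DOM example indexes [0] on every lookup, which raises IndexError when a movie lacks that element; in the sample document, for instance, two movies have no year element. A small helper, sketched below under the hypothetical name child_text, returns a default instead.

```python
from xml.dom.minidom import parseString


def child_text(element, tag, default=""):
    """Return the text of the first <tag> descendant, or default if missing or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data


doc = parseString(
    '<collection>'
    '<movie title="Enemy Behind"><year>2003</year></movie>'
    '<movie title="Trigun"><episodes>4</episodes></movie>'
    '</collection>'
)

years = {}
for movie in doc.getElementsByTagName("movie"):
    title = movie.getAttribute("title")
    years[title] = child_text(movie, "year", default="n/a")

print(years)
```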

For the complete DOM API documentation, refer to the Python DOM APIs.