An introduction to XML

What is XML?

XML, or eXtensible Markup Language, was created by the World Wide Web Consortium (W3C) to overcome the limitations of HTML. HTML tags tell a browser how to display information, but they don't tell the browser what that information is. With XML, you can give the tags in a document a meaning of their own, so that the document can be processed by a machine.

A simple XML File
<?xml version="1.0" encoding="UTF-8"?>
<blog>
  <post id="1" value="post1">
    <title>Serialization in Java</title>
  </post>
  <post id="2" value="post2">
    <title>serialVersionUId in Java Object Serialization</title>
  </post>
  <post id="3" value="post3">
    <title>Serializable vs Externizable</title>
  </post>
</blog>
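
To see how a program might read such a file, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes the posts are wrapped in a single <blog> root element and that every <post> is properly closed; the names BLOG_XML and list_posts are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample document, assuming a <blog> root element wraps the three posts.
BLOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blog>
  <post id="1" value="post1">
    <title>Serialization in Java</title>
  </post>
  <post id="2" value="post2">
    <title>serialVersionUId in Java Object Serialization</title>
  </post>
  <post id="3" value="post3">
    <title>Serializable vs Externizable</title>
  </post>
</blog>
"""


def list_posts(xml_text):
    """Return (id, title) pairs for every <post> directly under the root."""
    # Parsing bytes lets the parser honour the encoding named in the declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(post.get("id"), post.findtext("title")) for post in root.findall("post")]


posts = list_posts(BLOG_XML)
for post_id, title in posts:
    print(post_id, title)
```

The tag names tell the program which piece of text is a title and which value is an id, which is exactly the kind of meaning HTML tags do not carry.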

A little bit of XML terminology

  • XML Declaration: the XML declaration is recommended but not mandatory.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
version is the version of XML used. encoding is the character set used in this document; if no encoding is specified, the XML parser assumes UTF-8 (or UTF-16 if the document starts with a byte order mark). standalone, which can be either yes or no, states whether this document can be processed without reading any other files.
  • Tag: is the text between the left angle bracket (<) and the right angle bracket (>). There are starting tags such as <post>, ending tags such as </post>, and self-closing tags such as <publishedOn date="20-07-2012"/>
  • Element: is the starting tag, the ending tag, and everything in between, for example <post id="14"> <title>An Introduction to XML</title> ... </post>
  • Root Element: is the first element in your XML file, and it encloses all the other elements. Each XML document has exactly one root element, also known as the document element. In the example above, <blog> is the root element
  • Attribute: is a name-value pair inside the starting tag of an element. In the example above, id="1" and value="post1" are attributes
  • Comments: begin with <!-- and end with -->. They can appear anywhere in the document outside other markup, that is, not inside a tag and not before the XML declaration
  • Processing Instructions (PI): give a command or information to an application that is processing the XML.
<?target instruction?>
where target is the name of the application that is expected to do the processing, and instruction is the command or information for that application
NOTE: The XML declaration at the beginning of an XML document is not a processing instruction, even though it uses the same <? ... ?> syntax
  • Entities: are aliases for pieces of information. The XML spec predefines five entities that you can use in place of special characters:
    • &lt; for the less-than sign
    • &gt; for the greater-than sign
    • &quot; for a double-quote
    • &apos; for a single quote (or apostrophe)
    • &amp; for an ampersand.
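You can also define your own entities in a DTD, using the syntax
<!ENTITY name "definition">
For example:
<!ENTITY blogurl "">
Anywhere the XML processor finds the string &blogurl;, it replaces the entity reference with the string given in its definition.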
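
The terms above can be seen at work in a short sketch, again using Python's standard library. The document below is hypothetical: the entity name blogname and its value are made up for illustration and are not part of the original example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical document: the entity "blogname" and its value are assumptions.
DOC = """<!DOCTYPE blog [
  <!ENTITY blogname "My Tech Blog">
]>
<blog>
  <!-- comments are dropped by the default parser -->
  <post id="1" value="post1">
    <title>Tips &amp; Tricks for &blogname;</title>
  </post>
</blog>
"""

root = ET.fromstring(DOC)
post = root.find("post")
title = post.findtext("title")

print(root.tag)     # name of the root element
print(post.attrib)  # attributes as a dict of name-value pairs
print(title)        # entity references replaced by their text

# Going the other way: escape special characters before writing them as XML text.
escaped = escape("a < b & c")
print(escaped)
```

The parser replaces both the predefined entity &amp; and the custom entity &blogname; before the application ever sees the text, so the program works with plain strings.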

XML document rules

  • Root Element is mandatory. Every XML document must contain exactly one root element
  • Elements can't overlap - If you begin a <tag2> element inside <tag1>, then <tag2> must end before <tag1>
  • End tags are required - Every starting tag needs a matching ending tag, unless the element is written as a self-closing tag
  • Elements are case sensitive - In XML, <blog> and <Blog> are not the same. If you try to end a <blog> element with a </Blog> tag, you'll get an error.
  • Attributes must have values, and the values must be enclosed in quotation marks (single or double).
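  • Element names must follow these naming rules
    • they can contain letters, digits, hyphens, underscores and periods (colons are also allowed, but are reserved for namespaces)
    • they cannot contain spaces
    • they must begin with a letter or an underscore, not with a digit, hyphen or period
    • they should not start with xml (in any combination of upper and lower case), because such names are reserved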
  • The XML declaration, if present, must be the very first thing in the document
  • Avoid empty lines at the beginning of the document, because some XML parsers do not accept such files
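
Most of these rules can be checked by simply handing a document to a parser. The following is a minimal sketch using Python's xml.etree.ElementTree; the helper is_well_formed and the sample strings are illustrative assumptions, and other parsers may report these problems differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Each sample breaks one of the rules above, except the first.
SAMPLES = {
    "well-formed": '<blog><post id="1"/></blog>',
    "two root elements": "<blog/><blog/>",
    "overlapping elements": "<tag1><tag2></tag1></tag2>",
    "case mismatch": "<blog></Blog>",
    "unquoted attribute": "<post id=14></post>",
    "name starts with a digit": "<1post/>",
    "blank line before declaration": '\n<?xml version="1.0"?><blog/>',
}

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the first sample is accepted. Note that this parser does not reject element names that start with xml, so that rule has to be followed by convention.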

XML Advantages:

  • Easy Information Exchange - XML allows easy sharing of data between different applications  - even if these applications are written in different languages and reside on different platforms.
  • XML enables smart code - XML's rigid set of rules helps make documents more readable to both humans and machines. XML document syntax contains a fairly small set of rules, making it possible for developers to get started right away.  
  • Self-describing data - Every important piece of information (as well as the relationships between the pieces) can be identified easily.
  • Openness - It allows users to define their own tags and DTDs; applications can then use these sets of tags very easily.
  • Unicode support - Enables a wide variety of characters to be represented and communicated.

XML Disadvantages:

  • XML syntax is redundant and verbose. This may affect application efficiency through higher storage, transmission and processing costs
  • Because every XML vocabulary defines its own tags, you cannot have a single generic application that meaningfully processes all kinds of XML documents
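
As a rough illustration of the redundancy point, the sketch below compares the number of characters that carry data with the number spent on markup in a one-post document. The way data is counted here (element text plus attribute values) is an assumption made for this example, not a standard metric.

```python
import xml.etree.ElementTree as ET

# A single post from the earlier example, written on one line.
XML_TEXT = '<blog><post id="1" value="post1"><title>Serialization in Java</title></post></blog>'

root = ET.fromstring(XML_TEXT)

# Characters that carry data: element text plus attribute values.
data_chars = sum(len((el.text or "").strip()) for el in root.iter())
data_chars += sum(len(value) for el in root.iter() for value in el.attrib.values())

# Everything else is tags, attribute names, quotes and angle brackets.
total_chars = len(XML_TEXT)
markup_chars = total_chars - data_chars

print(f"total={total_chars} data={data_chars} markup={markup_chars}")
```

In this small document the markup takes up more room than the data itself, which is the overhead the first point refers to.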

