Many Different Forms of XML
This is one of a series of posts in which I provide information to help figure out which data format is best to use, and why. Basically: when is XML better, when is XBRL better, and when is RDF/OWL better?
In another blog post I looked at different information exchange formats. In that post I mentioned that the world was standardizing on XML. But which form of XML? XML can come in many, many different forms.
I took a small data set which I had in a database and generated XML from that data set. The data set is simple enough: the population of each U.S. state. This PDF shows what the data set looks like in a rendered format.
Here are some XML files which I generated from the same Microsoft Access database information:
- Variation 1 of Traditional: This form of XML uses elements (rather than attributes).
- Variation 2 of Traditional: This form of XML uses one element 'State' and the values are attributes.
- Variation 3 of Traditional: This form of XML is much like the first variation, but using some different element names and a slightly different configuration.
- Variation 4 of Traditional: This form of XML is much like the first and third variations, but using different element names.
- Variation 5 of Traditional: This form of XML is much like the second variation using attributes, but different element names are used. Also note that the ID is the abbreviation of the state name.
- Microsoft Access Auto XML: This form of XML was auto-generated by Microsoft Access. This is an XML Schema for this form of XML.
- Microsoft Excel Auto XML: This form of XML was auto-generated by Microsoft Excel; it is the Excel XML format.
- RDF/OWL: This form of XML is RDF/OWL.
- XBRL: This form of XML is XBRL in a very simple form. This is an XML Schema for this XML. This is a validation report that shows that the population of the individual states adds up to the total population. (I will explain this in a bit.)
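To make the difference between the element-based and attribute-based variations concrete, here is a minimal sketch in Python. The element names, attribute names, state names, and population figures are all made up for illustration; they are not taken from the actual files linked above. The sketch only shows that two different syntaxes can be read into the same data.

```python
import xml.etree.ElementTree as ET

# Hypothetical element-based form (in the spirit of Variation 1).
# Names and numbers are illustrative placeholders, not real data.
ELEMENT_FORM = """
<States>
  <State><Name>Alpha</Name><Population>100</Population></State>
  <State><Name>Beta</Name><Population>250</Population></State>
</States>
""".strip()

# Hypothetical attribute-based form (in the spirit of Variation 2).
ATTRIBUTE_FORM = """
<States>
  <State Name="Alpha" Population="100"/>
  <State Name="Beta" Population="250"/>
</States>
""".strip()


def read_element_form(xml_text):
    """Read name/population pairs stored as child elements."""
    root = ET.fromstring(xml_text)
    return {
        state.findtext("Name"): int(state.findtext("Population"))
        for state in root.findall("State")
    }


def read_attribute_form(xml_text):
    """Read name/population pairs stored as attributes."""
    root = ET.fromstring(xml_text)
    return {
        state.get("Name"): int(state.get("Population"))
        for state in root.findall("State")
    }


from_elements = read_element_form(ELEMENT_FORM)
from_attributes = read_attribute_form(ATTRIBUTE_FORM)

# Different syntax, same information once it is read in.
print(from_elements == from_attributes)
```

Note that each form needs its own reader. That is the cost of ad hoc syntax: the information is the same, but every consumer has to know which variation it was handed.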
So, what is the point here? Well actually, I have several points which I will list and discuss.
- Every one of those forms of XML represents the exact same set of information, the information which you can see in that PDF. While the syntax of each of the files (the different XML forms above) differs, the semantics of the information (the meaning of the information) is exactly the same.
- Some information is expressed more explicitly than others in each of the different forms of XML. For example, the population data is an estimate as of July 1, 2008. The point, though, is that this fact (that the information is an estimate, and as of what point in time) is sometimes very explicit and other times only implicit within the different forms of XML.
- The populations of each state are supposed to add up to the total for all the states. Here is another version of the first variation of XML with an error in it. Can you see the error? The last two digits of the total have been transposed. Some formats are better than others at communicating the fact that the information adds up. For example, in XBRL you could communicate that the information adds up quite easily, and get a report which shows that the information does add up. This is a validation report.
- The states are related to each other in different ways. For example, you can break down the states by, say, region: South, Northeast, West, Midwest, and so forth. That information is not communicated in any of these XML formats. However, any of these forms of XML could communicate that information, in XML, in whatever way they may desire.
Which form of XML is the best? Well, that all depends on what you need from the information, all things considered. On the one extreme, if you just want to make a simple set of information available to a small group of people, any old XML will do. In fact, you could use pretty much any data format. But XML works well over the Web, it is in vogue, and it is a good general format.
If you are, say, a government agency or other enterprise and you want to work with one data set and you don't need to exchange that information with other government agencies and you will only have one data format, traditional XML could work for you. But what if you want to verify that numeric information adds up correctly? Well, you could build your own validation mechanism because your data set is small and you don't have complex computations.
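As a rough idea of what such a home-grown validation mechanism might look like, here is a minimal sketch in Python. It assumes a hypothetical XML layout with `State` elements and a single `Total` element; the names and numbers are placeholders, not the real population data. The second document has the last two digits of its total transposed, the same kind of error described earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout and illustrative numbers (not real population data).
GOOD_XML = """
<Populations>
  <State Name="Alpha" Population="100"/>
  <State Name="Beta" Population="250"/>
  <State Name="Gamma" Population="137"/>
  <Total Population="487"/>
</Populations>
""".strip()

# Same data, but the last two digits of the total are transposed.
BAD_XML = GOOD_XML.replace('<Total Population="487"/>',
                           '<Total Population="478"/>')


def check_total(xml_text):
    """Return (is_valid, computed_sum, reported_total) for one document."""
    root = ET.fromstring(xml_text)
    computed = sum(int(s.get("Population")) for s in root.findall("State"))
    reported = int(root.find("Total").get("Population"))
    return computed == reported, computed, reported


for label, doc in (("good", GOOD_XML), ("bad", BAD_XML)):
    ok, computed, reported = check_total(doc)
    status = "adds up" if ok else "does NOT add up"
    print(f"{label}: sum of states = {computed}, total = {reported}: {status}")
```

The check itself is trivial. The catch is that the rule "the states add up to the total" lives only in this script, not in the data, so everyone who receives the XML has to write (and agree on) the same rule separately.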
But how many government agencies or other enterprises don't have to interact with other government agencies or enterprises, subsidiaries, etc.? If you interact with others, you have to agree. To agree, you need some sort of framework to agree on. For example, the National Information Exchange Model (NIEM) is a framework to help government agencies involved with public safety and security to create XML which is easier to share. The framework adds discipline to creating their XML formats. Rather than each agency creating point solutions for exchanging information, the framework provides the discipline needed to create a canonical standard format which makes exchanging information easier. (Their introduction document does a great job of explaining this.)
XBRL is also a framework for agreement. For example, the US GAAP Taxonomy Architecture is part of a framework for using XBRL in a specific way, creating what amounts to an application profile (i.e. no XBRL tuples, no XBRL typed dimensions, no use of the XBRL scenario context element, build [Table]s in a specific way, etc.) Also, the XBRL framework provides mechanisms for achieving things which are commonly needed in business reporting. For example, it provides the ability to: add labels, add multiple labels, express computations between numeric information, express additional types of relations between concepts, etc. If you need this and you are using XML, you would have to build these things yourself.
Sharing information with a large number of users is one thing. While a framework helps make these systems work better, what if you want to connect information between all these systems? Some people use traditional XML, some use XBRL, and some use other formats. That is what RDF/OWL and the Semantic Web are all about. For example, this Data.gov project has converted numerous data sets into RDF/OWL. (This is a great book for understanding how the Semantic Web will be changing your life.)
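The idea of connecting information from separate systems can be sketched with plain subject-predicate-object triples, which is the basic shape of RDF data. The sketch below uses ordinary Python tuples rather than a real RDF library, and every URI, state name, region assignment, and number in it is hypothetical. It only illustrates how two independently produced data sets can be merged and queried together once they share identifiers.

```python
# Each triple is (subject, predicate, object). All URIs are hypothetical.
EX = "http://example.org/"

# Data set from one source: populations (illustrative numbers).
population_source = {
    (EX + "state/Alpha", EX + "population", 100),
    (EX + "state/Beta", EX + "population", 250),
    (EX + "state/Gamma", EX + "population", 137),
}

# Data set from another source: which region each state belongs to.
region_source = {
    (EX + "state/Alpha", EX + "inRegion", EX + "region/South"),
    (EX + "state/Beta", EX + "inRegion", EX + "region/West"),
    (EX + "state/Gamma", EX + "inRegion", EX + "region/South"),
}

# Because both sources use the same identifiers for the states,
# merging them is just a set union.
graph = population_source | region_source


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def population_by_region(graph):
    """Total the population of each region across the merged graph."""
    totals = {}
    for state, predicate, region in graph:
        if predicate != EX + "inRegion":
            continue
        for pop in objects(graph, state, EX + "population"):
            totals[region] = totals.get(region, 0) + pop
    return totals


for region, total in sorted(population_by_region(graph).items()):
    print(region, total)
```

Neither source on its own can answer "what is the population of each region?", but the merged set of triples can.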
The bottom line here as I see it is this: When you build your information exchange systems, be sure you are considering the right things for the long term. I see four groups of XML:
- XML unconstrained by a framework (ad hoc XML)
- XML constrained by some framework
- XBRL, one specific type of XML framework for a specific purpose (This blog post helps you see how XBRL builds on top of traditional XML)
- RDF/OWL
This is not to say that one type of XML is better than another. It is more about understanding what you need to be considering when you try to determine your needs. Using the wrong type of XML is like trying to fit a square peg in a round hole. You can do it, but it is not pretty.