Understanding SQL-type Queries as Compared to XQuery-type Queries
I have been trying to understand the specific differences between:
- a relational-type query using SQL
- an XQuery-type query
- a SPARQL-type query
Why? Every tool has strengths and weaknesses. It blows my mind how many people love Microsoft Excel but have never even tried Microsoft Access. Excel and Access are two different tools, each has pros and cons. Many Excel users who only know Excel think that Excel is the be-all, end-all solution for everything. It is not. Access is a vastly superior tool for many things, vastly out performing Excel. As is said, "If the only tool you have in your toolbox is a hammer, then everything is a nail."
Information has different forms and databases that store that information likewise can take many forms. Relational databases is one very popular data storage system. SQL is one of the most useful things that I have ever used.
Another useful thing is XML.
XML and relational databases are different. As I understand it there are three primary schemes/models/representations for formatting and processing structured (semi-structured, highly-structured) information:
- Table-type (homogeneous, tabular, consistent): relational databases, CSV, spreadsheets, or tabular-type representations which allow only one level of hierarchy within each table; but hierarchies can be constructed by relating tables
- Tree-type (heterogeneous, arborescent): XML, XBRL (using tuples), JSON and other tree-hierarchy-type information which allow for the expression of one hierarchy
- Open-type or Graph-type (heterogeneous, arborescent): RDF, EAV, XBRL (using dimensions) and other open schema-type or graph-type representations which are more graph-oriented and allow for dynamically creating virtually any number of hierarchies; supports the notion of cycles; very flexible
Now, not surprising, there are three primary query languages which match the type of information representation scheme: SQL, XQuery, and SPARQL.
I don't want to get into SPARQL right now. I want to understand the difference between SQL and XQuery.
I stumbled across this paper: PERFORMANCE ANALYSIS OF XQUERY VS. SQL. The abstract of that paper says the following: (emphasis is mine)
Early in the XML history there were thoughts about whether XML is sufficiently different from other data formats to require a query language of its own, since SQL was already a very well established standard for retrieving information from relational databases.
But there were some differences that justified a new query language for XML data. Relational data is 'flat.' This means it is organized in the form of a two dimensional array of rows and columns. XML data is 'nested' and its depth of nesting can be irregular. On the one hand, relational databases can represent nested data structures by using tables with foreign keys, but it is still difficult to search these structures for objects at an unknown depth of nesting. On the other hand, in XML it is very natural to search for objects with an unknown position in a document. There are other differences, which also convinced the W3C workgroup to design a new XML Query language with a more efficient semantic definition rather than extending a relational language. It would be useful to focus research on programming an Analysis tool between nested databases and relational databases to show performance and scalability with respect to data volume and complexity for both types of queries.
The "Group By" clause in SQL is not easily replaced by features of XQuery. The thesis shows ways to achieve similar functionality and demonstrates that the current XQuery implementations are either non-compliant to the XQuery standard or much slower than SQL when performing "Group By" equivalent requests.
The direct result of this thesis was an Open Source tool, PerfanX, and the performance statistics gathered from a variety of queries run under that tool. The tool uncovered several counter-intuitive performance implications. The study also exposed several bugs in XQuery implementations. The author has reported these to the development groups.
Now, that article was written in 2004 when only XQuery 1.0 existed. Subsequent to this article, grouping was added to XQuery as part of XSLT 2.0. Now we are on XQuery 3.0.
An obvious question might be, "wouldn't it be better if there were ONE query language which supported all three information representation formats?" Well, others had that question. See Translating SPARQL and SQL to XQuery. I don't understand the paper, but I did speak to some people about this and they say that XQuery can represent anything that SQL and SPARQL can represent.
So where am I on this? It seems to me that a relational database can represent anything, but both relational databases and SQL were designed for tabular-type information, rows and columns. Tree-type XML and open-type XML is far richer in its ability to express information because of the ability to express hierarchies. So while it is possible to get SQL to do what XQuery was designed to do, query such hierarchies; the SQL and the databases get increasingly complex as the information becomes richer.
Bottom line: XQuery does a better job querying tree-type information and because of the richness of the information there is a lot more that you can do with the information because of (a) its inherent richness and (b) the features of XQuery to work with hierarchical information. Relational databases and SQL are far more mature than XQuery and have other features which XQuery does not have. The last sentence in that paper mentioned above by the author is this:
XQuery provides a powerful way to query this information and shouldn’t be underestimated.
I don't know that this is an either/or type question. Maybe it is, maybe not. Another outstanding question is performance. I want to personally see the performance.
Another outstanding question is the specific differences between SPARQL and XQuery. Stay tuned!
Reader Comments