Many people struggle when it comes to figuring out how to best work with XBRL. Typically their first knee-jerk reaction is to take the tool that they have in their toolbox which usually includes a relational database, read an XBRL instance file, and pack the XBRL technical syntax into a relational database.
Then they sit back an wonder why it is so hard to work with the XBRL-based information and they immediately blame XBRL for their problems.
This blog post walks you through a number of options for storing and working with the digital business information such as SEC XBRL financial filings.
XBRL is a global standard for information exchange, not querying
The very first thing one needs to understand is that XBRL is a global technical syntax for the exchange of business information. XBRL is optimized for extensibility, not for querying. XBRL instances, XBRL taxonomies, and XBRL linkbases are all very "flat". The XBRL information is rich with meaning, but that meaning is not exposed until you feed the raw XBRL technical information exchange syntax into an XBRL processor. That XBRL processor sorts out the information, pieces together the relationships, and then makes the information easier to work with.
Now, generating information which is easy to work with is up to the software vendor using the XBRL processor. Unfortunately, there are some software vendors who make the mistake of not even using an XBRL processor and ultimately end up realizing that they built a almost a complete XBRL processor only after they realized all that is necessary to work with the XBRL-based information.
The "easier to work with" form is different for most XBRL processors. Why? Well, because there is no standard for that form. Now, XBRL International has made progress toward a standard intermediate format in the XBRL International Abstract Model 2.0. Find information about that here and here.
The second reason is that most people miss the point and don't realize that the needed for is a form tailored to business users querying the information. In my view this is a communication problem because business users don't seem to understand what they want and need and technical people don't understand how to talk to business people and get them to tell them what they really need.
Spectrum of options
Here is the spectrum of options that I see for storing and querying digital business information. Walking through this helps one understand the important pieces needed from a business information database which holds XBRL-based information. I will use SEC XBRL financial filings to make my point, but any set of digital business reports has similarities to SEC XBRL financial filings:
- Internet file system of XBRL files: The first option is simply providing a listing of files such as this RSS feed provided by the SEC which points to XBRL instance files. Here is the list of the different RSS feeds. Easy enough. Read the RSS feeds you need, grab each XBRL instance, and there you go. You see the first problem. It takes me about 6 hours to iterate through the approximately 8000 SEC filings received each quarter. Running a query to say get "Assets" for each reporting entity does not perform well under these circumstances.
- Local file system XBRL files: No problem, read the files once which anyone would have to do to get the information and cache the files locally. Good idea, but that gets you a process that takes about an hour to do a query for "Assets".
- Local relational database, XBRL files: Next obvious choice is to take those local XBRL files and rather than work with them as files, take each of the files and put the information into a relational database. This is pretty easily actually, XBRL files are very flat and you can read the contents and put the information easily into a relational database. Doing a query to get the value of "Assets" for all 8000 filings takes seconds. Not bad, query goes from 6 hours to seconds! Things are looking up!
- Local files, business information infoset: The problem with the relational database packed with XBRL technical syntax is that you have two issues: (a) the information is in the XBRL technical syntax and (b) working with the information requires you to basically do what an XBRL processor does and properly connect the information together each time you execute a query. That does not perform very well. The answer: process each XBRL-based financial report and resolve the relationships and sort out all the other things which need to be sorted out; and then store that intermediate format and query against that. Looking at three things will help you understand better. First, this is an XBRL instance. There are three important things to notice: (1) the information is "flat", not a lot of hierarchy to grab ahold of, (2) the information is severly "disjointed" and you have to bounce around all over the place to put the pieces together, (3) certain information, specifically the "dimension-defaults" are "missing" from the context information; watch the video on this blog post to understand why this information seems missing. And now take a look at the information organized into XML infosets arranged for business meaning and ease of querying: here are the facts, here is the model structure. Notice how both the facts and the model structure have some hierarchy to work with. Notice how all the information, including those "dimension-defaults" which are missing from the actual XBRL instance. These XML infosets are similar to the XBRL International Abstract Model 2.0 format. OK, so the information is now easy to query, but we are back to another problem: we are dealing with files and not information within a database. (Note: I believe that these infoset need to be more in line with the terms of the Financial Report Semantics and Dynamics Theory terminology.)
- Business information infosets in a document database: No problem, take the well organized XML infosets designed for extracting business information, put them into a document oriented database such as the free, open source MongoDB, or Couchbase. Other databases will do. Remember that query for "Assets"? Works just as well as running it from a SQL database. One of the very significant advantages is that the report components which have fact tables which have different "shapes" because different information has different characteristics associated with them and causes problems when you want to put these different shaped fact tables into a relational database which wants consistent structures; well that is no longer a problem. Why? Because the document databases are basically graphs of information, not the rather inflexible structures of SQL databases. Also, you can do more powerful graph matching within these database as opposed to the more limited SQL queries. But you still have a rather significant problem. What query language do you use to query this document databases?
- JSONiq: JSONiq is a standard query language. JSONiq is called "the SQL of NoSQL databases". That query of "Assets"? Not a problem. But JSONiq is just a query language. Someone has to connect the query language to the document database. You can do that yourself, or someone can do that for you.
- JSONiq on top of business information infoset in a document database: The company 28msec.com created an information processing platform called 28.io. That combines a JSONiq query engine, a document database, and the business information infosets created from the XBRL instance and XBRL taxonomy. Plus, the 28.io processing engine lets you do queries using JSONiq, SQL, XQuery 3.0, or SPARQL! See this PDF for more information. Or see this video. But there is still a slight issue, this platform does not inherently understand XBRL.
- Digital business information specific query language: What if the 28.io platform did inherently understand the XBRL technical syntax and the business information infoset format which is better for querying XBRL-based business information? What if a digital business reporting platform could convert from XBRL, to the business reporting infoset, to RDF/OWL or to whatever technical syntax you might desire on the fly? On the SemanticXBRL LinkedIn discussion group, one of the members envisioned a "Semantic Extensible Modeling Framework" which combined the XBRL and RDF/OWL and perhaps even other syntaxes info one framework.
There are more types of databases. There are XML databases. (SoftwareAG has an XML database that I am familier with.) There are object databases. There are graph databases. (Neo4j is the graph database that is used by Facebook as I understand it and Arelle is looking to integrate with that database.) There are relational databases that support XML data types. I don't know if hierarchical databases is a type of database or a type of model. There are RDF triple stores.
This link provides information on NoSQL.
Bottom line
What is the point? What is the bottom line?
- If the only tool you have in your tool box is a hammer, then everything looks like a nail.
- You will need some sort of database if the number of XBRL instances containing business information grows to any substantial size.
- Storing your information in a database using the XBRL global standard syntax meant for exchanging information is a fool's errand or wild goose chase. Having to convert that information into a querible format each time a query is desired...really?
- Sorting business information using the semantics of that business information makes a lot of sense.
- Querying business information using a syntax tuned to querying business information makes a lot of sense.
- Graph matching query languages are much more powerful than SQL. Personally, I like SQL. Have used it for years. But SQL is not the only tool in my toolbox. I also like XQuery. SPARQL is hard to use, at least for me.
Stay tuned folks, there is more to come! But now I need to go for a bike ride.