In prior posts I looked at XBRL and databases and the basics of querying XBRL-based information. This blog post synthesizes what I have discovered into a concise set of conclusions:
- Querying from physical files is not good for query performance; preload information into some sort of database: Having to "move" XBRL documents (XBRL instances and XBRL taxonomies) from the SEC web site to your computer in order to use that information in a query is not efficient in terms of query performance. Moving the information from the SEC web site into some more accessible place, such as a database, before you query significantly improves performance. Basically, this is an internet bandwidth issue.
- Querying an information exchange format is not good for query performance; preprocess information into a more queryable format: XBRL is an information exchange format and is not optimized for query. To achieve good query performance, the information contained within an XBRL instance and all related XBRL taxonomies needs to be converted from the information exchange format into a more queryable format. Basically, having to "resolve" all the information at the time of the query is inefficient. Staging the information you want to query into a more queryable format vastly improves query efficiency because the XBRL processor does not have to "resolve" all the relations at query time; everything has already been resolved in advance.
- Querying business information without some sort of query language makes queries harder to create; use an information set specific query language: Extracting information from a database, even one with properly formatted information, is still hard for business users and even developers. Or said another way, if a business user or developer had an information set specific query language, querying would be much simpler. Making querying simpler also includes setting up the correct indexes, organizing additional useful metadata correctly, understanding the different "joins" which are commonly used, etc.
- Querying business information without having dealt with information set idiosyncrasies in advance can make querying problematic; resolve these idiosyncrasies in advance: A simple example will make this point clear. SEC XBRL financial filings are submitted to the SEC, and those submissions can be amended. If you query an information set which contains both an original filing and the amended filing which updates it, and this idiosyncrasy has not been dealt with, you will get incorrect query results. Dealing with this idiosyncrasy at the time you perform the query reduces query performance; resolving it in advance improves performance. This is only one example of the many idiosyncrasies of SEC XBRL financial filings. Another is that searching for "us-gaap:Revenues" is not as simple as it seems, because many different concepts could have been used to report revenues.
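To make the points above a little more concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is only an illustration under stated assumptions, not how any XBRL processor or the SEC EDGAR system actually works. The table and column names, the `amends` link between filings, the `concept_groups` mapping, the function names, and the sample companies and values are all hypothetical. The sketch also assumes each amendment points at the filing it directly replaces. The idea is to preload facts into a database, resolve amended filings and alternative revenue concepts once at load time, and then run a simple indexed query.

```python
import sqlite3

# Hypothetical, heavily simplified schema; real filings carry far more metadata.
SCHEMA = """
CREATE TABLE filings (
    accession     TEXT PRIMARY KEY,
    entity        TEXT NOT NULL,
    fiscal_period TEXT NOT NULL,
    amends        TEXT  -- accession of the filing this one replaces, or NULL
);
CREATE TABLE facts (
    accession TEXT NOT NULL,
    concept   TEXT NOT NULL,
    value     REAL NOT NULL
);
-- Assumed mapping from reported concepts to one queryable group name.
CREATE TABLE concept_groups (
    concept    TEXT PRIMARY KEY,
    group_name TEXT NOT NULL
);
"""

# Staging step, run once after loading: keep only facts from filings that
# no other filing amends, and attach the concept group to each fact.
STAGE = """
CREATE TABLE current_facts AS
SELECT fl.entity, fl.fiscal_period, g.group_name, f.value
FROM facts f
JOIN filings fl ON fl.accession = f.accession
JOIN concept_groups g ON g.concept = f.concept
WHERE fl.accession NOT IN (
    SELECT amends FROM filings WHERE amends IS NOT NULL
);
CREATE INDEX idx_current_group ON current_facts(group_name);
"""

# Made-up sample data: AlphaCo amended its filing; BetaCo used another concept.
FILINGS = [
    ("0001", "AlphaCo", "FY2012", None),
    ("0002", "AlphaCo", "FY2012", "0001"),
    ("0003", "BetaCo", "FY2012", None),
]
FACTS = [
    ("0001", "us-gaap:Revenues", 100.0),
    ("0002", "us-gaap:Revenues", 110.0),
    ("0003", "us-gaap:SalesRevenueNet", 250.0),
]
CONCEPT_GROUPS = [
    ("us-gaap:Revenues", "Revenues"),
    ("us-gaap:SalesRevenueNet", "Revenues"),
]


def build_store(filings, facts, concept_groups):
    """Load everything into an in-memory database and stage it for querying."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO filings VALUES (?, ?, ?, ?)", filings)
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", facts)
    conn.executemany("INSERT INTO concept_groups VALUES (?, ?)", concept_groups)
    conn.executescript(STAGE)
    return conn


def query_group(conn, group_name):
    """Query the staged table; no amendment or concept logic is needed here."""
    cursor = conn.execute(
        "SELECT entity, fiscal_period, value FROM current_facts "
        "WHERE group_name = ? ORDER BY entity",
        (group_name,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = build_store(FILINGS, FACTS, CONCEPT_GROUPS)
    print(query_group(store, "Revenues"))
```

With the sample data, the query for "Revenues" returns AlphaCo's amended value of 110.0 rather than the original 100.0, and it also picks up BetaCo's 250.0 even though BetaCo reported it under a different concept. The query itself stays trivial because the hard work was done during staging.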
Having efficient and effective queries comes from prior planning. Minimizing system idiosyncrasies and being sure you provide what is necessary to easily extract meaningful information from your XBRL-based information set is best done proactively rather than reactively. Test, test, and then test some more!