(Note that I updated the S&P 500 information to correct an issue where I was not finding "Equity" for a handful of reporting entities. This correction is reflected in the current data set.)
As part of some other work and some experimentation, I have created two prototypes that use SEC XBRL information provided in 10-Ks. I am only trying to use very basic financial information (assets, equity, revenues, net income, net cash flow), but I think these are helpful in seeing the possibilities.
Here are links to the two prototypes:
Information was extracted from SEC XBRL financial filings without the help of an XBRL processor. What I am trying to do is see how reliably I can extract very basic financial information. Places where I am having extraction issues are clearly indicated. What is NOT indicated is where I am pulling the wrong information. For example, it is hard to know for sure whether I am getting the "revenues" numbers correctly because of the way filers put this fact in their SEC XBRL filings. I know that revenues for American Express Company are incorrect because they break out non-interest and interest revenues and do not provide a total (most companies provide a total for revenues using a common set of concepts).
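To make the approach concrete, here is a minimal sketch of pulling the five data points straight out of the instance XML with nothing but the Python standard library. This is not the code behind the prototypes. The fallback lists in CANDIDATES, the function names, and the idea of passing in a single context id are all assumptions made for illustration, and a real filing would need a much longer list of candidate concepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical fallback lists: us-gaap concept names tried in order.
# A real extraction would need many more candidates per data point.
CANDIDATES = {
    "assets": ["Assets"],
    "equity": [
        "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest",
        "StockholdersEquity",
    ],
    "revenues": ["Revenues", "SalesRevenueNet"],
    "net_income": ["NetIncomeLoss", "ProfitLoss"],
    "net_cash_flow": ["CashAndCashEquivalentsPeriodIncreaseDecrease"],
}


def split_tag(tag):
    """Split an ElementTree tag '{namespace}local' into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def extract_basic_facts(xml_text, context_id):
    """Return {data point: value or None} for facts reported in one context."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root:
        namespace, local = split_tag(element.tag)
        # Only standard taxonomy concepts count; extension concepts are skipped,
        # which is why a filer-specific "net cash flow" comes back as None.
        if "us-gaap" not in namespace:
            continue
        if element.get("contextRef") == context_id and element.text:
            found.setdefault(local, element.text.strip())
    return {
        key: next((found[name] for name in names if name in found), None)
        for key, names in CANDIDATES.items()
    }
```

A None in the result marks a visible extraction problem. A wrong value picked up from an unexpected concept would not show up this way, which is exactly the harder problem described above.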
Uncommon uses of [Axis] cause other issues. For example, most filers use the legal entity [Axis] and indicate the legal entity as either the consolidated entity or the parent holding company. This SEC filer does something radically different: (a) they use the name of their company as the value of the legal entity [Axis] and, complicating things even more, (b) they do not make this the default dimension. This sort of inconsistency makes using the data much more complicated and increases the risk of picking the wrong information to use. This filer does something slightly different. Personally, I see these sorts of inconsistencies as unnecessary, and they clearly increase the risk of automating the reuse of the information.
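Without an XBRL processor, the dimensional information has to be read out of the contexts by hand. The sketch below lists the explicit dimension members on each context, so that contexts with no dimensions (the usual home of the consolidated numbers) can be separated from the rest. The function names are my own for illustration, and the sketch only looks at explicit members; it does not resolve dimension defaults from the taxonomy, which is where the second problem above comes from.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"
XBRLDI = "{http://xbrl.org/2006/xbrldi}"


def context_dimensions(xml_text):
    """Map each context id to {dimension QName: member QName}."""
    root = ET.fromstring(xml_text)
    result = {}
    for context in root.iter(XBRLI + "context"):
        dimensions = {}
        for member in context.iter(XBRLDI + "explicitMember"):
            dimensions[member.get("dimension")] = (member.text or "").strip()
        result[context.get("id")] = dimensions
    return result


def undimensioned_contexts(xml_text):
    """Ids of contexts that carry no explicit dimension members at all."""
    return [cid for cid, dims in context_dimensions(xml_text).items() if not dims]
```

If a filer reports its main numbers only in contexts that carry a legal entity member, undimensioned_contexts will come back without the contexts you need, and the facts have to be found through context_dimensions instead.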
The first prototype, the summary information for the Dow 30, Fortune 100, and S&P 500, shows the error rate to be fairly low for these key pieces of information. For the Dow, there are 150 pieces of extracted information (30 companies times 5 data points) with only one occasion where I could not find the fact I was looking for: General Electric chose to muck up the works by providing an extension concept for "net cash flow". I am NOT saying that all the numbers are 100% correct. That is a lot of work to test and I am not at that point yet. But finding things which seem to work is a very good first step toward achieving the XBRL vision of reusing the information.
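The arithmetic behind the error rate is simple enough to show. This small sketch uses only the Dow figures given above (30 companies, 5 data points each, 1 fact not found); the function name is just for illustration. Note that this rate counts only facts that could not be found, not facts that were found but are wrong.

```python
def error_rate(companies, data_points_per_company, missing):
    """Share of expected facts that could not be found."""
    total = companies * data_points_per_company
    return missing / total


dow_rate = error_rate(companies=30, data_points_per_company=5, missing=1)
print(f"Dow 30: {dow_rate:.2%} of facts not found")  # prints "Dow 30: 0.67% of facts not found"
```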
Likewise, the error rates for the Fortune 100 and S&P 500 are fairly low. I calculate a 0.6% error rate for the Fortune 100 and a 1.9% error rate for the S&P 500. Not bad, but again, any error rate of more than 0% will yield a less than satisfactory result. There were some other issues relating to the SEC RSS feed which showed themselves when I tried to use the S&P 500 information. A number of filings don't seem to show up in the SEC RSS feed. Not sure why, but this sure is annoying. Also, I have some duplication of some companies. I have not yet gotten to the bottom of how that was caused; I am still working on that.
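Finding the cause of the duplication is one thing, but locating the duplicated companies is easy to automate. Here is a minimal sketch, assuming each extracted row is a dictionary with a "cik" key (that row layout is an assumption for illustration, not the actual structure of my data):

```python
from collections import Counter


def duplicated_ciks(rows):
    """Return the sorted CIKs that appear on more than one row."""
    # Normalize to a 10-digit, zero-padded string so 1 and "0000000001" match.
    counts = Counter(str(row["cik"]).strip().zfill(10) for row in rows)
    return sorted(cik for cik, count in counts.items() if count > 1)
```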
The raw data is provided in Excel. Fiddle with it. If you find anything interesting please let me know.
The second prototype shows even more possibilities. The S&P 500 Additional information links to a number of other web pages, creating a nice mashup. Most of the information which I used came from the Wikipedia list of S&P 500 companies web page. What I had to do, though, was manually put the SEC CIK number on the Wikipedia list in order to cross reference the information which I had with the information Wikipedia had. The reason is that (a) the Wikipedia information did not provide the CIK number, which was the key I had to use, and (b) the SEC filings did not provide the company ticker symbol for every company, nor did they provide the exchange on which the stock was traded.
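Once the CIK is on both lists, the cross reference itself is a plain key lookup. The sketch below shows the idea; the row layout (dictionaries with a "cik" key) and the function names are assumptions for illustration, and any rows that fail to match are returned separately rather than silently dropped.

```python
def normalize_cik(cik):
    """Zero-pad a CIK to 10 digits so the two sources use the same key."""
    return str(cik).strip().zfill(10)


def cross_reference(extracted_rows, wiki_rows):
    """Join extracted XBRL rows to Wikipedia rows on CIK.

    Returns (merged rows, CIKs from extracted_rows with no Wikipedia match).
    """
    wiki_by_cik = {normalize_cik(row["cik"]): row for row in wiki_rows}
    merged, unmatched = [], []
    for row in extracted_rows:
        key = normalize_cik(row["cik"])
        wiki = wiki_by_cik.get(key)
        if wiki is None:
            unmatched.append(key)
        else:
            # Values from the XBRL row win if both sources share a field name.
            merged.append({**wiki, **row, "cik": key})
    return merged, unmatched
```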
Metadata like this CIK number is critical for putting lists of things together. Another piece which I added (and I am not done yet) is the auditor. That is not provided anywhere in the XBRL. I had to go read the HTML page where the name of the auditor does exist. Perhaps the audit report will eventually be expressed using XBRL and then the auditor will be easy to grab. It would be way, way easier to use this information if the SEC required it in the SEC XBRL financial filing. For example, I have all sorts of interesting information about the generator software used to create the SEC XBRL filing. That is provided by software vendors and can be gleaned from the XBRL (it is in a comment).
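Since the generator information lives in an XML comment, which many XML parsers discard by default, one simple way to get at it is to scan the raw text of the instance document. This is only a sketch with a function name of my own choosing. It returns every comment it finds, and deciding which one names the generating software is left to the reader because vendors do not follow a common format.

```python
import re

# Non-greedy match so each comment is captured separately, across line breaks.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def xml_comments(xml_text):
    """Return the text of every XML comment in the document, in order."""
    return [comment.strip() for comment in COMMENT_RE.findall(xml_text)]
```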
This stuff is going to be so useful (and cool!) when it works correctly. By looking at this sort of prototype it is easier to see the gaps between what exists and where we will end up.