My first pass yielded an information accuracy rate of 95%. The next pass pushed that up to 96.6%, and the one after that reached 97.3%. Focusing only on commercial and industrial companies (in effect creating subcategories, but concentrating on that primary category) yielded 98.4% accuracy. These incremental improvements came from tuning my algorithm and adding metadata for that algorithm to use.
All this work ultimately resulted in a 98% overall accuracy rate for extracting the fact values of 51 fundamental accounting concepts from SEC XBRL financial reports, as verified by the 21 relations between those facts. This graphic provides a summary of these results:
All this work also resulted in a more precise understanding of why SEC XBRL financial filings do not pass the 21 tests which I have specified and, ultimately, why information from an SEC XBRL financial filing cannot be reused by automated processes. That is the goal: robust, reliable, predictable automated reuse of the information.
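To make the idea of a relation test concrete, here is a minimal sketch in Python. The two relations, the concept names, the tolerance, and the `check_relations` and `pass_rate` helpers are illustrative assumptions based on ordinary accounting identities; they are not the actual 21 relations or the algorithm referred to above.

```python
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, float]

# Hypothetical relations, each expressed as "difference that should be zero".
# These are assumed for illustration, not the real set of 21 tests.
RELATIONS: List[Tuple[str, Callable[[Facts], float]]] = [
    ("Assets = LiabilitiesAndEquity",
     lambda f: f["Assets"] - f["LiabilitiesAndEquity"]),
    ("Assets = CurrentAssets + NoncurrentAssets",
     lambda f: f["Assets"] - (f["CurrentAssets"] + f["NoncurrentAssets"])),
]


def check_relations(facts: Facts, tolerance: float = 0.5) -> Dict[str, bool]:
    """Return pass/fail per relation; a missing fact counts as a failure."""
    results: Dict[str, bool] = {}
    for name, difference in RELATIONS:
        try:
            results[name] = abs(difference(facts)) <= tolerance
        except KeyError:
            results[name] = False
    return results


def pass_rate(filings: List[Facts]) -> float:
    """Fraction of all relation checks that pass across a set of filings."""
    outcomes = [ok for facts in filings for ok in check_relations(facts).values()]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Treating a missing fact as a failure is a design choice in this sketch: a relation that cannot be evaluated gives an automated consumer no more assurance than one that does not balance.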
And while I do want to be able to reuse the SEC XBRL financial information, the overarching goal is to understand how to build systems that use XBRL-based information and are robust, reliable, and predictable, so that the information within them can be reused dependably. Without that, what good is the system?
If you have not tried it already, experiment with the Excel-based prototype, which uses my algorithm to extract SEC XBRL financial information. You can also see the algorithm which I am using.
The following is a summary of the specific issues I have encountered which get in the way of information reuse:
- Difficulty discovering the root reporting entity for a small number of filers (about 58).
- Significant variability in the concept used by SEC filers to report revenues.
- Lack of clear totals for the sub categories which make up operating income (loss).
- Lack of clear totals for distinguishing between nonoperating income (loss) and interest and debt expense.
- Filers crossing categories of fundamental concepts (for example, including the total of one category within another category).
- Inappropriate extension of these high-level concepts.
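The variability in revenue concepts noted in the list above is the kind of issue that extra metadata can help with. As a rough sketch only, an extraction routine might try an ordered list of candidate concepts and take the first one the filer actually reported. The candidate names, their order, and the `find_concept` helper are assumptions made for illustration; they are not a complete list and not the metadata my algorithm actually uses.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumed, incomplete list of concepts a filer might use to report revenues,
# in order of preference. Illustrative only.
REVENUE_CANDIDATES: Tuple[str, ...] = (
    "Revenues",
    "SalesRevenueNet",
    "SalesRevenueGoodsNet",
    "SalesRevenueServicesNet",
)


def find_concept(
    reported: Dict[str, float],
    candidates: Sequence[str] = REVENUE_CANDIDATES,
) -> Optional[Tuple[str, float]]:
    """Return (concept name, value) for the first candidate the filer reported."""
    for name in candidates:
        if name in reported:
            return name, reported[name]
    return None  # Nothing matched, e.g. the filer used an extension concept.


# Usage: a filer that reported SalesRevenueNet rather than Revenues.
facts = {"SalesRevenueNet": 500.0, "Assets": 1000.0}
print(find_concept(facts))
```

A `None` result is worth logging rather than silently skipping, since it often points to one of the other issues in the list, such as an inappropriate extension of a high-level concept.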