By narrowing the focus of my analysis to commercial and industrial companies only (i.e. excluding banking and savings entities, insurance companies, brokers/dealers, and real estate) and dropping two tests for which I just cannot get a high pass rate because of the variability within SEC filings, I was able to achieve a 98.4% accuracy rate. Basically, 4,048 of the 7,160 filings fit these criteria (57% of total filings). In addition to that narrower set of 4,048 filings passing 98.4% of the tests, a total of 1,113 filings pass 100% of the tests.
Now, narrowing the testing to commercial and industrial companies seems very reasonable. All I am saying is that if the 7,160 reporting entities were broken into more specific groups, and my information extraction algorithm were targeted to each specific group, the algorithm would be more accurate. It is easy to grasp why this is true. For example, banking and savings entities report their revenues in different ways than commercial and industrial companies do.
However, I don't know whether I should just be discarding two tests which I know should work. The two tests are the following:
- Gross Profit = Revenues - Cost Of Revenue (IS1)
- Operating Income (Loss) = Gross Profit - Operating Expenses + Other Operating Income
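As a minimal sketch, the two relations above could be checked like this. The function names, the `tolerance` parameter, and the choice to return `None` when a fact is missing (rather than counting it as a failure) are my assumptions for illustration, not a description of the actual test harness.

```python
from typing import Optional


def check_gross_profit(
    revenues: Optional[float],
    cost_of_revenue: Optional[float],
    gross_profit: Optional[float],
    tolerance: float = 0.0,
) -> Optional[bool]:
    """Test: Gross Profit = Revenues - Cost Of Revenue.

    Returns None when any fact is missing, so "could not be tested"
    stays distinct from "tested and failed" (an assumption of this sketch).
    """
    if revenues is None or cost_of_revenue is None or gross_profit is None:
        return None
    return abs(gross_profit - (revenues - cost_of_revenue)) <= tolerance


def check_operating_income(
    gross_profit: Optional[float],
    operating_expenses: Optional[float],
    other_operating_income: Optional[float],
    operating_income: Optional[float],
    tolerance: float = 0.0,
) -> Optional[bool]:
    """Test: Operating Income (Loss) = Gross Profit - Operating Expenses
    + Other Operating Income. Returns None when any fact is missing."""
    facts = (gross_profit, operating_expenses, other_operating_income, operating_income)
    if any(fact is None for fact in facts):
        return None
    expected = gross_profit - operating_expenses + other_operating_income
    return abs(operating_income - expected) <= tolerance
```

Separating the "missing fact" outcome from a true failure would make it possible to report how many filings break the equation versus how many simply could not be tested.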
The reason that SEC filings don't pass these tests is not that they don't follow the spirit of these equations. Rather, they don't pass because I cannot find the facts which report the fundamental concepts involved. For example, Revenues is found for only 5,640 of the filers, which is 79% of the total. That is because of the variability in the US GAAP Taxonomy concepts used to express this fact. I speculate that I could improve the success rate if I focused on finding more of the concepts used. But that may not be true.
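To illustrate what "finding more of the concepts used" might look like, here is a rough sketch of a fallback lookup: try a list of candidate concepts in priority order and measure how many filings yield a Revenues fact. The candidate list, the function names, and the representation of a filing as a simple name-to-value mapping are all assumptions for illustration; I am not claiming this is the list my algorithm uses.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Illustrative candidates only; a real list would need to be built by
# inspecting which concepts filers actually use to report revenues.
REVENUE_CANDIDATES = (
    "Revenues",
    "SalesRevenueNet",
    "SalesRevenueGoodsNet",
    "SalesRevenueServicesNet",
)


def find_fact(
    facts: Mapping[str, float], candidates: Sequence[str]
) -> Optional[Tuple[str, float]]:
    """Return (concept name, value) for the first candidate present, else None."""
    for name in candidates:
        if name in facts:
            return name, facts[name]
    return None


def coverage(filings: Sequence[Mapping[str, float]], candidates: Sequence[str]) -> float:
    """Fraction of filings in which at least one candidate concept is found."""
    if not filings:
        return 0.0
    found = sum(1 for facts in filings if find_fact(facts, candidates) is not None)
    return found / len(filings)
```

Running `coverage` before and after extending the candidate list would show directly whether adding concepts moves the 79% figure, which is exactly the thing I am unsure about.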
Also, if I eliminate development stage companies and companies which have going concern qualifications, then I get 97.3% accuracy for the overall set of 7,160.