This blog post summarizes information about a machine learning problem that a group of us is solving. If you want to find out more or participate, please contact me.
PROBLEM: The problem is that XBRL-based financial reports submitted to the SEC using US GAAP or IFRS do not contain explicit identifiers that would let you easily extract information from the report. For example, you cannot simply say "give me the balance sheet" or "give me the reconciliation of income taxes between expected statutory rates and actual rates." You have to provide more detailed rules to make up for the missing unique identifiers.
SOLUTION: The solution is to use prototype theory to identify the pieces that make up the specific disclosure within an XBRL-based financial report and then use that information to identify each specific disclosure so you can extract that information. To do this you need:
TECHNIQUE: The objective is to get every disclosure and the rules to identify that disclosure in XBRL-based financial reports.
- Starter list of companies: Here is a list of US GAAP XBRL-based financial reports, all 10-Ks, in machine-readable RSS. We will add companies to that list as the process gets dialed in.
- Start with one rule for a common disclosure: Start with this first rule, for the "document information" disclosure, which 100% of companies report.
- Read XBRL presentation relations: Iterate through each of the reports in the starter list and change the file name from "*.xml" to "*_pre.xml". Taking the Microsoft 10-K XBRL instance as an example, this gives you the XBRL presentation relations from the company XBRL taxonomy that supports that Microsoft 10-K. This will give you a sense for the task.
- Read XBRL calculation relations: Changing the XBRL instance file name from "*.xml" to "*_cal.xml" will give you the XBRL calculation relations. Those contain roll up information that is very helpful in identifying disclosures that are roll ups.
- Use information in rules to find disclosure: Use a disclosure rule to find the network that contains the disclosure matching that rule. Mark the network to indicate what it represents (i.e., which disclosure). Once you discover what a network represents, you never need to read that network again.
- Repeat: Repeat this process for every report and every disclosure. Ultimately, we will add more reports and more disclosures.
- Back into new rules using machine learning (likely clustering): The first six steps above are just a mechanical process, but they help you truly understand the task. Once all the disclosures for which there are rules have been identified (this is the result), the task is to get the machine learning AI to use this information to find similar patterns using clustering.
- Name and tweak machine learning results: The AI will likely be able to find disclosure patterns, but it will not be able to name them. Humans will do that. Each named disclosure gets added to the list along with the rules to discover it.
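To make the mechanical steps above more concrete, here is a minimal Python sketch of reading the presentation relations and applying one rule. It is only an illustration under stated assumptions: the rule format (a name plus a set of required concept names), the function names, the role URIs, and the tiny sample linkbase are all made up for this example and are not the actual rules or files used in the project. The only parts taken from the XBRL and XLink specifications are the two namespace URIs.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the XBRL 2.1 and XLink specifications.
LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"


def linkbase_name(instance_name, suffix):
    """Turn 'abc.xml' into 'abc_pre.xml' (suffix 'pre') or 'abc_cal.xml' (suffix 'cal')."""
    if not instance_name.endswith(".xml"):
        raise ValueError("expected an instance file name ending in .xml")
    return instance_name[: -len(".xml")] + "_" + suffix + ".xml"


def read_networks(pre_xml_text):
    """Map each network (role URI) to the set of concept names it references."""
    root = ET.fromstring(pre_xml_text)
    networks = {}
    for plink in root.iter(LINK + "presentationLink"):
        role = plink.get(XLINK + "role")
        concepts = networks.setdefault(role, set())
        for loc in plink.iter(LINK + "loc"):
            href = loc.get(XLINK + "href", "")
            if "#" in href:
                # The fragment after '#' identifies the concept, e.g. dei_DocumentType.
                concepts.add(href.split("#", 1)[1])
    return networks


def apply_rule(networks, rule, labelled):
    """Label every not-yet-labelled network that contains all concepts the rule requires."""
    for role, concepts in networks.items():
        if role in labelled:
            continue  # already identified, no need to read this network again
        if rule["required"] <= concepts:
            labelled[role] = rule["name"]
    return labelled


# Hypothetical rule: the real "document information" rule may look different.
DOCUMENT_INFORMATION_RULE = {
    "name": "Document information",
    "required": {"dei_DocumentType", "dei_EntityRegistrantName"},
}

# Made-up, heavily trimmed presentation linkbase used only for this example.
SAMPLE_PRE = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:presentationLink xlink:type="extended"
      xlink:role="http://example.com/role/DocumentAndEntityInformation">
    <link:loc xlink:type="locator" xlink:href="dei.xsd#dei_DocumentType" xlink:label="a"/>
    <link:loc xlink:type="locator" xlink:href="dei.xsd#dei_EntityRegistrantName" xlink:label="b"/>
  </link:presentationLink>
  <link:presentationLink xlink:type="extended"
      xlink:role="http://example.com/role/BalanceSheet">
    <link:loc xlink:type="locator" xlink:href="gaap.xsd#us-gaap_Assets" xlink:label="c"/>
  </link:presentationLink>
</link:linkbase>"""

if __name__ == "__main__":
    print(linkbase_name("abc-20151231.xml", "pre"))
    print(apply_rule(read_networks(SAMPLE_PRE), DOCUMENT_INFORMATION_RULE, {}))
```

The `labelled` dictionary plays the role of the "mark the network" step: once a network has a label, later rules skip it. The same `linkbase_name` helper with the suffix `cal` gives the calculation relations file, which could be parsed in a similar way for roll up information.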
So that is the initial process. Once this process is understood, we will build on it. The networks that contain the presentation relations will work to a degree and are a good starting point, but working at the network level will eventually fall apart. Why? Because sometimes networks contain multiple disclosures. The solution? Use the "fact set" level.
Eventually, we will want to tune this further by separating the reporting entities into different lists. For example, banks have different disclosures than software companies. Insurance companies differ from both banks and software companies. Understanding these differences helps one understand how this information will be used.
Once we get all this working for US GAAP reports, we will repeat this process for IFRS reports.
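Whichever unit ends up being used, network or fact set, the clustering idea can be sketched by treating each not-yet-identified unit as a set of concept names and grouping units whose sets overlap heavily. The sketch below uses Jaccard similarity with a simple greedy "leader" clustering, which is only a stand-in for whatever clustering algorithm is eventually chosen. The unit identifiers, concept names, and the 0.5 threshold are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(units, threshold=0.5):
    """Greedy leader clustering.

    units: dict mapping a unit id to its set of concept names.
    Each unit joins the first cluster whose seed (first member) is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of unit ids.
    """
    clusters = []
    for unit_id, concepts in units.items():
        for c in clusters:
            if jaccard(concepts, c["seed"]) >= threshold:
                c["members"].append(unit_id)
                break
        else:
            clusters.append({"seed": concepts, "members": [unit_id]})
    return [c["members"] for c in clusters]


# Made-up units that no existing rule has identified yet.
UNIDENTIFIED = {
    "co1/net7": {"TaxAtStatutoryRate", "StateTaxes", "TaxExpense"},
    "co2/net9": {"TaxAtStatutoryRate", "ForeignRateDifference", "TaxExpense"},
    "co1/net3": {"Inventory", "RawMaterials", "FinishedGoods"},
    "co3/net4": {"Inventory", "FinishedGoods", "WorkInProcess"},
}

if __name__ == "__main__":
    for members in cluster(UNIDENTIFIED):
        print(members)
```

Each resulting group is a candidate disclosure pattern. A human would still have to name it, and the concepts the members share could serve as a first draft of the rule for that disclosure.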
Article originally appeared on XBRL-based structured digital financial reporting (http://xbrl.squarespace.com/).
See website for complete article licensing information.