OK, I am getting closer and closer to where I need to be in terms of understanding format/database/query options and related issues/opportunities. Conversations with others has provided additional insights and information. Building on this prior post:
People tend to agree that there are three primary formats for representing/modeling highly structured and semi-structured information and that there are different types of databases or other approaches to storing this information:
- Table-type format (homogeneous, tabular, consistent): relational databases, CSV, spreadsheets, or tabular-type representations which allow only one level of hierarchy within each table; but hierarchies can be constructed by relating tables
- Tree-type format (heterogeneous, arborescent): XML, XBRL (using tuples), JSON and other tree-hierarchy-type information which allow for the expression of one hierarchy; hierarchical type databases, object type databases (Note that a tree is a special case of graph. See here to understand the difference)
- Open-type or Graph-type format (heterogeneous, arborescent): RDF, EAV, XBRL (using dimensions) and other open schema-type or graph-type representations which are more graph-oriented or network-oriented (many-to-many) and allow for dynamically creating virtually any number of hierarchies; supports the notion of cycles; very flexible; network type databases; RDF triple stores.
People tend to agree that data formats are 100% interchangeable. A "tree" can be expressed in the form of a table and put into a relational database. A "graph" can be expressed in the form of a table and likewise be put into a relational database. A "table" can be expressed in the form of a tree or graph. These formats are syntax and any of these syntaxes can be used to store any type of information.
Now performance is important. Performance is created in any data format via the use of indexes. Relational databases have indexing capabilities, hierarchical databases have indexing capabilities, and network type databases have indexing capabilities.
People tend to agree that information is more interesting than data. It is not data that people are after, people are after information for the sake of knowledge. Relational databases are popular because of the 'relational' piece, not the 'data' piece. Relational databases are about, in part, organizing sets and relating one set with another set, getting the answers to questions back fast and easy across multiple sets. This is about using information with other information, comparing information. It is not about having a store of data. It is about making use of that data.
People tend to agree that the volume of information is growing rapidly. "Big data" is one of the new buzz words. Ask people what "big data" means and the average business person probably could not tell you, but they will tell you that they need some because some software vendor says everyone needs big data! I have heard to good definitions of big data. The first definition is:
BIG DATA: Data that's disparately located, varied in structure, voluminous, and rapidly changing.
The second definition is something like this:
BIG DATA: Data that is generated by machines. The data is "big" because the machines can generate the data faster than humans can consume the data. Humans really cannot create big data.
I would synthesise those two definitions and some other things that I know into the following explanation of what big data is:
"BIG DATA" is the notion that you no longer have the luxury of treating one database as 'the database' and putting all of the information you have into that one database. Data that you need exists within your organization and external to your organization. The data varies in the representation format (table, tree, graph). It varies by system (Windows, Mac, Linux, etc.). It varies by structure. The volume of information is high and it is getting higher. The velocity which data grows is increasing rapidly. Some of the information changes rapidly. Some of the data is generated by machines faster than humans can consume it. Welcome to the information age!
People tend to agree that there are four things which make it possible for one system to interoperate with another system. Another way to look at it is that HOW two systems CAN interoperate. The clarity of an interaction is determined by these four things:
- Classification system used
- Power of the technical syntax to express information
- Business rules used to force information quality to be high
- Interoperability between systems (system, information syntax, information structure, domain semantics, process/workflow protocol)
This seems like a lot of work. Why the heck would you want to even do this? The answer to that question is simple: automation. This is all about enabling computers to do more work. Again, welcome to the information age.
People tend to agree that financial information tends to be "flat" and it seems to fit nicely into spreadsheets and relational database tables. Some believe this so much so, they wonder why XBRL isn't more like CSV.
While financial information does seem to be flat and you can put it into spreadsheets and relational databases; each report component of a financial report tends to have a different shape or could be a different shape and therefore you tend to have to put reach report component into a different spreadsheet.
For example, take a look at this financial report which is expressed digitally. Here is the rendering of how a human might work with the information. Here is the underlying fact table which shows how the information might be stored in a database. This is where people tend to stop thinking about that is stored. But, you also need to store the information about the representation, I call it the model structure. And you will also want to store the business rules which help communicate important relations and verify that the relations are intact.
So you can see that while each piece of a financial report can be seen as tabular, each table is potentially different and is determined by the different "dimensions" (or [Axis], or Aspects, or characteristics, people use different terms) associated with the information.
That is what causes the different information shapes, different dimensions used on different report components of a financial report. For example, the "Business segment dimension" is appropriate for the "business segment breakdown" but it is not used on the balance sheet or inventory components breakdown. So, that is why dimensions can vary in tables/spreadsheets, that is what causes the need to store the differently shaped report components in different relational database tables.
Now, some people have the *brilliant* idea of not using those dimensions, let's just ignore the fact that the dimensions exist. Really? This idea is so absurd that I am not even going to address it. Financial information has dimensions.
And so, while financial information can be considered tabular and that tabular information can fit into a spreadsheet or relational database, ever report component needs to be stored in a different table because every report component could have a different shape. No problem, you can create as many tables as you want in a relational database.
Or rather, the IT department can add that new table for you.
To add a new dimension in a relational database, what do you have to do? You have to modify the database schema. Bad for business users, that is an IT task. Most business users cannot modify database schemas, create the proper indexes, etc. Now, it is possible to automate some or much of the task of adding tables, indexes, and all the other stuff needed to modify a database schema. But that takes programmers writing code. And writing code takes time and costs money. So business users have three choices: (a) learn to make these sorts of changes to relational databases themselves, (b) wait for the IT department to do it for them, (c) pay for the creation of software which makes these technical oriented adjustments for them.
People tend to agree that there are different query languages:
- SQL: Structured Query Language, a global standard query and functional programming language used by relational databases. SQL queries table-type data.
- XQuery: XQuery is a global standard query and functional programming language that is designed to query and transform collections of structured and unstructured data, usually in the form of XML or text. XQuery queries tree-type data.
- JSONiq: JSONiq is an open, third-party extension of XQuery which, among other things, expands XQuery to be used with JSON formatted information.
- SPARQL: SPARQL Protocol and RDF Query Language is a global standard RDF query language, it is used to query graph-type or open-type data formatted in RDF.
- MDX: Multidimensional eXpressions, a query language for OLAP databases. This was first developed by Microsoft but it seems to be sort of a standard, it does seem to be popular.
People tend to agree that SQL is a fantastic query tool. Personally I love SQL. And, because we said above that any information can be expressed using any data format, clearly we can express all these tables in a relational database. You can create foreign keys and relate any relational database table to any other relational database table. No problem there. IT can do that for you or you can do that yourself.
However, the queries you have to create to get information out of that relational database get increasingly complex with more and more of these types of relations. Further, the database schema becomes increasingly hard to understand because how the information is structured is really not that intuitive.
Some people believe that you can convert SQL and SPARQL queries to XQuery. I don't totally grasp this presentation.
What does this mean?
People don't tend to realize that trying to make a relational database do things that it was not really built do do causes complexity which makes things harder to understand, harder to create, and harder to maintain.
Now again, technology people can hide some, or perhaps even all, of this complexity from business users. But when they have to deal with maintenance issues (adding dimensions, maintain schemas, writing queries, etc.), spend time to write code to automate tasks and make queries easier, and otherwise give business users what they need; that comes at a cost and it takes time.
People tend to agree that relational databases are a very mature, tested, stable, well understood, popular, robust, sophisticated tools. There are a lot of people who understand how to administer relational databases, develop relational databases, maintain relational databases, etc. There are a lot of incredibly useful features which relational databases have such as fault tolerance, commit-rollback, replication, etc. However, relational databases do have their weaknesses. No tool can do everything.
People tend to agree that business users LOVE Excel. Excel is an incredibly useful tool. I am not saying electronic spreadsheets, I am saying Microsoft Excel electronic spreadsheets. Until someone comes up with something better, Excel is the king of the hill. Lotus 1-2-3 WAS the king of the hill at one point, Microsoft changed that; someone else COULD dethrone Microsoft but they have not done so yet.
It seems to me that if business users could keep their spreadsheets, if they could do powerful queries against all that "big data" inside and outside their organization, across any business system, the information would not have any errors in it (i.e. information quality is high) and the IT department could control and maintain everything cheaply and easily then life would be good.
Personally I have zero allegiance to Microsoft or Excel. I do have allegiance to good, useful software. I used VisiCalc, ditched that for Lotus 1-2-3, and then ditched that for Excel because Excel was better.
I am getting tired and the dog needs a walk, so I am going to call this good for now. Stay tuned for Part 2!