This blog post attempts to synthesize and summarize the blog posts: Understanding Database/Query Options Part 1, Part 2, and Part 3. If you are reading through this and desire additional details, go to those three parts, the details are there somewhere. This information leads me to the belief that a new global standard spreadsheet alternative would be a good thing.
We live in the information age which is a result of the digital revolution. The volume of information is growing at an exponential rate. The majority of that growth is expected to be in the form of structured information. This graph below shows one projection of the growth of structured information:
An increasing number of people are referring to all that data as "Big Data". Sometimes people use the term big data to describe large data sets, other times they use it to describe the exponential growth and quantity of data.
This is my explanation of big data:
BIG DATA is the notion that you no longer have the luxury of treating one database as 'the database' and putting all of the information you have into that one database. Data that you need exists within your organization and external to your organization. The data varies in the representation structure/style (table-type, tree-type, graph-type). It varies by system (Windows, Mac, Linux, etc.). It varies by format: Excel, email, RDBMS, RDF, CSV, etc. The applications used to manage the data vary. Some data is unstructured, an increasing amount is structured. The volume of information is high and it is getting higher. This information is an asset which should not be squandered. The velocity which data grows is increasing rapidly. Some of the information changes rapidly. Some of the data is generated by machines faster than humans can consume it.
The cause of this explosion in information, more and more computers in an ever increasing networked world, is also the solution to the problem. We need the machines to help humans create or use this information asset. Help from machines will reduce costs of using this information asset, reduce the time needed to make use of this asset, and increase the quality of the information asset.
People are generally not really interested in data, per se. They are interested in what you can do with the data. They want to understand data relative to other data, to compare information.
- Data is the most basic level; discrete facts or observations, but unorganized and unprocessed and therefore has no real meaning or value because it has no context; for example, "10,000" is data.
- Information adds context to the data; information is data in context, it has meaning; for example, "Sales for ABC Company for 2012 is $10,000 is information.
- Knowledge adds how to use information to the equation; knowledge is a framework for evaluating and interpreting information which makes use of experience, values, expert insight, intuition with the goal of evaluating and incorporating new experiences and information; for example, the sales for every public company organized in useful ways is knowledge.
- Wisdom adds when to use knowledge; wisdom relates to knowing why and doing something useful; wisdom = knowledge + experience; for example, exercising judgment to sell your shares of some stock because the sales relative to the sales of other public companies and relative to other numbers on a financial statement is wisdom.
To get a machine such as a computer to interact with something in the real world you must express things in a form a computer can understand. Something that is both understandable by humans so that the humans can be sure it is right and understandable by computers enables a computer to help humans perform work. The more a computer understands, the more a computer can do for humans. The more clarity this computer readable information holds, the more successful multiple computers will be in sharing and effectively using the shared information.
These machine readable representations, sometimes called conceptual models or domain models, are more information. The machine readable representation distills down to the important "things" which exist within some business domain and the "relations" between those important things. These machine readable representations of a business domain can be more formal or informal. The representations can have a high level of information about the business domain or a low level of information. The representations can be global standard representations or they can be proprietary representations. Achieving high-semantic clarity or a lower-level of semantic clarity is a choice which is made by a business domain. The choice is based on what the business domain needs the system to do.
These relations between the things of a business domain tend to be highly-structured or semi-structured.
There are three primary formats for representing/modeling these highly-structured and semi-structured information. Each representation approach has pros and cons. Each of these formats has a standard query language:
- Table-type representation - queried using SQL
- Tree-type representation - queried using XQuery or JSONiq
- Graph-type representation - queried using SPARQL
Each of these representation approaches is 100% interchangeable. The key to performace is the indexing scheme, not the information format. Each of these formats has an indexing scheme.
Different database management systems are used to manage these different representation/modeling approaches. Some representation formats fit better into some specific types of database management system. Each database management system likewise has pros and cons for representing information. These pros and cons tend to distill down into the following constraints which must be considered (i.e. other constrains may need to be considered based on some specific situation, but these constraints generally should always be considered):
- Ease of use for business users (as business users manage this information, not IT people)
- Level of query power and sophistication
- Query performance, speed
- Initial system cost and ongoing operating cost
- System management and maintenance cost
- Ability to interoperate with other business systems internal or external to organization
- System flexibility, agility, adaptability by business users, not the IT department
- System scalability
There can be a need to make tradeoffs if a system cannot meet 100% of these constraints. Historically systems seemed to be stratified by the two primary categories: online transaction processing (OLTP) and online analytical processing (OLAP).
Another extremely popular system is the electronic spreadsheet, particularly the very popular Microsoft Excel. But spreadsheets have issues. The disadvantages of spreadsheets are well understood. How to solve the problem of spreadsheets, which has been referred to as "spreadsheet hell", is less well understood.
Now, OLAP systems are increasingly making use of spreadsheets to allow business users to interact with information contained in OLAP systems using spreadsheets or pivot tables. People are building other tools to fight the symptoms which are experienced by the users of spreadsheets.
But none of these approaches is solving the real problem of spreadsheets, they only fight the symptoms which business users and IT departments experience. What if spreadsheets were totally reengineered? What would the requirements be?
Well, here is my summary for the requirements of a better spreadsheet. This new improved version of a spreadsheet is not intended to replace 100% of all existing spreadsheets. Rather, this is intended to be a new alternative, a new category of spreadsheet. An alternative which could be used in 20% of the cases where more control is needed over spreadsheets.
- Readable by both humans and machines: A spreadsheet should be readable by both humans and machines. Information provided within a spreadsheet should be more a representation of information than presentation oriented. The representation can be presented in sheets, rows, columns, and cells but this is done leveraging information metadata and commonly understood patterns. 100% pixel perfect renderings are specifically not a requirement.
- Global standard format: The format of the spreadsheet should be a global standard, not controlled by one software vendor.
- Agreed upon level of semantics: The creators and consumers can agree on the level of semantic clarity they will make use of for a spreadsheet. The spectrum can range from no semantics at all (which is similar to today's spreadsheet) or a high level of semantics expressed by a highly controlled representation model.
- Separation of representation and presentation: The "representation" and the "presentation of the representation" should not be intermingled.
- Business rules separable from spreadsheet: Business rules should be separated from the information when desired, integrated with the spreadsheet when necessary. Business rules which are external to the spreadsheet can be used to "watch over" the things and relations within the spreadsheet. The business rules can be made available publicly via a URL, privately via a private URL, etc.
- Managed global standard: The better spreadsheet should be a global standard under the control of someone like OMG, XBRL International, ISO, Apache OpenOffice, or some other such organization.
- Provide a formal shape but be domain neutral filler: One formal shape should be agreed to, for example the multidimensional model, but the pieces which fit into that shape or "fill" the shape are domain neutral, controlled by the business domain.
- Format should allow for versioning, collaboration, etc: The syntax format should allow for ease of versioning, constructing systems which are collaborative in nature (multi-user).
- Straightforwardly usable over the Internet: The format should be compliant with internet standards.
- Support a wide variety of common business use cases: A wide variety of common business use cases would be served, but it is not a goal to solve every business problem which exists.
- Highly limited options: The number of optional features is to be kept to the absolute minimum, ideally zero. Multiple approaches to solving a problem are not necessary when one will do.
- Formal and concise design: The design must be formal, concise, well designed and well engineered.
The spreadsheet I am describing is somewhat of a "micro system". It allows business users to construct robust systems using what amounts to well engineered Lego-type building blocks created by IT professionals. This avoids or reduces information rekeying. This avoids or reduces the ever expanding "point solutions" which are created using spreadsheets. These micro sytems also will easily integrate into larger business systems.
I cannot really take credit for any of these ideas. I am simply summarizing the ideas of others into one consolidated list. There is a tremendous amount of information here. The best way to get your head around this information is to read this blog post, go read through the details of the other three blog posts mentioned at the beginning, and then read this blog post again.
Is this achievable? Is this even useful? What do you think? Personally, I believe XBRL is already this standard. Go look at SEC XBRL financial filings. Each report component is a spreadsheet.