The Species of Data

by Robin Bloor on March 12, 2012

“Data” is a term that has attached itself to a growing number of words in the business intelligence space: data management, database, data warehouse and data mining, just to name a few. The list is nearly inexhaustible. As pervasive as the term is, however, few have a good answer to the seemingly simple question: What is data?

We’ve already come at this question from two directions (see What is data?, and Why we should care about XML), but here’s a third perspective. Third time’s the charm, as they say.

In the general sense, a datum is the smallest element of information. It is the brick upon which foundations are laid. It is the atom to the molecule, the nucleotide to the DNA strand. But it is, and should be viewed as, far more than that.

Data is the plural of the Latin word datum, which means “something given.”

So let’s think of data as a gift.

Data as Entity – Species #1

Consider a neatly wrapped package, tied with a brightly colored bow. It is an entity unto itself, and in its unopened, restful state, it represents nothing other than a gift and its packaging. We can walk a circle around it and regard it from different angles, but it will remain unchanged. We can move it from one side of the table to the other, and yet it is still the same. The gift is a thing unto itself. It is an entity. And if we gather data about it as its state changes, we can assemble many snapshots that together record the whole lifecycle of that entity.
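The idea of snapshots can be made concrete with a minimal Python sketch. The class name, field names and sample values are all invented for illustration; the only point is that each snapshot describes the same entity at a different moment, and the most recent one gives its current state.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GiftSnapshot:
    """State of one gift entity at one moment (illustrative fields only)."""
    gift_id: int
    recorded_at: datetime
    location: str
    wrapped: bool


# Successive snapshots of the same entity, oldest first (made-up values).
history = [
    GiftSnapshot(1, datetime(2012, 3, 1, 9, 0), "left side of table", True),
    GiftSnapshot(1, datetime(2012, 3, 1, 9, 5), "right side of table", True),
    GiftSnapshot(1, datetime(2012, 3, 1, 9, 30), "right side of table", False),
]


def current_state(snapshots):
    """Return the most recent snapshot, i.e. the entity's current state."""
    return max(snapshots, key=lambda s: s.recorded_at)
```

Moving the gift changes its location but not its identity: every snapshot carries the same gift_id.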

Event Data – Species #2

Let us imagine that we interact with the gift. Pulling the tail of the bow releases the ribbon. Sliding your finger under the tape undoes the wrapping. Removing the lid from the box reveals the reality of the gift. We could gather data about each of these events. The one-time action of opening the gift is data of a different kind: data as an event.

An event occurs when one thing interacts with another. The data regarding the event doesn’t have a lifecycle. It is always singular.
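By contrast, an event record can be sketched as something written once and never updated. In the hypothetical Python below (names and values are assumptions, not part of any real system), the dataclass is frozen so that an event cannot be changed after it has been recorded.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GiftEvent:
    """One interaction with the gift; frozen because an event never changes."""
    gift_id: int
    actor: str
    action: str
    occurred_at: datetime


# Each unwrapping step is its own one-time event (made-up values).
events = [
    GiftEvent(1, "recipient", "untie_ribbon", datetime(2012, 3, 1, 9, 28)),
    GiftEvent(1, "recipient", "remove_wrapping", datetime(2012, 3, 1, 9, 29)),
    GiftEvent(1, "recipient", "open_box", datetime(2012, 3, 1, 9, 30)),
]


def is_immutable(event):
    """Return True if assigning to the event raises, as a frozen dataclass should."""
    try:
        event.action = "something_else"
    except AttributeError:  # FrozenInstanceError is a subclass of AttributeError
        return True
    return False
```

Where the entity sketch accumulates new snapshots of one thing, each event here stands alone: there is no later version of "untie_ribbon".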

Data Collections – Species #3

But consider this gift. Let us say that it had a ribbon round it, it came in a box, and in the box there was also a greeting card, plus the intended gift itself: a gilded scroll on which are inscribed Ted Codd’s 12 Rules of Relational Database, numbered from 0 to 12 because, as I’m sure you know, the 12 rules were actually 13 rules. Now it is conceivable that the receiver of such a gift might not consider it a worthy possession. Consequently they choose to regift it to a friend who happens to be a Relational bigot. They wait for an appropriate occasion and then repackage the scroll, add an entirely different card and send it on its way. Alternatively, they are righteous in their Relational faith and have Ted Codd’s eternal words of wisdom framed and hung on the wall.

Either way, when we consider the gift entity we have to conclude that ribbons, boxes, cards and frames are really trappings and not the gift entity itself. Nevertheless, if we wish to record the lifecycle of this entity, clearly it is going to be worthwhile to collect all of this data together. In doing so we have a data collection. Data as a collection, of course, need not always be a collection of things referring to a single entity. It could, for example, be a collection of all the things in my wallet: credit cards, business cards, bank notes, receipts and a solitary toothpick.
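As a rough illustration, and assuming nothing beyond plain Python containers, a collection is simply data of mixed kinds gathered in one place. The keys and values below are invented for the example.

```python
# A collection gathered because everything in it relates to one entity.
gift_collection = {
    "gift": {"kind": "scroll", "text": "Codd's rules, numbered 0 to 12"},
    "packaging": [
        {"kind": "ribbon"},
        {"kind": "box"},
        {"kind": "card"},
    ],
}

# A collection need not describe a single entity at all.
wallet = ["credit card", "business card", "bank note", "receipt", "toothpick"]

# The trappings are recorded alongside the gift, but they are not the gift.
packaging_kinds = sorted(item["kind"] for item in gift_collection["packaging"])
```

Nothing requires the members of a collection to share a type, which is exactly what limits how much analysis a collection supports on its own.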

Data Aggregations – Species #4

Some data collections are more useful than others. A data aggregation is a collection of data of the same type: for example, every instance of a gilded scroll displaying Ted’s 12 Rules being given as a gift, together with who gave it, who received it, and which recipients regifted it or threw it away. By analyzing this data we can discover who we should give such a gift to and who we shouldn’t.

We will note, for example, that all teenage girls receiving such a gift threw it away. We note that RDBMS enthusiasts kept it, framed it and displayed it. We discover that some recipients who understood nothing of databases, or even of IT, kept it because they appreciated the copperplate writing on the scroll and didn’t really give a damn what it said. And so on and so forth.
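The kind of analysis described here can be sketched in a few lines of Python. The rows below are invented sample data, not real observations, and the group labels and outcome names are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Invented sample rows: (recipient_group, outcome). Same type throughout,
# which is what makes this an aggregation rather than a mere collection.
gift_outcomes = [
    ("teenager", "discarded"),
    ("teenager", "discarded"),
    ("rdbms_enthusiast", "framed"),
    ("rdbms_enthusiast", "framed"),
    ("rdbms_enthusiast", "regifted"),
    ("calligraphy_fan", "kept"),
]


def outcomes_by_group(rows):
    """Count each outcome per recipient group."""
    counts = defaultdict(Counter)
    for group, outcome in rows:
        counts[group][outcome] += 1
    return counts


def retention_rate(rows, group):
    """Share of gifts in a group that were kept or framed (0.0 if no rows)."""
    outcomes = [o for g, o in rows if g == group]
    retained = sum(o in ("kept", "framed") for o in outcomes)
    return retained / len(outcomes) if outcomes else 0.0
```

A group with a high retention rate is a good candidate for the next scroll; a group with a rate of zero is not.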

All Together Now

The neat thing about aggregations is that they often have useful information hidden inside them, which can be teased out. In fact, when it comes down to it, aggregations are what BI is all about. So let’s consider our gifted scroll again, but this time as a database of the data surrounding it and all other gifted scrolls. We can quickly deduce that the database will have the form of a very simple snowflake schema. The event table holds the events of the gift being given. Attached to the event table are the entities: the gift givers, the gift receivers and the gifts. And attached to the gifts is the data on all the packaging of each gift. The result is a simple snowflake, as illustrated.

The Gifting Data Model
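One way such a model might be written down is sketched below, using Python’s built-in sqlite3 module with an in-memory database. The table and column names are assumptions made for this sketch, not taken from the illustration, and the inserted rows are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE giver    (giver_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE receiver (receiver_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gift     (gift_id     INTEGER PRIMARY KEY, description TEXT NOT NULL);
-- Packaging hangs off the gift table, not the event table:
-- that second level of branching is what makes this a snowflake.
CREATE TABLE packaging (
    packaging_id INTEGER PRIMARY KEY,
    gift_id      INTEGER NOT NULL REFERENCES gift(gift_id),
    item         TEXT NOT NULL
);
-- Event table: one row per act of giving.
CREATE TABLE gift_event (
    event_id    INTEGER PRIMARY KEY,
    giver_id    INTEGER NOT NULL REFERENCES giver(giver_id),
    receiver_id INTEGER NOT NULL REFERENCES receiver(receiver_id),
    gift_id     INTEGER NOT NULL REFERENCES gift(gift_id),
    given_on    TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample rows.
conn.execute("INSERT INTO giver VALUES (1, 'Alice')")
conn.execute("INSERT INTO receiver VALUES (1, 'Bob')")
conn.execute("INSERT INTO gift VALUES (1, 'Gilded scroll')")
conn.executemany(
    "INSERT INTO packaging (gift_id, item) VALUES (?, ?)",
    [(1, "ribbon"), (1, "box"), (1, "card")],
)
conn.execute("INSERT INTO gift_event VALUES (1, 1, 1, 1, '2012-03-12')")

# Walk from the event out to each attached entity and count the packaging.
rows = conn.execute(
    """
    SELECT g.name, r.name, COUNT(p.packaging_id)
    FROM gift_event e
    JOIN giver g     ON g.giver_id = e.giver_id
    JOIN receiver r  ON r.receiver_id = e.receiver_id
    JOIN packaging p ON p.gift_id = e.gift_id
    GROUP BY e.event_id, g.name, r.name
    """
).fetchall()
```

This sketch ties packaging to the gift, so a regifted scroll with a new card would need either a new gift row or packaging keyed to the event instead. Which is right depends on what we want to analyze.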

And Then There is Persistence

Finally and realistically, we must decide what to keep and what to throw away. This is almost the same as with the gift itself. It may be that we can use that brightly colored ribbon for another decoration. Let’s keep it. We may also find other uses for the box. Let’s keep it. We can put these in the cupboard where they will persist. As for the wrapping paper, it is unlikely to be useful again, its purpose having been spent. Let’s discard it. And the card – it didn’t even have a witty message in it.

The same is true of the data. Maybe we don’t consider some of it relevant to anything. Maybe we don’t care about the packaging and keep no data about it, because we don’t care about the changes of state involved in unwrapping the gift.

The truth is that databases model events of the real world, but they do not have to model all the state changes of all the events that any entity undergoes. We may even choose to capture such data but not allow it to persist. There is data that persists and data that doesn’t. The persistent data is what we keep to use again in some other way.
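The distinction between capturing data and persisting it can be sketched as a simple filter. Everything here is hypothetical: the action names, the policy set and the sample events are invented to show the shape of the decision, not how any particular database does it.

```python
# Hypothetical policy: the captured event types we consider worth keeping.
PERSISTED_ACTIONS = frozenset({"gift_given", "gift_regifted", "gift_discarded"})

# Everything we captured as it happened (made-up sample events).
captured = [
    {"gift_id": 1, "action": "gift_given"},
    {"gift_id": 1, "action": "ribbon_untied"},
    {"gift_id": 1, "action": "wrapping_removed"},
    {"gift_id": 1, "action": "gift_regifted"},
]


def select_for_persistence(events, keep=PERSISTED_ACTIONS):
    """Return only the events the policy says to store; the rest are dropped."""
    return [e for e in events if e["action"] in keep]


stored = select_for_persistence(captured)
```

The unwrapping steps were observed but never stored, which mirrors keeping the ribbon and box while discarding the wrapping paper.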

So there we have it. What started out as a gift has now become a meaningful set of values. It has endured events, changes and processes. It has formed a whole database. It has enabled us to make projections and gain insight. And it has allowed us to keep what’s important.
