Data Doesn’t Know What It Is

by Robin Bloor on December 12, 2011

I’ll tell you a secret. Digital data doesn’t know what it is.

Pause for a moment and take a dollar bill from your wallet or purse, and look at it. If you want to know what data-that-knows-what-it-is looks like – you are staring at an example.

What information does this example of data-that-knows-what-it-is contain?

  1. It declares that it is a Federal Reserve Note, so it knows who created it. It even bears the signatures of the Treasurer of the United States and the Secretary of the Treasury. And, in case you are interested, it confesses which Federal Reserve Bank issued it.
  2. It further declares that it is a One Dollar note of The United States of America – in effect letting you know the geographical region in which it is legal tender. And in case you had any concern about its fungibility, it announces directly to you that “This note is legal tender for all debts, public and private.”
  3. It bears a unique serial number of 8 digits, which is positioned between two letters. This serial number really is unique. So if you happen on two notes with the same serial number, you know for sure that one is a forgery.
  4. There is a whole set of data on the note to help you identify forgeries, from the intricate design through to the material it is made of (a blend of cotton and linen rather than ordinary wood-pulp paper) and the ink used to print it.

A dollar bill is not worth a dollar in respect of its physical value. It costs about 4 cents to print, so intrinsically it is worth 4 cents no matter whether it has a face value of $1, $5, $10, $20 or $100. However, because it bears a guarantee from the Federal Reserve Bank, its face value will be honored if challenged.

So what do I mean by data-that-knows-what-it-is? 

I mean data that explains itself to anyone or anything that seeks to process it. Paper currency is data of that ilk.

However, most electronic data is not of that ilk at all. If you select an item of data or a collection of items of data and just throw it at a program, the program will not know what to do with it and, more importantly, will not know what the data is. If you examine a data file, you have no certain knowledge of which program it came from. It is a lonely orphan carelessly abandoned by its parent program and, unless it is a well-defined file type such as MP3, you will have no idea what it contains.

XML To The Rescue – Almost

XML was defined in 1997 and unleashed on the world in 1998. It was quickly adopted for displaying data on web pages – and then some intelligent IT folk realized that, with XML, a collection of data could declare what its metadata was. This was a small step for IT but a giant leap for orphaned data. It was the first baby step towards data knowing what it is.

Before we delve further into that, let’s consider the way that data is currently passed from executable to executable when XML is not used. One program creates a file, and another program can use that file only if it knows the file’s metadata. Without that metadata, the receiving program may be able to detect a character string or a number, but it has no idea what the data represents. Even if it happens to know that the file contains financial transactions, it is still none the wiser. If the file carried its metadata with it, the program would at least be able to process the data, but that wouldn’t be enough, because it would not necessarily know what to do with it.
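To make the contrast concrete, here is a minimal sketch in Python. It compares a bare delimited record with the same values wrapped in XML. The element and attribute names (transaction, accountId, currency and so on) are invented for illustration and do not come from any real standard. Note that the second program still learns only what the fields are called, not what it ought to do with them.

```python
import xml.etree.ElementTree as ET

# A bare record: the receiver can spot digits and a decimal point,
# but nothing in the data says what the values represent.
bare_record = "20111212,4471,19.99"

# The same values with their metadata attached.
# All tag and attribute names here are hypothetical.
xml_record = """
<transaction>
  <date format="YYYYMMDD">20111212</date>
  <accountId>4471</accountId>
  <amount currency="USD">19.99</amount>
</transaction>
""".strip()


def describe(xml_text):
    """Return {field name: (value, attributes)} read from the data itself."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text, dict(child.attrib)) for child in root}


fields = describe(xml_record)
for name, (value, attrs) in fields.items():
    print(name, value, attrs)
```

The receiving code needs no prior knowledge of the record layout. The field names and the currency arrive with the values.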

Web Services standards eased such problems to some degree, but not completely. WSDL (the Web Services Description Language) made it possible to describe the operations a service offers and the messages it accepts, so that a program could know what to do with data sent to it.

But That’s Still Not Good Enough

Consider a photograph. Nowadays it is a digital file, and digital cameras ensure that a good deal of metadata is attached to the photo: the type of camera, the date and time, etc. Nevertheless a digital photo is still short of a good deal of information. For example, it doesn’t know who owns it, who has the right to look at it and who has the right to publish it. It could carry that information too. Data that knew what it was would carry that information. And notice, by the way, that such data is different to data passed from one program to another in a Web Services interaction, because a digital photo is not dependent on a specific program for its context. It can exist in its own right as an electronic artifact and it can be processed by many programs.
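As a rough illustration of a photo that carries such information, here is a minimal Python sketch. The class name, the field names and the rule that the owner may always view and publish are assumptions made for the example. They are not part of any existing photo format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfDescribingPhoto:
    """Hypothetical photo artifact that carries its own rights metadata."""
    image_bytes: bytes
    camera: str                         # the kind of metadata cameras already attach
    taken_at: str
    owner: str                          # what today's photo does not know
    viewers: frozenset = frozenset()    # who else may look at it
    publishers: frozenset = frozenset() # who else may publish it

    def may_view(self, person: str) -> bool:
        # Assumption: the owner can always view their own photo.
        return person == self.owner or person in self.viewers

    def may_publish(self, person: str) -> bool:
        # Assumption: the owner can always publish their own photo.
        return person == self.owner or person in self.publishers


photo = SelfDescribingPhoto(
    image_bytes=b"",                    # placeholder for the pixel data
    camera="ExampleCam 100",            # made-up camera name
    taken_at="2011-12-12T09:30:00",
    owner="alice",
    viewers=frozenset({"bob"}),
)
print(photo.may_view("bob"), photo.may_publish("bob"))  # prints: True False
```

Because the rights travel with the artifact, any program that handles the photo can ask the same questions and get the same answers.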

Now let’s think about electronic money. And be sure to understand that I am not talking about records of bank balances locked away in some banking system somewhere – because that isn’t electronic money, it’s an electronic description of some money somewhere. Electronic money doesn’t exist right now, but it could. It could exist like a digital photo as an electronic artifact that one person could exchange with another.

And how would that improve the world?

It would change the world completely. You could have money that knew who owned it (and hence could not be stolen). You could have “underage money” to give to children as pocket money: money that could not be spent on cigarettes or alcohol. You could constrain money in many ways, so that it could only be spent on food, or only on gasoline, or only on travel. This is all feasible and, most likely, it will come to pass in one way or another. But before that happens we have to have data standards which ensure that data can know what it is and circulate freely.
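The following Python sketch shows what money that knows its owner and its constraints might look like as a data structure. Everything in it is hypothetical: the class, the category names and the serial number are invented, and it ignores the cryptography that a real system would need to stop someone simply editing the record.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ElectronicNote:
    """Hypothetical electronic note that knows its owner and its limits."""
    serial: str
    amount_cents: int
    owner: str
    allowed: Optional[FrozenSet[str]] = None   # None means "any category"
    blocked: FrozenSet[str] = frozenset()      # categories it may never buy

    def can_spend(self, spender: str, category: str) -> bool:
        if spender != self.owner:              # money that knows who owns it
            return False
        if category in self.blocked:
            return False
        return self.allowed is None or category in self.allowed

    def spend(self, spender: str, payee: str, category: str) -> "ElectronicNote":
        if not self.can_spend(spender, category):
            raise PermissionError(f"{spender} may not spend this note on {category}")
        # Assumption: the constraints stay attached after the transfer.
        return replace(self, owner=payee)


# "Underage money": pocket money that cannot buy cigarettes or alcohol.
pocket_money = ElectronicNote(
    serial="A12345678B",                       # made-up serial number
    amount_cents=500,
    owner="child",
    blocked=frozenset({"cigarettes", "alcohol"}),
)

# Money that can only be spent on food.
food_only = ElectronicNote("C87654321D", 2000, "dana", allowed=frozenset({"food"}))

paid = pocket_money.spend("child", "grocer", "food")
print(paid.owner)
```

Whether the constraints should survive a transfer, as they do here, or be cleared once the note reaches a merchant is exactly the kind of question a data standard would have to settle.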

Arun April 26, 2012 at 8:43 pm

Looking forward to the changed world where data knows what it is!

Since you mentioned digital photographs, some of the most important metadata they can carry is the color profile.
(see http://en.wikipedia.org/wiki/ICC_profile )
All software, including web browsers, ought to implement the use of the color profile. This enables as high a fidelity as possible in the rendering of the colors of a photograph.

With regard to data and metadata, an analogy from communications suggests itself – the control or signalling plane and the data or bearer plane. WSDL is an example of in-band signalling. The whole problem of metadata, revolving around relational databases, is that the signalling plane is separate and, for most implementations, not explicit. Perhaps the relational database can be engineered a bit more to allow default storage for the non-technical metadata, as well as to offer technical and non-technical metadata as a service; any data from the database could include a pointer back to the metadata service.
