Terabyte disks cost less than $50. If historical trends continue, then in 2022 you’ll be able to buy petabyte disks for a similar price. But it’s unlikely you’ll own one, because by then most of the world’s data will have waltzed its way into the cloud. A petabyte disk would be able to hold about a million full-length movies, if mankind had actually made a million movies, but we haven’t. We’ve only made about 800,000, according to imdb.com. Movies consume disk space voraciously. They gobble up more space than sound recordings, which consume more megabytes than photos, which in turn dwarf text files, which in turn are hungrier for disk space than structured records stored in databases.
The Internet is voracious. It chews up about 6 exabytes of storage space (as of 2010). But that is less than 1 percent of all the data that’s stored, which now totals just under 1 zettabyte (i.e. 1000 exabytes or a million petabytes). Data growth has averaged about 60 percent per annum since the first computer stored the first file on the first magnetic tape, which suggests mankind will be storing a yottabyte of data by 2025. And if you really get a kick out of extrapolation you’ll be delirious to know that if that growth rate continues indefinitely then by the year 2325 we’ll be storing more bytes of data than there are atoms in the universe. Or not.
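For anyone who wants to check the extrapolation, here is a minimal sketch of the compound-growth arithmetic. It assumes a base year of 2010, a starting total of 1 zettabyte (10^21 bytes), and a commonly quoted rough estimate of 10^80 atoms in the observable universe. The base year and the atom count are assumptions added for illustration, and the function names are hypothetical.

```python
import math

ZETTABYTE = 10**21          # bytes; starting total taken from the text
YOTTABYTE = 10**24          # bytes; 1000 zettabytes
BASE_YEAR = 2010            # assumption: the "now" of the figures above
GROWTH = 0.60               # 60 percent per annum, from the text
ATOMS_IN_UNIVERSE = 1e80    # assumption: a rough, commonly quoted estimate


def years_to_reach(start_bytes, target_bytes, annual_growth):
    """Years of compound growth needed to go from start_bytes to target_bytes."""
    return math.log(target_bytes / start_bytes) / math.log(1 + annual_growth)


def projected_bytes(year, start_bytes=ZETTABYTE, annual_growth=GROWTH):
    """Total bytes stored in a given year if the growth rate never changes."""
    return start_bytes * (1 + annual_growth) ** (year - BASE_YEAR)


yottabyte_year = round(BASE_YEAR + years_to_reach(ZETTABYTE, YOTTABYTE, GROWTH))
print(yottabyte_year)                              # 2025
print(projected_bytes(2325) > ATOMS_IN_UNIVERSE)   # True
```

Under these assumptions a thousandfold increase takes a little under 15 years, which is where 2025 comes from.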
The Signal And The Noise
Bombyx mori silkworms make silk. You may know that. But 2000 years ago it was a secret and you wouldn’t have known it. It was an iron secret. Silk was incredibly valuable and the Chinese merchants who made it kept the knowledge to themselves. They kept the secret for well over 2000 years. It is the best-kept secret known to history, by a long, long way.
Clearly, some information is supremely valuable. But it used to be very difficult to store. Before the invention of the first automated information technology (printing), the storing of information was achieved in just two ways:
- Writing it down by hand. Whole monasteries existed in the Middle Ages to do nothing more than reproduce copies of The Bible. They were human Xerox communities.
- Storing it in the memory of appropriately trained people. Epic poems such as The Iliad were composed in a regular meter (dactylic hexameter, in the case of The Iliad) so that they could be remembered accurately.
We did the best we could to store valuable information. But now it’s easy and costs almost nothing. Consequently, some of the information we store has no real value. For example, the world’s population of security cameras is recording petabytes while you read this sentence. More than 99.999 percent of that data is utterly useless and will be discarded after a few days or weeks. Most of the cameras themselves will never record anything worth retaining. But that doesn’t make the security systems they serve useless. It’s one of those situations where the signal-to-noise ratio is extraordinarily low, but nevertheless it is worth tolerating the noise for the sake of the signal. The value of the signal is high and, at 10 cents a gigabyte, we can afford to store the noise for a while.
Other examples of low signal-to-noise ratios include log files on computers (which you only care about when something goes wrong), archived email (which you hardly ever care about) and most data backups (which are rarely called on to strut their stuff). If you proposed that such data shouldn’t be recorded you’d probably get fired for stupidity. The value of the data is in its potential use.
Business Intelligence and the Big Data Heaps
Every now and then a new area of application comes along which requires the aggregation of a huge heap of data. Internet search was perhaps the first application to become prominent in that context. To search the Internet, you first have to read it all. There were, and still are, thousands if not millions of sites that get almost no traffic, but the search engine still needs to read those faded web pages just in case someone somewhere wishes to retrieve the information they contain.
And then there are the click streams. On web sites, businesses want to know who clicked on what and why. The only way to know is to collect every click and analyze it all, even though many of the clicks have no real relevance to improving web site effectiveness. Again we tolerate the noise for the sake of the signal.
Possibly the largest heaps of data come from social networks, where hundreds of millions of people are being tracked every day. A particular example here is the multiplayer Internet games, where everything every user’s avatar does is recorded. As far as I know, no one is recording all the images yet, but if storage gets cheap enough it will happen in time.
The formula is simple enough. If:
Cost of Storage + Cost of Analysis < Value of Analysis
then the data will be stored.
As the cost of storing data falls by 60-75% each year, and so does the cost of analyzing it, the left-hand side of this inequality shrinks all the time, making it worthwhile to analyze what was previously ignored.
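To see how quickly falling costs tip the balance, here is a minimal sketch of the inequality in action. It assumes that storage and analysis costs fall at the same fixed annual rate and that the value of the analysis stays constant. Both are simplifications, and the dollar figures and the function name are hypothetical, chosen only for illustration.

```python
def first_worthwhile_year(storage_cost, analysis_cost, value,
                          annual_decline, max_years=100):
    """Return the first year (0 = today) in which
    storage cost + analysis cost < value of analysis,
    or None if that never happens within max_years."""
    for year in range(max_years + 1):
        # Assumption: both costs shrink by the same fraction every year.
        remaining = (1 - annual_decline) ** year
        if (storage_cost + analysis_cost) * remaining < value:
            return year
    return None


# Hypothetical heap of data: $800 to store and $200 to analyze today,
# for an analysis worth only $50.
print(first_worthwhile_year(800, 200, 50, annual_decline=0.60))  # 4
print(first_worthwhile_year(800, 200, 50, annual_decline=0.75))  # 3
```

In this made-up case, data that costs twenty times more than it is worth today becomes worth keeping within three or four years.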
There is no obvious end to this situation, as long as the costs keep falling. It’s possible that we might run out of things to analyze, but you only need to talk to the very large database (VLDB) vendors to discover that business is booming rather than waning. There is no shortage of applications.
So back to the original question: Do we need all this data?