Cosmic Convergence

January 5, 2012

I have a cute diagram for you. But first, as Frankie Howerd would say, The Prologue. Some background is necessary in case you get the wrong take-home message once I unveil the interestin’ picture…

Periodically somebody frightens us with tales of how The Data Deluge Threatens Science. (Yes, guilty.) Cynics will suggest that Moore’s law runs even faster, so computers will always be good enough – what’s the problem? Actually, I think the interesting things are those that are not improving exponentially – last mile bandwidth, disk I/O speed, and human resources. The first two are a bit of a bummer for individuals, but fixable if you are a pro data centre – tune your TCP buffers, open multiple channels, hang thirty discs off your motherboard, etc. So this drives us to a service architecture. But so does the third thing: people effort. Developing interfaces to the data, and curating it as it comes in, takes work. A lot of this scales per archive rather than per bit. The real problem, and why we need the VO, is the number of archives, not the number of bits. It’s a Tower of Babel thing.

Enough of the VO lecture. The size of each major astronomical archive certainly is growing fast. But so is the size of a typical hard drive. As part of some work with Bob Mann and Mark Holliman, I was collecting some data on these things, when suddenly it occurred to me not just to talk in general terms but to actually plot one on top of the other. Exhibit A therefore shows (i) the evolution of the size of PC disks, from the Wikimedia Commons page here, and (ii) the total size of the ESO Science Archive, divided by 200 – data kindly provided by Paolo Padovani. The step function in 1999 is real – it’s when the VLT switched on. Otherwise, they track each other beautifully. Over two decades, while volumes have increased by factors of tens of thousands, the number of PC drives needed to hold the ESO data has stayed the same within a factor of two.
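For anyone who wants to try the same comparison on their own numbers, here is a minimal sketch of the arithmetic behind "the same within a factor of two". The yearly values are made-up placeholders, chosen only so the ratio sits near 200; they are not the ESO or Wikimedia Commons data, and the function names are hypothetical.

```python
def drives_needed(archive_tb, drive_tb):
    """How many typical PC drives it would take to hold the archive."""
    return archive_tb / drive_tb


def spread_factor(values):
    """Largest value divided by smallest; below 2.0 means 'within a factor of two'."""
    return max(values) / min(values)


# PLACEHOLDER numbers for illustration only (not real measurements).
years = [1992, 1997, 2002, 2007, 2012]
drive_tb = [0.0002, 0.004, 0.08, 0.5, 2.0]    # typical PC drive capacity, TB
archive_tb = [0.03, 0.9, 20.0, 90.0, 400.0]   # total archive size, TB

ratios = [drives_needed(a, d) for a, d in zip(archive_tb, drive_tb)]

for year, ratio in zip(years, ratios):
    print(f"{year}: about {ratio:.0f} drives")

# Capacity grows by a huge factor while the drive count barely moves.
print(f"drive capacity growth: x{drive_tb[-1] / drive_tb[0]:.0f}")
print(f"spread in drive count: x{spread_factor(ratios):.2f}")
```

Dividing the archive size by a fixed number of drives, as in the plot, is the same idea seen the other way round: if the ratio were exactly constant, the two curves would sit on top of each other.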

I guess both of these things, like a number of others, are loosely driven by a combination of integrated circuit technology and economics. But the fact that they are so close seems surprising. Maybe there is some kind of weird technology central limit theorem thing going on.

Comparison of growth of ESO archive with hard drive capacity. Disk data used under GPL.