I have a cute diagram for you. But first, as Frankie Howerd would say, The Prologue. Some background is necessary in case you get the wrong take-home message once I unveil the interestin’ picture…
Periodically somebody frightens us with tales of how The Data Deluge Threatens Science. (Yes, guilty.) Cynics will suggest that Moore’s law runs even faster, so computers will always be good enough – what’s the problem? Actually, I think the interesting things are those that are not improving exponentially – last-mile bandwidth, disk I/O speed, and human resource. The first two are a bit of a bummer for individuals, but fixable if you are a pro data centre – tune your TCP buffers, open multiple channels, hang thirty discs off your motherboard, etc. So this drives us to a service architecture. But so does the third thing: people effort. Developing interfaces to the data, and curating it as it comes in, takes work. A lot of this scales per archive rather than per bit. The real problem, and why we need the VO, is the number of archives, not the number of bits. It’s a Tower of Babel thing.
Enough of the VO lecture. The size of each major astronomical archive certainly is growing fast. But so is the size of a typical hard drive. As part of some work with Bob Mann and Mark Holliman, I was collecting some data on these things, when suddenly it occurred to me not just to talk in general terms but to actually plot one on top of the other. Exhibit A therefore shows (i) the evolution of the size of PC disks, from the Wikimedia Commons page here, and (ii) the total size of the ESO Science Archive, divided by 200 – data kindly provided by Paolo Padovani. The step function in 1999 is real – it’s when the VLT switched on. Otherwise, they track each other beautifully. Over two decades, while volumes have increased by factors of tens of thousands, the number of PC drives needed to hold the ESO data has stayed the same within a factor of two.
I guess both of these things, like a number of others, are loosely driven by a combination of integrated circuit technology and economics. But the fact that they are so close seems surprising. Maybe there is some kind of weird technology central limit theorem thing going on.
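The flatness of that ratio is just what you would expect if both quantities grow roughly exponentially at similar rates. A toy sketch makes the arithmetic concrete – note that all the growth rates and 1992 starting values below are illustrative assumptions, not fits to the real ESO or drive data:

```python
# Toy model of the plot above: if an archive's volume and PC drive
# capacity both grow roughly exponentially at similar (not identical)
# annual rates, the number of drives needed to hold the archive drifts
# only slowly. Rates and starting values are illustrative assumptions.
archive_rate = 0.48   # assumed annual growth of archive volume
drive_rate   = 0.45   # assumed annual growth of drive capacity

archive_tb = 0.2      # hypothetical 1992 archive size, in TB
drive_tb   = 0.001    # hypothetical 1992 drive size (1 GB), in TB

for year in range(1992, 2013, 5):
    print(year, round(archive_tb / drive_tb), "drives")
    archive_tb *= (1 + archive_rate) ** 5
    drive_tb   *= (1 + drive_rate) ** 5
```

Even with a 3-percentage-point mismatch in the assumed rates, the drive count only creeps up by about 50% over two decades – comfortably "the same within a factor of two".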
I think there might be a cause-and-effect thing kicking in here. The data volume grows at the same rate as hard drive capacity, at least to first order(ish), because the people designing the instruments assumed that by the time they were finished, hard drives would have kept growing at their historical rate, and sized their data rates accordingly.
The correlation might not be chance, it might be by design.
Alasdair – that must be part of the answer, but it still surprises me. For example, the VLT has many instruments with quite different data rates, and we wouldn’t know in advance which would dominate; and it’s not clear that instrument designers always try to max out their data volume – that would be part of the compromises with other expense drivers. There are many different factors. Hence my niggling “bit like the central limit theorem” feeling.
Predicted data rates and decisions on which instrument/detector modes to officially offer to the community (i.e. those that get automatically archived) are all part of the PDR/FDR process when you’re building an instrument for ESO. So the correlation is most definitely planned. They will decline to offer high data rate detector modes if they think there’s only a very narrow scientific justification, for the sake of the archive size.
Interesting blog post!
I did some trend fitting on this a while ago for fun. I found that pixel counts of optical surveys are increasing by about +26% per year, which is even slightly slower than the increase in disk I/O which has been +28% per year since ATA1.
Pixel counts: http://dl.dropbox.com/u/30396109/ccdexp.pdf
Disk I/O: http://dl.dropbox.com/u/30396109/diskspeed.pdf
In contrast, Moore’s law is hanging in at +42% / yr (transistor counts) while I found that UK+US academic network backbones grew at +57% / yr since 1993. I hope to post these graphs on arXiv soon.
I suppose optical pixel counts are growing so slowly because they are limited by telescope aperture, rather than transistor tech?
I’d agree that the increasing diversity and heterogeneity of projects is a different challenge, which does make a strong case for the VO!
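The annual growth rates quoted above are easier to compare as doubling times. A minimal sketch, using only the percentages from the comment (doubling time for compound growth is ln 2 / ln(1 + r)):

```python
import math

# Annual growth rates quoted in the comment above (fraction per year).
rates = {
    "optical survey pixels":     0.26,
    "disk sequential I/O":       0.28,
    "Moore's law (transistors)": 0.42,
    "academic backbones":        0.57,
}

# Doubling time for compound growth: t = ln(2) / ln(1 + r)
for name, r in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name}: doubles every {math.log(2) / math.log(1 + r):.1f} yr")
```

So survey pixels double roughly every three years, while the network backbones double in about a year and a half – the gap compounds quickly.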
Geert – very interesting, thank you. Your graph on disk I/O might seem discrepant with my claim that this hasn’t improved, but in fact we are talking about different things. Max bandwidth from sequential reads has improved, but in most real-world circumstances you are limited by seek time and latency. These have improved by only about a factor of 3 over twenty years, as shown here
That’s a good point! I suppose one could argue that solid state disks will solve latency, but then surely another bottleneck could be named (e.g. single memory bus for N cores?)
Also, max I/O bandwidth increasing at the same speed as the number of survey pixels is not something to be happy about; I want my code to have quick and easy access to all of a survey’s pixels 🙂
It will be interesting to see how ALMA affects this. Those data sets will be so much larger than anything we’ve had that just knowing how to handle the data is a real problem.
…at the moment ALMA appear to be addressing this problem by not releasing the data to the community.
and from what i hear – EVLA have solved it by only retaining 1 second from every 3 seconds of integration.
Sadly, the amount of info (even communication) from ALMA does seem extremely stunted, I agree.
it’s getting beyond a joke – i know data was taken ~3 months ago for a project i’m on, but no sign of the data yet (they’re not complicated observations: just 2-minute continuum maps). plus i was told that i shouldn’t have been told that data had been taken… it’s all a little weird.
Is that what they mean by “data reduction”?
I remember many years ago Jasper Wall pointing out that there was a similarly precise agreement between the time dependence of the price of fish in London and the size of Chinese feet. Coincidences do happen.
It’s a pity you don’t have ESO data prior to 1990, since that’s when the size of disks takes off. This was for a completely different technological reason than the factors driving numbers of transistors, so one might have expected different trends at early times. At later times, it’s plausible that disk capacity is a factor: we didn’t get bigger CCDs, but just mosaiced more of them together, and the size of a mosaic might be driven by whether you think you have space to store the stuff.
Going back to disks, I’m very proud to have been an undergraduate classmate of Stuart Parkin. He was the guy who turned the basic phenomenon of giant magnetoresistance into a practical technology that permitted tiny but sensitive disk heads. There was an “ipod nobel” in 2007 for giant magnetoresistance, and many think Stuart should have shared in it. Having spent my career dabbling harmlessly in astronomy, it’s humbling to see that a fellow student changed the world: imagine if we still thought 1GB was a big disk drive…
John – I already did that price-of-fish gag in the post before last… I have a vague feeling it was a Bertrand Russell thing. What is the “completely different technological reason” you are thinking of?
“Coincidences do happen.”
Indeed. There is a well known correlation between the decrease in old-style pirates and the increase in global mean temperature.
> What is the “completely different technological
> reason” you are thinking of ?
As I said, bigger disk drives were made possible only by Parkin’s development of microscopic materials displaying giant magnetoresistance in order to move away from the previous inductive hard disk heads. I’m not aware there was an analogous single technology revolution driving Moore’s Law in transistor numbers, but giant magnetoresistance was certainly nothing to do with it – which is what I was trying to say. Moore’s law was ripping along in the 1980s when disk capacity was static. Without Parkin, we’d have run into huge storage problems in the early 90s – although maybe then market forces would have brought about the present expansion in flash memory capability sooner?
I think John has a point. ESO instruments were capable of producing higher data rates than could be stored or processed, especially the infrared instruments. For instance, burst mode in the mid-infrared was limited to very short sequences. Either you process data on the fly and forget about saving the raw data, or you limit the data rate to values the network can cope with. Radio astronomy is a case in point, where wide bandwidths and aperture arrays are pushing the data envelope. Data storage and data processing are not the same thing, but the combination has been a major constraining factor for quite some time and, I expect, will continue to be.
This provides some interesting numbers. Note that a) this is just 1 per cent of SKA and b) in the first 6 hours it will have more data than all previous radio astronomy combined. (I am reminded of reading somewhere that all radiation ever collected and used for radio astronomy has less energy than a single snowflake; that might have to be revised in the future.)
I think there’s another feedback loop in operation that helps lead to this – that the observatory / data center can decide what to archive or not. Should we archive engineering data, or just science data? Should we store pipeline reduced data products in the archive, or just raw data? What about for “master calibration” type data? What about intermediate processing products? What about ancillary data such as weather and site monitoring data, or instrument / telescope telemetry, etc etc…
My point being that some of these decisions will be influenced by the cost and availability of the storage space. If disks are cheap and not full, then people will find stuff to put on them. If the disks are full and expensive, people will decide what’s really important to keep.