Jim Gray from Microsoft's Research Labs gave a very interesting talk at eBay Park in San Jose on ongoing work to bring scientific data online.
What follows are my notes, as I typed them up during the talk:
This talk is about Informatics: how do we take information from related disciplines and make it available to other disciplines? Funding agencies are forcing scientific literature into the public domain. Lots of scientific data is online, but it is not easy to access.
Thousands of years ago, science was empirical, and only in the last few hundred years has it become theoretical. In the last few decades it has become computational, and today, it is about data exploration (eScience), or informatics.
Today, scientists have been able to collect an incredible amount of empirical data. There is a data avalanche. Astronomers can generate terabytes of data overnight. Radiologists have more pixels to work with than they can examine with the human eye.
Historically, computational science has been about simulation, but now it is about analyzing the massive amounts of data that scientists can collect. Computer scientists can help with visualization, data mining, and more. One of the things the computer science community can do is to help with tools that make this easier.
Data access is hitting a wall. The current practice is to dump data via FTP. This doesn't scale when you hit the multi-gigabyte and terabyte realm. You begin to need indexes and better organization. To handle this scale, we need algorithms that avoid quadratic behavior, which usually means they give approximate answers instead of exact ones.
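To make the point about quadratic behavior concrete, here is a minimal sketch of my own (not something shown in the talk). It compares an exact all-pairs computation, which is O(n^2), with an estimate built from randomly sampled pairs. The function names and the choice of "mean distance between pairs" as the statistic are assumptions made purely for illustration.

```python
import random


def exact_mean_pair_distance(values):
    """Average |a - b| over every distinct pair: O(n^2) work."""
    n = len(values)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(values[i] - values[j])
    return total / (n * (n - 1) / 2)


def approx_mean_pair_distance(values, samples, seed=0):
    """Estimate the same average from randomly sampled pairs: O(samples) work."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a, b = rng.sample(values, 2)  # two entries from distinct positions
        total += abs(a - b)
    return total / samples


# Hypothetical data: 500 uniform random measurements.
gen = random.Random(1)
data = [gen.uniform(0.0, 1.0) for _ in range(500)]

exact = exact_mean_pair_distance(data)
approx = approx_mean_pair_distance(data, samples=20000)
```

The sampled version does the same amount of work no matter how large the dataset gets, at the price of a small statistical error. That is the trade the parenthetical above is pointing at.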
Most computer scientists don't use databases, so what do we need to do to make databases more useful for the scientific community? One answer is doing statistical analysis within the database instead of outside it.
Every scientific discipline is growing an informatics branch. This is not a small change.
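As a rough illustration of analysis inside the database (my own sketch, not from the talk), the snippet below uses Python's built-in sqlite3 module. The table name, column names, and sample values are invented for the example. The aggregate runs inside the database engine, so only the small summary comes back to the application.

```python
import sqlite3


def mean_magnitude_by_band(rows):
    """Compute per-band count and mean inside SQLite instead of in Python."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE observations (band TEXT, magnitude REAL)")
    conn.executemany("INSERT INTO observations VALUES (?, ?)", rows)
    # The GROUP BY aggregate is evaluated by the database engine;
    # only the summary rows cross back into the application.
    result = conn.execute(
        "SELECT band, COUNT(*), AVG(magnitude) "
        "FROM observations GROUP BY band ORDER BY band"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data: (filter band, measured magnitude).
sample = [("g", 18.0), ("g", 19.0), ("r", 17.5)]
summary = mean_magnitude_by_band(sample)
```

With a few rows this makes no practical difference, but when the table holds terabytes, shipping three summary rows instead of every observation is the whole point.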
Today there is a wall between developing scientific literature and the data on which it was based.
I [Jim] think that this change is going to happen. Data is going to come online. Computer science and physics literature is already online (arXiv.org). With other disciplines this is not the case (example of finding an article through PubMed but having to pay $50 to read it).
About 2 years ago, Congress enacted a law saying that anyone who does research with public dollars should publish it in the public domain. Only 4% comply. Joe Lieberman is sponsoring a bill called Taxpayer Access to [Public Research?]. This is going to happen eventually.
So how does the "New Library" work? Who pays for it? The big question is getting it into the archive. Running the website is not the problem. Curation is not the problem.
6,000 line XSD file for scientific articles. Very complex - we [Microsoft, presumably] are fixing Microsoft Word so that it can be used to write these documents.
Archive information gets stale unless it is accessed, so access is critical. I think advertising can fund access.
Cost of administering storage is 10X to 100X the cost of the hardware. LOCKSS (Intel and HP lab project) is about keeping multiple copies of data securely in lots of different locations.
National Library of Medicine - well run government organization. Runs PubMed. Data of PubMed is federated through web services. Drop a document in one instance, and everyone else gets it. NCBI is doing 99% of the work.
How does the publication flow work? How does it encompass "gray literature" like conference proceedings? Journals could work by being aggregators of published papers. "Increasing their page rank."
Why not a wiki? They are great, but a peer review system is different.
Why am I telling you this? "Library Science" is largely dead in the academic world, but what I am talking about is basically library science. Library science needs to be reborn for the digital age.
How can people explain data problems? Science these days is not reproducible. NASA has developed a way to organize their data. Primary data (level 0) is the raw science data, and on top of that you need to record all of the metadata (programs, algorithms).
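One way to picture "raw data plus the metadata on top of it" is a small record that fingerprints the level 0 data and lists every program applied to it. This is my own hypothetical sketch, not NASA's actual scheme. The class names, fields, and sample values are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProcessingStep:
    program: str  # name of the program that was run
    version: str  # version of that program
    parameters: dict = field(default_factory=dict)


@dataclass
class DataProduct:
    level0_sha256: str  # fingerprint of the raw (level 0) data
    steps: list = field(default_factory=list)

    def derive(self, step):
        """Return a new product that records one more processing step."""
        return DataProduct(self.level0_sha256, self.steps + [step])


def from_raw(raw_bytes):
    """Create a level 0 product identified by the hash of its raw bytes."""
    return DataProduct(hashlib.sha256(raw_bytes).hexdigest())


# Hypothetical usage: one calibration pass over some raw detector data.
raw = from_raw(b"raw detector counts")
calibrated = raw.derive(
    ProcessingStep("calibrate", "1.0", {"dark_frame": "d042"})
)
```

Because every derived product carries the hash of the raw data and the full list of steps, someone else could in principle rerun the same programs on the same input and check the result.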
My vision is a federated model of storing all the level 0 data and all the metadata on top of it throughout the world [the Pubmed model].
Science is figuring out how to objectify knowledge. What is an atom? What is a star? What is a molecule?
Best example of developing a scientific ontology - Entrez-GenBank.
I tell the scientific community that if you can give us a class, we can take it and publish it and build a web service out of it.
Closing: most data is, or could be, online. Astronomy data is a great case - it is useless, it has no commercial value, so there aren't issues with sharing. SkyServer.SDSS.org - website to allow high school kids to play with the data from one of the world's best telescopes. 150 hours of online astronomy. 10% kids and 90% astronomers. SkyQuery.net - federates astronomy data from 15 archives.
