Google Wants to Store Your Genome
Google is approaching hospitals and universities with a new pitch. Have genomes? Store them with us.
The search giant’s first product for the DNA age is Google Genomics, a cloud computing service that it launched last March but that went mostly unnoticed amid a barrage of high-profile R&D announcements from Google, like one late last month about a far-fetched plan to battle cancer with nanoparticles (see “Can Google Use Nanoparticles to Search for Cancer?”).
Google Genomics could prove more significant than any of these moonshots. Connecting and comparing genomes by the thousands, and soon by the millions, is what’s going to propel medical discoveries for the next decade. The question of who will store the data is already a point of growing competition among Amazon, Google, IBM, and Microsoft.
Google began work on Google Genomics 18 months ago, meeting with scientists and building an interface, or API, that lets them move DNA data into its server farms and do experiments there using the same database technology that indexes the Web and tracks billions of Internet users.
“We saw biologists moving from studying one genome at a time to studying millions,” says David Glazer, the software engineer who led the effort and was previously head of platform engineering for Google+, the social network. “The opportunity is how to apply breakthroughs in data technology to help with this transition.”
Some scientists scoff that genome data remains too complex for Google to help with. But others see a big shift coming. When Atul Butte, a bioinformatics expert at Stanford, heard Google present its plans this year, he remarked that he now understood “how travel agents felt when they saw Expedia.”
The explosion of data is happening as labs adopt new, even faster equipment for decoding DNA. For instance, the Broad Institute in Cambridge, Massachusetts, said that during the month of October it decoded the equivalent of one human genome every 32 minutes. That translated to about 200 terabytes of raw data.
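To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the 32-minute pace held for all 31 days of October, that the 200 terabytes covers exactly that month's output, and that a terabyte is 1,000 gigabytes. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the Broad Institute figures quoted above.
# Assumptions (not from the article): a constant pace all month, the 200 TB
# total covering exactly that month, and decimal units (1 TB = 1,000 GB).

MINUTES_PER_GENOME = 32   # one genome-equivalent every 32 minutes
DAYS_IN_OCTOBER = 31
RAW_DATA_TB = 200         # approximate raw data for the month


def genomes_in_month(days=DAYS_IN_OCTOBER, minutes_per_genome=MINUTES_PER_GENOME):
    """Number of genome-equivalents decoded in a month at a constant pace."""
    return days * 24 * 60 / minutes_per_genome


def gb_per_genome(total_tb=RAW_DATA_TB, genomes=None):
    """Average raw data per genome-equivalent, in gigabytes."""
    if genomes is None:
        genomes = genomes_in_month()
    return total_tb * 1000 / genomes


if __name__ == "__main__":
    print(f"Genome-equivalents in October: {genomes_in_month():.0f}")
    print(f"Raw data per genome-equivalent: {gb_per_genome():.0f} GB")
```

Under those assumptions, the month works out to roughly 1,400 genome-equivalents, or somewhere around 140 gigabytes of raw data apiece.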
This flow of data is smaller than what is routinely handled by large Internet companies (over two months, Broad will produce the equivalent of what gets uploaded to YouTube in one day), but it exceeds anything biologists have dealt with. That’s now prompting a wide effort to store and access data at central locations, often commercial ones. The National Cancer Institute said last month that it would pay $19 million to move copies of the 2.6-petabyte Cancer Genome Atlas into the cloud. Copies of the data, from several thousand cancer patients, will reside both at Google Genomics and in Amazon’s data centers.