Forum Moderators: Robert Charlton & goodroi
We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers. Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try. One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.
It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers. We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly.
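For a sense of scale, here is a rough back-of-the-envelope calculation using only the figures quoted above. It assumes decimal units (1TB = 10^12 bytes, which is what 10 billion 100-byte records works out to); the function and variable names are made up for illustration and say nothing about how Google measured its runs.

```python
# Back-of-the-envelope throughput from the figures quoted in the announcement.
# Assumption: decimal units, so 10 billion 100-byte records = 10**12 bytes.

RECORD_BYTES = 100


def throughput(records, seconds, machines):
    """Return (total bytes, aggregate bytes/s, bytes/s per machine)."""
    total = records * RECORD_BYTES
    aggregate = total / seconds
    return total, aggregate, aggregate / machines


# 1TB run: 10 billion records, 68 seconds, 1,000 computers.
tb_total, tb_agg, tb_per = throughput(10 * 10**9, 68, 1_000)

# 1PB run: 10 trillion records, 6 hours 2 minutes, 4,000 computers.
pb_total, pb_agg, pb_per = throughput(10 * 10**12, 6 * 3600 + 2 * 60, 4_000)

print(f"1TB run: {tb_agg / 1e9:.1f} GB/s overall, {tb_per / 1e6:.1f} MB/s per machine")
print(f"1PB run: {pb_agg / 1e9:.1f} GB/s overall, {pb_per / 1e6:.1f} MB/s per machine")
```

That comes out to roughly 14.7 MB/s of sorted output per machine for the 1TB run and about 11.5 MB/s per machine for the 1PB run. These are end-to-end averages over the whole job, not disk or network speeds.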
Just sorting that amount of data is quite something. As they indicate, once you've sorted it, where do you put that information? Apparently, on 48,000 hard drives.
Now, I can appreciate this is not the sort of thing you or I could do with an off-the-shelf PC (even if I turn off Vista's CPU-hogging sidebar and close ALL my programs!). What if I happen to acquire 1PB of data and need it sorted? Can I hire Google to do it?
And can I please get the sorted results back as an Excel file?
This story is like nerd erotica
In the thread you posted, you asked: "Any ideas on when it will apply to the data pushes? Do you think the data pushes will happen more often, more updates?"
Google's article says:
By pushing the boundaries of these types of programs, we learn about the limitations of current technologies as well as the lessons useful in designing next generation computing platforms. This, in turn, should help everyone have faster access to higher-quality information.
So I'm guessing that they are planning on using their learning in-house, to the degree that it is practical. This time trial is apparently an enhancement of their MapReduce [labs.google.com] program, which they've been using in some form or another since 2004.
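For anyone wondering how a sort gets spread across thousands of machines at all, here is a toy sketch of the general idea usually described for MapReduce-style sorting: split the keys into ranges, send each record to the bucket for its range, sort every bucket independently, then read the buckets back in order. This is only an illustration under my own assumptions, not Google's code; `partitioned_sort` and its parameters are hypothetical names, and the "machines" here are just Python lists.

```python
import bisect
import random


def partitioned_sort(records, num_partitions, sample_size=100, seed=0):
    """Toy range-partitioned sort (hypothetical helper, single process)."""
    if not records:
        return []

    # Choose splitter keys from a random sample so buckets are roughly even.
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    splitters = [sample[int(i * step)] for i in range(1, num_partitions)]

    # "Map" phase: route each record to the bucket covering its key range.
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[bisect.bisect_right(splitters, rec)].append(rec)

    # "Reduce" phase: each bucket is sorted on its own (in a real cluster,
    # on its own machine); concatenating them in order gives a global sort.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result


data = [random.Random(42).randrange(10**6) for _ in range(1)]  # placeholder
data = random.Random(42).sample(range(10**6), 5000)
assert partitioned_sort(data, 8) == sorted(data)
```

The point of the range split is that no bucket ever needs to see another bucket's records, which is what lets the work be done in parallel.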
To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.
Ok "<specific programming language error code here>" is not exactly a good measuring stick but it seems like it's been worse before.
I just wish they could put this computing power to use to stop the bloodshed and suffering in the world, rather than just running down their search results for the sake of it. :(
I think there's plenty of practical value involved in speeding up computing processes - such as the ability to fold more complex intelligence into the algo. More complexity requires more computing, and a lot of Google's innovation in recent times has come from just such advances. You want better duplicate handling? It takes faster computing cycles to make that happen. Same thing with catching and dumping spam.
Considering I'm still getting 'Dreamweaver Extension' marketing emails weekly, I can't expect the argument of usability for designers can be much of a stretch > mom and pop can use software that outputs pages as binary. Voila, huge amounts of bytes saved in crunching.
Someone, who is more geeked out, please explain why this isn't a pressing issue like electric cars. So much electricity, hardware, etc. to be spared, why is WWW not in binary?
Happy research: Internet History [freesoft.org]; The WWW Project [w3.org]
Note that this sorting record that Google set did use "100-byte records" - and that's getting pretty small, but yes, it's still more than just one bit.
I feel that the search market, as it is now, could easily be turned on its head in a heartbeat and Google, Yahoo and MSN could be relegated to being good conversational topics but nothing more.
How?
Some enthusiastic college student will develop a new browser that incorporates what PEOPLE really want, not what pleases investors and advertisers, and people will flock to that. (sorry Google, Chrome isn't it). This same enthusiastic kid will also realize that since everyone wants HIS/HER browser more than any other that the major search companies should pay HIM/HER handsomely for the privilege of being the default search engine.
That possibility has to give top executives nightmares. I'm sure each company has a vault of money ready to throw at such a kid (assuming they can't influence their way into the project in advance)... and I can't wait to see it happen. It will take timing, perhaps a significant advance in computer technology timed perfectly with the new browser, but it's very possible and perhaps even probable.