Forum Moderators: Robert Charlton & goodroi
As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come. Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.
"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."
Reported at The Register
[theregister.co.uk...]
There's an earlier Register piece from Wednesday [theregister.co.uk] that goes more in-depth about GFS2 (Google Filesystem 2). The following quote is from Page 2:
[The original GFS approach of a] single master can handle only a limited number of files. The master node stores the metadata describing the files spread across the chunkservers, and that metadata can't be any larger than the master's memory. In other words, there's a finite number of files a master can accommodate. With its new file system - GFS II? - Google is working to solve both problems. Quinlan and crew are moving to a system that uses not only distributed slaves but distributed masters. And the slaves will store much smaller files. The chunks will go from 64MB down to 1MB.
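To put the quoted chunk sizes in perspective, here is a rough back-of-the-envelope sketch of why smaller chunks push toward distributed masters. The per-chunk metadata size (about 64 bytes, the order of magnitude given in the original GFS paper) and the 1 PiB of stored data are assumptions for illustration only; neither figure comes from the Register piece, and the 64MB and 1MB sizes are treated as binary megabytes.

```python
# Back-of-the-envelope: how chunk size affects the metadata a master must hold.
# ASSUMPTIONS (not from the article): ~64 bytes of master metadata per chunk,
# and a hypothetical 1 PiB of stored data.

MIB = 1024 ** 2
GIB = 1024 ** 3
PIB = 1024 ** 5

BYTES_PER_CHUNK_METADATA = 64  # assumed order of magnitude
STORED_BYTES = 1 * PIB         # hypothetical cluster size


def chunk_count(stored_bytes: int, chunk_size: int) -> int:
    """Number of chunks needed to hold stored_bytes, rounding up."""
    return -(-stored_bytes // chunk_size)


def metadata_bytes(stored_bytes: int, chunk_size: int) -> int:
    """Total master metadata under the assumed per-chunk cost."""
    return chunk_count(stored_bytes, chunk_size) * BYTES_PER_CHUNK_METADATA


for size_mib in (64, 1):
    chunks = chunk_count(STORED_BYTES, size_mib * MIB)
    meta_gib = metadata_bytes(STORED_BYTES, size_mib * MIB) / GIB
    print(f"{size_mib:>2} MiB chunks: {chunks:,} chunks, ~{meta_gib:.0f} GiB of metadata")
```

Under these assumptions, dropping from 64MB to 1MB chunks multiplies the chunk count, and therefore the metadata, by 64. That is hard to fit in a single master's memory, which is consistent with the move to distributed masters described in the quote.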
Matt Cutts is the man who oversees the destruction of spam on the world's most popular search - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts posted a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine. "The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."
On another thread [webmasterworld.com...] there was some discussion about the similarities of "old g" and "new g". I think this answers that question.
Caffeine is about the search index. But GFS2 is designed specifically for applications like Gmail and YouTube, applications that - unlike an indexing system - are served up directly to the end user. Such apps require ultra-low latency, and that's not something the original GFS was designed for.
Wave being one of those ultra-low latency applications too presumably.
I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.
If memory is limiting the number of chunks that masters index then you either need to massively increase memory (at a huge cost) or you need to introduce near-memory speed directly connected storage with ultra low latency - which is the definition of Enterprise PCIe Solid State Drives (which are huge and cost efficient for the tasks they are used for).
Having tested such devices I'm of the opinion that they are game changers, in that they will completely change the way people code, and if I'm right that Google master servers are using them then it may already be coming to pass.
It's scary how fast storage systems will be in 3 years (what is Enterprise now will be commodity by then) - it's time to prepare for that transition now - other people already are.
Google have the cash (and talented people) to do all sorts of amazing things - and the storage hardware is catching up with their ambitions. Data processing on a massive scale relies on overcoming many potential weak points. The largest issue used to be the limitations of mechanical hard drives (not a problem when spooling large files but a big issue when you try to get latency down). That issue has been solved, and the processing power, network infrastructure, distributed computing algorithms and massive datacentre build-outs are all in place... game on.
just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original
I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.
Good thinking. I'd say that you were pretty much bang on the money with that, as it all ties up quite nicely:
[informationweek.com...]
I was thinking the same when I read this again. Tweaked SSDs should be far superior to anything that's spinning. In fact, I would think this is an economical solution for Google despite the cost of SSDs.
Solid State Drives
I agree that SSDs are going to change computing, especially web servers, in a huge way.
However I would be willing to bet that Google's "large quantities of inexpensive machines" philosophy means they have not yet made the switch to SSD.
No doubt they will start plugging them in when prices come down some more, but for the moment I suspect their HDs are still spinning.
Intel gains SSD orders from Google [digitimes.com]
[edited by: tedster at 7:02 pm (utc) on Aug. 16, 2009]
It would be easy to assume that Google are using SATA Intel SSDs like the X25 M or E - but (as the article speculates) it's almost certainly a bespoke Marvell PCIe controller with Intel sourced flash chips (given Marvell's June 2008 press release on their PCIe SSD controller). It'll be interesting to see if they stick with them or consider other PCIe vendors.
It's interesting that the well known aversion to expensive hardware has been sidestepped by Google here - the benefits of Enterprise SSDs are too big to ignore.
Google - Caffeine
Search Phrase-----------------SERP Pos.------URL------------# of Results
3 word phrase 1 (singular)-----84----------Product Page------2,690,000
3 word phrase 1 (plural)-------3-----------Home Page---------3,760,000
4 word phrase 1 (singular)-----1-----------Home Page---------2,420,000
4 word phrase 1 (plural)-------3-----------Home Page---------3,140,000
All searches were done within a 5-minute window.
The same 3 word and 4 word phrases were used on both the current Google search and the Google Caffeine search.
Note the varying # of Results, SERP Positioning and Landing Page
The disparity between the # of Results and SERP Positioning becomes even greater when using some of our "seasonal" search phrases.
Anyone else seeing this, even though they claim that it doesn't affect SERPs?
Thanks for the data on this. The changing total number of results is particularly interesting, I think, because it is more reflective of an infrastructure change. We do need to remember, however, that the number reported is only a rough estimate. Still, there may be some clues in there.
Google - Now
Search Phrase----------SERP Pos.----# of Results
Phrase 1 --------------->100---------35,300,000
Phrase 2 --------------->200----------1,310,000
Phrase 3 -----------------86------------990,000
Phrase 4 ------------------5------------598,000
Phrase 5 -----------------84----------1,130,000
Phrase 6 ----------------161----------1,060,000
Phrase 7 ------------------5------------566,000
Phrase 8 ----------------159---------34,800,000
Google - Caffeine
Phrase 1 -----------------19----------8,080,000
Phrase 2 -----------------11----------9,970,000
Phrase 3 -----------------11----------1,350,000
Phrase 4 ------------------3----------1,130,000
Phrase 5 -----------------22----------1,340,000
Phrase 6 -----------------21---------20,100,000
Phrase 7 ------------------2----------1,030,000
Phrase 8 -----------------15----------8,060,000
10 minutes differential on these searches.
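For anyone who wants to compare the two sets of numbers above side by side, here is a small sketch that computes the position change and the ratio of result estimates per phrase. The figures are copied from the two tables. Reading ">100" and ">200" as "not found within the first 100 or 200 results" is an assumption, so those positions are stored as lower bounds.

```python
# (position, estimated number of results) per phrase, copied from the tables above.
# ASSUMPTION: ">100" / ">200" mean the page was not found within the first
# 100 / 200 results, so 100 and 200 are lower bounds on the real position.
now = {
    "Phrase 1": (100, 35_300_000),
    "Phrase 2": (200, 1_310_000),
    "Phrase 3": (86, 990_000),
    "Phrase 4": (5, 598_000),
    "Phrase 5": (84, 1_130_000),
    "Phrase 6": (161, 1_060_000),
    "Phrase 7": (5, 566_000),
    "Phrase 8": (159, 34_800_000),
}
caffeine = {
    "Phrase 1": (19, 8_080_000),
    "Phrase 2": (11, 9_970_000),
    "Phrase 3": (11, 1_350_000),
    "Phrase 4": (3, 1_130_000),
    "Phrase 5": (22, 1_340_000),
    "Phrase 6": (21, 20_100_000),
    "Phrase 7": (2, 1_030_000),
    "Phrase 8": (15, 8_060_000),
}


def compare(now_data, caffeine_data):
    """Return (phrase, places gained, result-estimate ratio) for each phrase."""
    rows = []
    for phrase, (now_pos, now_total) in now_data.items():
        caf_pos, caf_total = caffeine_data[phrase]
        rows.append((phrase, now_pos - caf_pos, caf_total / now_total))
    return rows


rows = compare(now, caffeine)
for phrase, gain, ratio in rows:
    print(f"{phrase}: up at least {gain} places, result estimate x{ratio:.2f}")
```

Every phrase ranks higher on Caffeine in this sample, while the result estimates move in both directions. Since the reported totals are only rough estimates, the ratios are best read as hints rather than measurements.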
Think of the old algo like a scale with one fulcrum... now with the new architecture they have many more fulcrums for fine tuning of results based on their analysis of user motive and all of the new integrated search verticals - video, news (results based on time relevancy), profiles etc...
I also think that G is very concerned about market share and competition. They have been sitting on the Caffeine update for a bit - waiting for a bump like Bing... As soon as the market share started changing - out came the announcement.
I wonder if they have really been working on Caffeine vs. the regular algo - thus all of the crap that is G's SERPs right now... less relevant results than ever IMO.