Google's synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein's theories about how words are defined by context.
Sometime in 2001, Singhal learned of poor results when people typed the name "audrey fino" into the search box. Google kept returning Italian sites praising Audrey Hepburn... "We realized that this is actually a person's name," Singhal says. "But we didn't have the smarts in the system."
...he had to master the black art of "bi-gram breakage" — that is, separating multiple words into discrete units. For instance, "new york" represents two words that go together (a bi-gram). But so would the three words in "new york times," which clearly indicate a different kind of search. And everything changes when the query is "new york times square."
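To make the "new york times square" example concrete, here is a minimal Python sketch of query segmentation. It is not Google's method, which the article doesn't describe. The phrase list `KNOWN_PHRASES` and both scoring rules are assumptions invented for illustration. It shows why a naive left-to-right longest match breaks the query in the wrong place, while looking at every possible split finds the reading most people would intend.

```python
# Hypothetical phrase list (an assumption for this sketch); a real system
# would derive phrases and their weights from query and document data.
KNOWN_PHRASES = {
    ("new", "york"),
    ("new", "york", "times"),
    ("times", "square"),
}


def greedy_segment(words):
    """Left-to-right longest match: take the longest known phrase at each step."""
    segments = []
    i = 0
    while i < len(words):
        for size in range(len(words) - i, 0, -1):
            head = tuple(words[i:i + size])
            # A single word is always an acceptable fallback segment.
            if size == 1 or head in KNOWN_PHRASES:
                segments.append(" ".join(head))
                i += size
                break
    return segments


def segmentations(words):
    """Yield every split of words into single words or known phrases."""
    if not words:
        yield []
        return
    for size in range(1, len(words) + 1):
        head = tuple(words[:size])
        if size == 1 or head in KNOWN_PHRASES:
            for rest in segmentations(words[size:]):
                yield [" ".join(head)] + rest


def best_segment(words):
    """Toy heuristic: prefer the split that leaves the fewest stray single words."""
    def stray_words(segments):
        return sum(1 for s in segments if " " not in s)

    return min(segmentations(words), key=lambda s: (stray_words(s), len(s)))


query = "new york times square".split()
print(greedy_segment(query))  # ['new york times', 'square']
print(best_segment(query))    # ['new york', 'times square']
```

The greedy version grabs "new york times" first and strands "square", which is the kind of breakage the quote is describing. The exhaustive version is only workable here because queries are short; the tie-breaking rule is a stand-in for whatever statistical evidence a production system would actually use.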
[edited by: tedster at 11:58 pm (utc) on Feb 23, 2010]
All of this led me to expect that the number of indexed pages would be greatly expanded after Caffeine was implemented.
Personally, I would think they would want more data to compute from, with fewer but more accurate pages returned in the results (SERPs).
Today it's estimated that [a single Google query] travels across 700-1000 machines, a figure that has nearly doubled since 2006 perhaps due in part to the introduction of Google Universal.
[blogoscoped.com...]
The opportunities for a big disconnect between what Google intends to happen and what really happens in a brand new infrastructure with a rewritten file system would be extreme.
Here are a few more references:
1. Our short discussion The Google Search Query - a technical look [webmasterworld.com]
2. The new domain Google began using in Q4 of 2009 = 1e100.net [webmasterworld.com]