Forum Moderators: Robert Charlton & goodroi
...links may be weighted based on how much the documents containing the links are trusted (e.g., government documents can be given high trust). Links may also, or alternatively, be weighted based on how authoritative the documents containing the links are (e.g., authoritative documents may be determined in a manner similar to that described in U.S. Pat. No. 6,285,999) [patft.uspto.gov].
Clearly, Google has two different metrics going on. As you can see from the reference to Larry Page's original patent, authority in Google's terminology comes from backlinks. When lots of other websites link to your website, you become more and more of an authority.
But that isn't to say you've got trust. So what exactly is trust? Here's an interesting section from the same patent:
...search engine 125 may monitor one or a combination of the following factors: (1) the extent to and rate at which advertisements are presented or updated by a given document over time; (2) the quality of the advertisers (e.g., a document whose advertisements refer/link to documents known to search engine 125 over time to have relatively high traffic and trust, such as amazon.com, may be given relatively more weight than those documents whose advertisements refer to low traffic/untrustworthy documents, such as a #*$!ographic site);
So we've got two references here, government documents and high traffic! From other reading, I'm pretty sure that trust calculations work like this - at least in part. Google starts with a hand picked "seed list" of trusted domains. Then trust calculations can be made that flow from those domains through their links.
If a website has a direct link from a trust-seed document, that's the next best situation to being chosen as a seed document. Lots of trust flows from that link.
If a document is two clicks away from a seed document, that's pretty good and a decent amount of trust flows through - and so on. This is the essence of "trustrank" - a concept described in this paper by Stanford University and three Yahoo researchers [ilpubs.stanford.edu].
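To make the "trust flows from seeds through links" idea concrete, here is a minimal sketch of seed-biased propagation in the spirit of the trustrank paper. It is not Google's code: the function name `trustrank`, the toy graph, the `.example` hostnames, and the damping value are all assumptions for illustration.

```python
def trustrank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed nodes along outbound links.

    links: dict mapping each node to the list of nodes it links to.
    seeds: set of hand-picked trusted nodes (hypothetical here).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # All of the initial trust sits on the seed set, split evenly.
    seed_share = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_share)
    for _ in range(iterations):
        # Each round, seeds are topped up; everyone else starts from zero.
        nxt = {n: (1.0 - damping) * seed_share[n] for n in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if not targets:
                continue  # dangling node: its trust is simply not passed on
            # Trust is split evenly across a node's outbound links.
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust


# Toy web: seed -> a -> b, plus a spam page with no trusted inbound links.
web = {
    "seed.example": ["a.example"],
    "a.example": ["b.example"],
    "b.example": [],
    "spam.example": ["b.example"],
}
scores = trustrank(web, seeds={"seed.example"})
```

In this toy graph the score drops with each click away from the seed, and the page with no path from a seed ends up with no trust at all, however many links it hands out.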
This approach to calculating trust has been refined by the original authors to include "negative seeds" - that is, sites that are known to exist for spamming purposes. The measurements are intended to identify artificially inflated PageRank scores. See this pdf document from Stanford: Link Spam Detection [dbpubs.stanford.edu]
To what degree Google follows this exact approach for calculating trust is unknown, but it's a good bet that they share the same basic ideas.
So let's all work to keep these two concepts distinct - trust and authority.
[edited by: tedster at 6:14 am (utc) on Dec. 3, 2008]
Sitelinks are not at all a sign of trust, and a relationship to "authority" is possible, but it's also weak.
Sitelinks are a navigational aid that Google assigns to a domain for some queries if the site's architecture is clear enough for them to do so. If sitelinks are displayed for a generic query, and not just a domain name query, then something about authority has come into play - but still, that's not trust.
Do you think trust is linked to topic, or would you say those are likely to be treated separately in an algorithm? Also, would a link from a high-trust source to a page also contribute to the total trustworthiness of a website? If so, I wonder if that would work based on internal linking (page1 gets a link from a trusted page on another website and links to page2, so page2 also gains trust), or perhaps on URL structure (domain.ext/dir1/page1.html is linked to from a trusted page on another website so all pages within dir1 get a bite of trust, and domain.ext will also get a (smaller) bite).
I've always thought of trust as one of the "query-independent" factors, like PR - that is, it's not related to any topic. So that makes another key differentiating point. Authority is related to a topic, and trust is not.
I'd say that trust is usually a domain-wide factor, rather than being confined to a url. The papers often leave this open ended. The Link Spam Detection paper, for instance, discusses analyzing "nodes", where "nodes may be pages, hosts, or sites..."
Given the wide variations available across the web (think of the public blogging domains, for example) there must be some adjustments made for different kinds of hosting situations.
"nodes may be pages, hosts, or sites..."
A document may include an e-mail, a web site, a file, a combination of files, one or more files...
over the years we've always talked about G being "page focused"
Yes, we have - especially when it comes to the definition of PageRank. However, several Google folk have said that more domain-wide factors have recently been integrated into their algo. The details are pretty much in the area of "secret sauce", but if they say they have done it, I'm sure that they have - and trust would be one obvious area.
Authority is possibly another. I've been trying to outrank one particular page on a subdomain for several years, and the only reason I can see for its dominance are authority factors related to the domain itself, not even the subdomain.
Relevance comes from having backlinks from sites that have similar content and that also have backlinks from relevant sites.
sites that have similar content and that also have backlinks from relevant sites
This borders on another territory that Google ventured into in 2002 - LocalRank. With LocalRank, the preliminary result set is then analyzed by restricting link juice to just those domains within those results. Then a re-ranking is performed over that preliminary set, based on the LocalRank score.
After six years, I'm sure that the straight-up localrank method has been assimilated into other routines, but the essential principle would remain - and be seen in authority calculations.
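As a rough illustration of that re-ranking step, here is a minimal sketch. It only counts inbound links that come from other documents in the preliminary result set and adds that count to the original score. The function name `local_rerank`, the `weight` parameter, and the additive combination are my assumptions; the actual patent's scoring is more involved.

```python
def local_rerank(results, links, weight=0.5):
    """Re-rank a preliminary result set using only links inside that set.

    results: list of (doc, initial_score) pairs.
    links: dict mapping each doc to the set of docs it links to.
    weight: how much each in-set inbound link adds (hypothetical value).
    """
    in_set = {doc for doc, _ in results}
    local = {doc: 0 for doc in in_set}
    for src in in_set:
        for dst in links.get(src, ()):
            # Links from outside the result set never get here;
            # links pointing outside it, and self-links, are skipped.
            if dst in in_set and dst != src:
                local[dst] += 1

    def combined(item):
        doc, score = item
        return score + weight * local[doc]

    return sorted(results, key=combined, reverse=True)


preliminary = [("a", 1.0), ("b", 0.9), ("c", 0.8)]
link_graph = {"a": {"c", "x"}, "b": {"c"}, "x": {"a"}}
reranked = local_rerank(preliminary, link_graph)
```

Here "c" starts last but moves to the top because both other results link to it, while the link from "x" to "a" counts for nothing since "x" is not in the result set.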
What are they a sign of?
I mentioned something about that in the second post - they're a navigational aid for the end user. Google assigns Sitelinks algorithmically for some queries, based on traffic, website architecture, and some other "special sauce" ingredients. Sitelinks began at first for only the strongest domains, and that may have caused some confusion. As they became widespread, it became clear that having Sitelinks for the basic query [domain.com] was not all that special any more.
Now, if you see Sitelinks for a generic keyword query, rather than a full domain "navigational query", then that might show something about authority - but it still isn't about trust.
I'm pretty sure that trust calculations work like this - at least in part. Google starts with a hand picked "seed list" of trusted domains. Then trust calculations can be made that flow from those domains through their links.
I've always thought of trust as one of the "query-independent" factors, like PR - that is, it's not related to any topic. So that makes another key differentiating point. Authority is related to a topic, and trust is not.
(spliced from two posts)
So the challenge is to make one's site seed-worthy. Can we assume that the Google Directory has high trust, being on the google.com domain? If so, every entry in DMOZ has inherited a good deal of trust.
However, when you look at a large corporate website and its links to a subsidiary, I think you'll find that the latter has more seed-worthiness than the DMOZ entry that was blessed by Google's trust. This suggests that trustrank distribution is similar to the ways link juice is diluted as you place more links on a page. This is asserted by the following learned paper:
Propagating Trust and Distrust to Demote Web Spam [ftp.informatik.rwth-aachen.de]
How might seed shortlists be created?
Domain ownership would play a strong part, hence whois data washed against stock exchange market cap records would easily separate the mega corporations from the IPOs. Similar comparisons from academic sources would separate the top shelf .edus from the dubious minor .edu institutions. Ditto for .gov sites.
I'd say that trust is usually a domain-wide factor, rather than being confined to a url. The papers often leave this open ended. The Link Spam Detection paper, for instance, discusses analyzing "nodes", where "nodes may be pages, hosts, or sites..."
The trust score of a page is an indication of how trustworthy the page is on the Web.
Then they introduce the concept of DisTrust and BadRank. :) I don't know if any of them have since moved to Google, so their paper may simply be a paper.
And here's Jon Kleinberg's classic publication, also PDF:
Authoritative Sources in a Hyperlinked Environment [cs.cornell.edu]
Hub: Links out to a lot of authoritative documents
Authority site: Linked to by a lot of good hubs
An "authority site" is what's often used to describe a site with "authoritative information" but that's not the same sense as how it's used in search.
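For anyone who wants to see the hub/authority relationship in action, here is a minimal sketch of the mutual-reinforcement iteration from Kleinberg's paper. The function name `hits`, the page names, and the iteration count are assumptions for illustration, and the query-specific subgraph construction that the paper describes is left out.

```python
def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        # Normalise so the scores do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


graph = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub_scores, auth_scores = hits(graph)
```

In this toy graph "x" comes out as the strongest authority, and "h1" and "h2" are better hubs than "h3" because they point at both authorities.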
Now, if you see Sitelinks for a generic keyword query, rather than a full domain "navigational query", then that might show something about authority - but it still isn't about trust.
[edited by: Marcia at 4:30 am (utc) on Sep. 29, 2008]
How about pages found to be selling links and losing the ability to pass PR? Wouldn't that be related to a negative trust factor being applied to the sites selling links, at least for the links in question.
But the point is, does that affect nullifying the benefit of just those links in particular, or all the links on the pages selling them, including internal links and editorial links not being paid for?
In other words, would that affect all the links on the page, or just those in certain segments of the page (visual page segmentation)? Wouldn't this type of PR metric be related to trust - or losing trust? And to what extent for the page or the site overall?
[edited by: Marcia at 11:06 am (utc) on Sep. 29, 2008]
I've seen mentions of trust factor as regards pages (urls) and domains - but what about servers, or hosts?
I am thinking of, for example, those companies that offer "insta pages" that are just one level above a parked domain, with generic RSS article feeds and maybe a picture or two - wouldn't you think that after a few thousand of those are generated with the same content over and over, the entire network would somehow gain a negative trust factor, regardless of domains?
How about pages found to be selling links and losing the ability to pass PR? Wouldn't that be related to a negative trust factor being applied to the sites selling links, at least for the links in question.
I'd say yes. Notice that the quote that I put in the first post talks about monitoring ads on the page to determine trust?
But the point is, does that affect nullifying the benefit of just those links in particular, or all the links on the pages selling them, including internal links and editorial links not being paid for?
One of the things Google did to whack paid links was to hit the PageRank. We also know that back in January, Google changed "something" about the PageRank formula. Do you think PR now includes a trust component?
what about servers, or hosts?
I've got a strong feeling that servers and hosts can be wrapped into the formula for trust, but on a case by case basis - and this may also have a manual review component if a host is flagged by the algo as looking dicey. The Yahoo paper on Link Spam Detection that I mentioned above talks about nodes, where "nodes may be pages, hosts, or sites..." Of course, that might mean "hostnames" as in subdomains - and there is no technical definition for "page", which is a very fuzzy concept.
Has anyone ever defined such a list of [seed] documents?
I can only wish. Many people assume that DMOZ is one, and .gov domains are another. The quote from the patent mentions Amazon, so that would be a third, and I'd guess that most of the .int domains are a fourth area for trust seeds. I'm sure the list is a lot bigger than the few I just ticked off.
I can't imagine how we might reverse engineer that list, since Google's trust metrics are not published anywhere, and are not likely to be. So we can only guess, and since even major newspapers have been caught up in the link-selling witch hunt, some of my previous assumptions about trusted sites are pretty much out the window.
Many people assume that DMOZ is one, and .gov domains are another. The quote from the patent mentions Amazon, so that would be a third, and I'd guess that most of the .int domains are a fourth area for trust seeds
In the past, I've tested this theory.
I've gotten links from dmoz, amazon, "supertrustedsite.com", etc. and haven't seen any appreciable improvements to the only thing that matters (to me) in this discussion (rankings).
There are some fringe benefits, but those don't pertain to this discussion.
Basically, I think it's overblown.
You can be "untrusted" which is bad, but being "trusted" doesn't seem to play a major role.
"Authority" is a whole 'nother discussion with lots of ranking benefits.
I've got a strong feeling that servers and hosts can be wrapped into the formula for trust, but on a case by case basis - and this may also have a manual review component if a host is flagged by the algo as looking dicey.
I basically agree - authority matters much more for rankings than trust. One case where trust might be a benefit is if someone starts a Google-bowling type of campaign against your domain. Then having a high trust level may help you hold on to your rankings.
Has anyone ever defined such a list of [seed] documents?
I can only wish. Many people assume that DMOZ is one, and .gov domains are another. The quote from the patent mentions Amazon, so that would be a third, and I'd guess that most of the .int domains are a fourth area for trust seeds. I'm sure the list is a lot bigger than the few I just ticked off.
While there would be some hand-selected seed documents, it would take a month of Sundays to compile seed documents covering all possible topics for all trusted domains.
I believe that there has to be a second level of algo-selected shortlisted sites that need a quick human confirmation, but are considered to be seed quality until the human review. For example, .edus used to have blanket trust but now all folders (for example) beginning with a tilde might be assigned less trust. This is why I suspect that trust needs to be at a folder or page level and not for a domain.
Tedster, if you were lucky enough to write Google Trust code, what would you write?
To the issue of trust and hosting, I can't help thinking Google uses the old anti-spam email method used years ago where blocks of foreign IP addresses were flagged as untrustworthy.
IF site A appears to be spam AND it is hosted on servers previously identified as being used to spam THEN lower trust rank.
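Written out as code, that rule might look something like the sketch below. This is pure guesswork about how such a check could be expressed: the function name `adjust_trust`, the `penalty` multiplier, and the example IP addresses (taken from the ranges reserved for documentation) are all hypothetical.

```python
def adjust_trust(trust, looks_like_spam, host_ip, flagged_hosts, penalty=0.5):
    """Lower trust only when BOTH conditions from the rule hold.

    trust: the site's current trust value.
    looks_like_spam: result of some separate spam check (not shown here).
    host_ip: address the site is served from.
    flagged_hosts: set of addresses previously tied to spam.
    penalty: multiplier applied when the rule fires (hypothetical value).
    """
    if looks_like_spam and host_ip in flagged_hosts:
        return trust * penalty
    return trust


flagged = {"203.0.113.7"}
print(adjust_trust(1.0, True, "203.0.113.7", flagged))   # rule fires
print(adjust_trust(1.0, True, "198.51.100.1", flagged))  # clean host, no change
```

Note the AND: a clean-looking site on a flagged server keeps its trust, and so does a spammy-looking site on a server with no history.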
Other hosting trust flags could be frequent server changes; frequently slow-loading home page/other pages; hosting for excessively interlinked sites.
How many other hosting trust issues could there be?
p/g
P.S. Lately I've been wondering if there's a positive flip side to hosting where hosts with good records can actually help your trust and/or ranking (a little). Which seems reasonable.
If I were writing the trust algorithm (fat chance!) I would also start with seed domains, and review that list at least monthly. Then I'd go for a continually updated trust value, something like PageRank.
I'd make sure that small amounts of trust get deducted for various infractions, especially for dicey outbound links. Not for 404's (hey, link rot happens) but for links to neighborhoods with low/no trust values. After the issue is fixed I'd also restore that lost trust in increments and not all at once. Repeated problems would result in a longer term loss of some trust.
Those deductions would happen no matter what value the initial calculation from the seed domains showed. I'm not sure whether those deductions should cascade outward to the legitimate external links - probably not, but I'd need to study the data first before making that decision.
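The deduct-then-restore idea can be sketched as a small ledger kept per site. Everything here is hypothetical: the class name `TrustLedger`, the deduction and recovery amounts, and the choice to slow recovery in proportion to the number of past infractions are just one way to model "repeated problems would result in a longer term loss".

```python
from dataclasses import dataclass


@dataclass
class TrustLedger:
    base: float           # value from the seed-based calculation
    penalty: float = 0.0  # trust currently withheld
    strikes: int = 0      # number of infractions recorded so far

    def infraction(self, amount=0.05):
        """Deduct a small amount of trust, whatever the base value is."""
        self.strikes += 1
        self.penalty += amount

    def recover(self, step=0.02):
        """Give back a slice of withheld trust; repeat offenders recover more slowly."""
        if self.penalty > 0:
            self.penalty = max(0.0, self.penalty - step / max(1, self.strikes))

    @property
    def trust(self):
        return max(0.0, self.base - self.penalty)


site = TrustLedger(base=0.8)
site.infraction()  # dicey outbound link found
site.recover()     # issue fixed: part of the trust comes back
site.recover()     # the rest comes back over later review cycles
```

Each call to `recover` stands for one review cycle after the problem is fixed, so trust comes back in increments rather than all at once.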
I'd also consider advertising links on the site as part of the trust calculation - even if they were rel="nofollow". Google's "historical data" patent already mentions this.
A more challenging issue would then be how to integrate trust into the overall ranking algorithm. Some urls just need to be in some results because the end users will expect them. So trust deductions would still not affect "Company Name" searches much if at all, whereas they certainly should affect "big keyword" searches, no matter how well known the enterprise.
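One very simple way to picture that last point is to let the query type decide whether trust touches the score at all. This is only a sketch of the idea, not a claim about how any search engine does it: the function name `final_score`, the `trust_weight` parameter, and the assumption that trust sits between 0 and 1 are all mine.

```python
def final_score(relevance, trust, is_navigational, trust_weight=0.5):
    """Blend trust into a ranking score, except for navigational queries.

    relevance: score from the rest of the algorithm.
    trust: site trust, assumed to be between 0 and 1.
    is_navigational: True for "Company Name" style searches.
    trust_weight: how strongly trust scales the score (hypothetical value).
    """
    if is_navigational:
        # Users expect the site itself, so trust deductions are ignored.
        return relevance
    # For generic keyword searches, low trust scales the score down.
    return relevance * (1.0 - trust_weight + trust_weight * trust)


low_trust = 0.2
company_search = final_score(1.0, low_trust, is_navigational=True)
keyword_search = final_score(1.0, low_trust, is_navigational=False)
```

With these made-up numbers the low-trust site keeps its full score for its own name but loses a good chunk of it on a generic keyword search.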