Forum Moderators: Robert Charlton & goodroi
Google simply doesn't want to let me go there with the classic message that can be found in a dozen threads [google.com] here over the past few years (including one from a few months ago that I cannot reply to)
Here's a generic example:
[google.com...]
We're sorry...... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.
We'll restore your access as quickly as possible, so try again soon.
...
Of course I am not infected, and I've tried this from several computers, several operating systems and different IP addresses, logged in to or out of an account. I've tried varying the number of results per page and the start position. No luck.
The only variation is that if I vary which of their services I use (for PDA, etc.), I will sometimes get one of those captcha verification boxes and other times no verification to proceed. Page 2 will get the captcha; page 3 is refused.
There are 800 results total, I really would like to at least see page three - any ideas?
I can do a similar request on yahoo and can see the results without issue but it doesn't have all the hits that google has.
Do you think going through one of the API services with a developer key might do it? I don't know much about that, but if it might work I'd look up how to do it...
I managed to get a similar query through Yahoo to sorta get what I was looking for but Yahoo has some serious bugs in their negative/exclude (NOT) feature. Yahoo results are a mess compared to Google's even in 2007, amazing.
ps. sorry, I'm not giving out the query because the more people that do it, the more it looks like a virus to Google. It does have the word "forums" in it so I suspect that maybe spammers use Google to get lists of forums? But that can't be the only trigger...
I recently came across a HACKED Apache installation that contained a small program written in PHP. From what I was able to understand, the program queried G, Y, M, and Gigablast with combinations of the operators inurl:, intitle: and intext:.
There was a log of what was going on, and it contained over 7500 different variations of queries performed: guestbooks, forums, comment forms. Clever app, I have to say; it even contained the names of the files they were looking for.
For a while I thought that using a dial-up account was the path around the error message, but recently I banged into it on dial-up too. And I have sometimes seen the message even on site: operator queries for new sites - I really doubt scrapers would care about those searches.
[edited by: amznVibe at 7:05 pm (utc) on Sep. 1, 2007]
Let's be frank here, we ain't stupid!
PS: Try other engines to see if it works, but MSN has many funny searches figured out and Yahoo blocks you if you are too persistent.
800 results .. lemme guess! a guestbook .. ;)
If you are searching your own site you usually use only site: and browse based on structure. But site: and inurl: together have virtually no use for site owners. If you structured the site properly you will find what you need with site:! And inurl: won't help you a lot, as filenames usually have different names.
E.g.: http://domain.com/level1/level2/title-of-content-page
You can use site: on this to browse by levels, so inurl: can be replaced by site:. I'm sure you don't want to find how many pages on your site have a given word in the URL. And I'm sure it's not more than 30?
PS:I suggest finding a footprint in the guestbooks or forums or blogs and use that for search. That's the way to go!
And www (non-www) problem can be easily fixed in .htaccess.
If you do that search, you must be checking whether or not you have the problem, not which pages have it.
If one page has the problem, all of them do, as relative links will make the whole site crawlable.
And fixing non www problems is done in .htaccess (sitewide) so finding all pages with problems is futile.
I rest my case ... no legitimate use for them together, and even if there would be going over page 3 ... I don't think so.
no legitimate use for them together
Sure there is - have you ever tried to audit a major website with 6 or 7 figures worth of URLs indexed? Especially when the business has not done a good job of managing its legacy URL structure?
I understand that Google does need to guard against automated queries, but when this error message comes up on my second or third click, I really do wonder. And I don't have any malware making hidden queries either - I have at times monitored every packet my box is sending out in these situations.
So they are cool for result count but not to check each page of results.
PS: Google limits results to 1k. I'd pay to see you check a 6-7 figure website page by page chasing URLs.
come on. I work with this stuff ;)
...
Ohkay...
Well, I have this strange urge trying to tell apart the types of people on this forum ( webmasters, SEOs, spammers, MFA'ers, coders, designers, moms and pops and the combos of either ), and have to thank you for this revealing thread, it sure is funny. Your lecturing of g1smd ( playing the part of 'forum/guestbook spammer - B' ) on .htaccess and canonical issues was a blast.
You could probably tell me who *I* am.
And in the meantime 'reveal' some additional high profile ( albeit from an SEO standpoint somewhat uneasy ) practices on how to spam worthless low quality pages in bulk, I mean... only a few tricks you would accidentally know of because of your coder background. I'd be really grateful because the IBL *count* on my sites is way too low.
...
off topic, again: Google once gave me this error when I was so lazy I didn't even type in our in-house rank checker's URL, and kept on clicking from SERP to SERP too fast, trying to find -950 pages that were scattered throughout the results. The site had - surprise surprise - relevancy problems, its inbound links used an abbreviation, and the site used the full phrase. I was able to click so fast, and at such a steady pace, I seemed like a robot.
Dunno who would have taken the day off after this. *cough* I switched to a browser w/o the toolbar ( yes I have it installed ), and finished the remaining stuff at light speed. Btw site is out of -950 for everything it needs to be out for.
I am simply looking at how many forums are out there with a certain very specific subject. It's an unusual query but not an illicit one. I wasn't aware spammers cared about specifics and wouldn't they just use yahoo/msn instead?
ps. ahem and I am also not "my man" - there are women on webmasterworld
[edited by: amznVibe at 3:34 pm (utc) on Sep. 2, 2007]
If you read the stuff I wrote a couple of years ago about sites with parameters in URLs, you will see what sort of things I have been dealing with.
In particular, forums like vBulletin can expose up to 20 URL formats for each thread and a similar number for the thread listings. These have the parameters in different orders and/or extra parameters on some of them.
This is a major Duplicate Content issue that I have written about several times in the last few years. The only way to track down every rogue format is to search for them. In many cases these are problems that cannot be entirely fixed using .htaccess but instead require some editing of the scripts. A quick and dirty way is to block some formats using robots.txt but blocking the wrong one may stop bots indexing the site at all.
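To illustrate the robots.txt risk, here is a rough Python sketch of plain prefix-style Disallow matching. The helper `is_blocked`, the sample paths and the rule lists are all made up for illustration, and real crawlers may support wildcards and other extensions that this does not model.

```python
def is_blocked(path_and_query, disallow_prefixes):
    """Return True if any non-empty Disallow prefix matches the start of the URL."""
    return any(
        path_and_query.startswith(prefix)
        for prefix in disallow_prefixes
        if prefix
    )

# Hypothetical formats: the one we want indexed, and some rogue variants.
WANTED = "/forum/forumdisplay.php?f=33&page=59"
ROGUE = [
    "/forum/forumdisplay.php?page=59&f=33",
    "/forum/forumdisplay.php?order=desc&f=33&page=59",
]
# Starts with the same "f=" as the wanted format, so no prefix separates them.
STILL_LEAKS = "/forum/forumdisplay.php?f=33&order=desc&page=59"

# Assumed rule sets: one too broad, one targeting specific leading parameters.
too_broad = ["/forum/forumdisplay.php"]
narrow = ["/forum/forumdisplay.php?page=", "/forum/forumdisplay.php?order="]

print("broad rule blocks wanted format:", is_blocked(WANTED, too_broad))
print("narrow rules block wanted format:", is_blocked(WANTED, narrow))
print("narrow rules block rogue formats:", [is_blocked(u, narrow) for u in ROGUE])
print("narrow rules block trailing-parameter variant:", is_blocked(STILL_LEAKS, narrow))
```

Under these assumptions the broad rule takes out the wanted format along with everything else, while the narrow rules cannot catch a variant that differs only after the leading parameter. That is the kind of case where editing the scripts is the only complete fix.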
Which would you block or modify?
www.domain.com/forum/forumdisplay.php?f=33&page=59
www.domain.com/forum/forumdisplay.php?page=59&f=33
www.domain.com/forum/forumdisplay.php?f=33&page=59&order=desc
www.domain.com/forum/forumdisplay.php?f=33&order=desc&page=59
www.domain.com/forum/forumdisplay.php?order=desc&f=33&page=59
www.domain.com/forum/forumdisplay.php?order=desc&page=59&f=33
and another 20 to 30 formats variously having additional "daysprune", "do", "pp", "session", "sort" and other such parameters...
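One way to see how many of these formats collapse to the same page is to normalise each URL and group the results. The Python sketch below does that for the six listed URLs; treating only "f" and "page" as the parameters that identify a listing page is an assumption made for this example, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption for this sketch: only these parameters identify the page,
# and this is the order we want them in.
ESSENTIAL_PARAMS = ("f", "page")

def canonical_form(url):
    """Return host + path + essential parameters in a fixed order."""
    # The URLs in the thread have no scheme, so add "//" to let urlsplit find the host.
    parts = urlsplit(url if "://" in url else "//" + url)
    params = dict(parse_qsl(parts.query))
    kept = [(name, params[name]) for name in ESSENTIAL_PARAMS if name in params]
    return f"{parts.netloc}{parts.path}?{urlencode(kept)}"

def group_duplicates(urls):
    """Map each canonical form to the list of URL variants that reduce to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_form(url)].append(url)
    return dict(groups)

urls = [
    "www.domain.com/forum/forumdisplay.php?f=33&page=59",
    "www.domain.com/forum/forumdisplay.php?page=59&f=33",
    "www.domain.com/forum/forumdisplay.php?f=33&page=59&order=desc",
    "www.domain.com/forum/forumdisplay.php?f=33&order=desc&page=59",
    "www.domain.com/forum/forumdisplay.php?order=desc&f=33&page=59",
    "www.domain.com/forum/forumdisplay.php?order=desc&page=59&f=33",
]

groups = group_duplicates(urls)
for canonical, variants in groups.items():
    print(canonical, "<-", len(variants), "variants")
```

Whether a parameter such as "order" can really be dropped depends on the forum software and on which version of the page you want indexed, so the essential list needs to be chosen per site.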