

'pr0n' site hyperlinks brought in by crawler googlebot


grobbit

4:18 pm on Apr 12, 2020 (gmt 0)

5+ Year Member



Hi. I have a Linux-based website. For a few months, 'pr0n' site hyperlinks have been appearing in the logs. I know these links are not on the website because I uploaded a backup of the site that had not been altered for over a year, and I have searched all the pages and found nothing.

The hyperlinks always appear with the Googlebot crawler. It appears that some spammer is able to submit his links to Googlebot so that it requests them from the website. Why, I do not know, but obviously there must be a good enough reason to make 15 or 20 'pr0n' links appear in the site logs daily.

An example...
crawl-66-249-70-50.googlebot.com - - [12/Apr/2020:07:09:35 +0100] "GET /cgi-bin/axs/ax.pl?https://example.prn/beautiful/12345-breen-daniels-in-closeup-vibrations.php?url=search-alias%3Dstripwidgets&field-keywords=Gone+Title%3A+Three+Narratives+of+Some-Blue+Widgets HTTP/1.0" 302 367 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The words after "stripwidgets&field-keywords=" are the title of a book recommended on the site, which used to link to Amazon (it no longer does, as I changed it, but that hasn't stopped the spam link from using it).

The "GET /cgi-bin/axs/ax.pl?" refers to an axs hit script which the site has been using for over a decade - this is how I was first alerted to the 'pr0n' links.

These links only appear when the Googlebot crawler crawls the site (crawl-66-249-70-50.googlebot.com).

As I said, I have downloaded the site and checked it, and also uploaded an old version of the site which existed long before this activity started happening, so these are not hyperlinks on the web site itself.

Can anybody please throw some light onto this problem? Surely it must be well known within the webmaster fraternity?

Many thanks,

Tony.

[edited by: not2easy at 5:12 pm (utc) on Apr 12, 2020]

not2easy

6:12 pm on Apr 12, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



HTTP/1.0" 302 367
indicates that the visit is not really a Googlebot. "302" means it was temporarily redirected to your page, "367" indicates that the 302 was a tiny .php redirect and HTTP/1.0 confirms the php redirect method. Googlebot does not crawl using a "HTTP/1.0" protocol. Look at your logs.
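For anyone following along with their own logs, here is a minimal sketch of splitting such a line into its fields. It assumes the log is in the Apache "combined" format, which the example line appears to follow; the names `COMBINED_LOG` and `parse_line` are made up for this illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "method target protocol" status size "referer" "agent"
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# The example line from the first post, split across string literals for width.
SAMPLE = (
    'crawl-66-249-70-50.googlebot.com - - [12/Apr/2020:07:09:35 +0100] '
    '"GET /cgi-bin/axs/ax.pl?https://example.prn/beautiful/'
    '12345-breen-daniels-in-closeup-vibrations.php?url=search-alias%3Dstripwidgets'
    '&field-keywords=Gone+Title%3A+Three+Narratives+of+Some-Blue+Widgets HTTP/1.0" '
    '302 367 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 '
    '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

entry = parse_line(SAMPLE)
print(entry["host"], entry["protocol"], entry["status"], entry["size"])
```

In that format the number after the quoted request is the status code the server sent back, and the number after it is the size of the response in bytes. Filtering parsed entries on `status`, `protocol` and `target` makes it easier to pull out every request of this kind for comparison.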

"How" is easy, "why" is less certain but I have seen these in the past when someone was trying to pass off a non-existing URL for an ad campaign. I took care of that by disallowing the Google Adsbot in robots.txt:
User-agent: Adsbot-Google
Disallow: /

and the confusion went away. That was long ago so the ad campaign here might be another service. Check your logs and try a few searches with those terms.
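If you want to check what a rule like that does before uploading it, here is a small sketch using Python's standard `urllib.robotparser`. The path `/any/page.html` is just a placeholder, and this only shows how Python's parser reads the rule; it is not a guarantee of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above.
ROBOTS_TXT = """\
User-agent: Adsbot-Google
Disallow: /
"""


def build_parser(text):
    """Load robots.txt text into a parser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The ads crawler is blocked everywhere; other agents are unaffected by this rule.
print(parser.can_fetch("AdsBot-Google", "/any/page.html"))
print(parser.can_fetch("Googlebot", "/any/page.html"))
```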

I found it by carefully examining my raw access logs and then trying to visit the 'referer' both with and without a Googlebot UA. Checking the headers showed me it was a .php redirect.

lammert

6:51 pm on Apr 12, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



crawl-66-249-70-50.googlebot.com is a real reverse PTR name of Google bot on IP address 66.249.70.50, but the reverse PTR may be spoofed by the crawler. You may want to switch off domain name lookups for your webserver. This will tell you the real IP address of the bot and if it originates from the Google IP ranges or from some other range. It will also speed up writing the log file.

grobbit

9:51 am on Apr 13, 2020 (gmt 0)

5+ Year Member



Checking the headers showed me it was a .php redirect


I'm not sure what is meant by a php redirect. Does that mean a crawler pretending to be Googlebot was redirected to my site, or does it mean Googlebot, or a crawler pretending to be Googlebot, came to my site and was redirected to a pr0n site?

Is it possible that this is a way that makes Google think the pr0n hyperlinks are referred to by other sites which makes them more 'popular' and therefore gives them higher search rankings?

[edited by: grobbit at 9:59 am (utc) on Apr 13, 2020]

grobbit

9:56 am on Apr 13, 2020 (gmt 0)

5+ Year Member



crawl-66-249-70-50.googlebot.com is a real reverse PTR name of Google bot on IP address 66.249.70.50, but the reverse PTR may be spoofed by the crawler. You may want to switch off domain name lookups for your webserver. This will tell you the real IP address of the bot and if it originates from the Google IP ranges or from some other range.


Apologies if I'm a bit slow on this. I do not understand. Are you saying that Googlebot is a genuine crawler but somehow gets spoofed to be redirected somewhere else? I have asked my host if they can switch off domain name lookups and await a reply. If I can discover an IP range that is not Googlebot's IP range, then it would be possible for me to block that IP range. Is that right?

lammert

10:39 am on Apr 13, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What I was saying is that any IP address can claim they are crawl-66-249-70-50.googlebot.com. But only the IP address itself will reveal if they are the real Googlebot or a fake. If the bot claims it has the above name, but the IP address is not 66.249.70.50, then it is a fake and you can block it. Note that genuine Googlebots may come from several IP addresses, so be sure to check thoroughly before blocking an IP range. We have a whole forum here [webmasterworld.com] dedicated to identifying IP ranges of genuine and malicious bots.
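One way to do that check in a script is a reverse lookup followed by a forward lookup on the name it returns. The sketch below is only an illustration: `is_verified_googlebot` is a made-up name, the list of accepted domain suffixes is an assumption based on Google's published guidance, and the two lookup functions can be swapped out so the logic can be tried without touching DNS.

```python
import socket

# Assumed suffixes for genuine Googlebot host names; adjust if Google's guidance changes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip maps to a Google host name AND that name maps back to ip."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda name: socket.gethostbyname_ex(name)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, or the lookup failed

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # the name is not under a Google domain

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # Whoever controls an IP's reverse zone can put any name in the PTR record,
    # so the name only counts if it resolves back to the same address.
    return ip in addresses


if __name__ == "__main__":
    # Live DNS check of the address from the log line; the result depends on current DNS.
    print(is_verified_googlebot("66.249.70.50"))
```

Run it against the raw IP addresses in the log once host name lookups are switched off. Anything that fails the check but sends a Googlebot user agent is a candidate for blocking.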

not2easy

12:07 pm on Apr 13, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is it possible that this is a way that makes Google think the pr0n hyperlinks are referred to by other sites which makes them more 'popular' and therefore gives them higher search rankings?
Yes, in the situation I saw, someone was running an ad campaign from a completely empty domain and using that tactic to make it appear that the landing page was on my domain. It does not need to be an ad campaign, though. It would create the appearance that their content is temporarily found on your domain. That is probably why you see the HTTP/1.0 and the 302 server response.