SemrushBot

         

keyplyr

9:40 pm on Dec 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; SemrushBot/1.1~bl; +http://www.semrush.com/bot.html)
Protocol: HTTP/1.1
Robots.txt: No
Host: advancedhosters.com
46.229.160.0 - 46.229.175.255
46.229.160.0/20

Data collection for marketing products

More advancedhosters.com:
88.208.16.0/21
88.208.16.0 - 88.208.23.255
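
For anyone wanting to test an address against the ranges listed above, here is a minimal sketch using Python's standard `ipaddress` module. The names `BLOCKED_NETWORKS`, `in_blocked_range` and `range_bounds` are made up for illustration; only the two CIDR blocks come from this post.

```python
import ipaddress

# The two CIDR blocks quoted in the post above.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("46.229.160.0/20"),
    ipaddress.ip_network("88.208.16.0/21"),
]


def in_blocked_range(ip):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def range_bounds(cidr):
    """Return the first and last address of a CIDR block as strings."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1])
```

`range_bounds` is a quick way to confirm that a dotted range and its CIDR form describe the same block before putting either into a filter.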

Archived thread: [webmasterworld.com...]

lucy24

6:50 pm on Dec 4, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt: No

Actually, Yes--but you may not have noticed (or may have deliberately chosen to count it as a No) because it typically requests robots.txt a couple of days before a crawl, and then works off the same copy for quite a while. But eventually it does comply. It must, or I wouldn't have poked a (header-based) hole for it.

A more striking feature of this robot is its ongoing dimwittedness. It constantly asks for things like
/filename.html
when the true form is
/directory/filename.html
as if it can't understand relative links. It also habitually asks for pages at Site A that have never existed there, but only came into being after moving to Site B (to which most of Site A now redirects, file by file).
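
To illustrate the difference, here is a small sketch of correct relative-link resolution next to one naive shortcut that would produce requests like the ones described. The host `example.com` and both function names are hypothetical, and the naive version is only a guess at how such requests could arise, not a statement about how the bot actually works.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


def naive_resolve(page_url, href):
    """Hypothetical mistake: resolve every link against the site root."""
    return urljoin(page_url, "/" + href.lstrip("/"))


page = "http://example.com/directory/index.html"  # hypothetical page
correct = resolve_link(page, "filename.html")
wrong = naive_resolve(page, "filename.html")
```

Here `correct` keeps the `/directory/` prefix, while `wrong` drops it and asks for `/filename.html` at the root.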

With all this, it's surprising that it does seem to know how to pick its name out of a list, as in
User-Agent: something
User-Agent: SemrushBot
User-Agent: something-else
Disallow: blahblah

(Aside: To date, I've only met one robot that doesn't understand this construction, but does become compliant when given a "Disallow" block of its own.)
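
As a rough way to check how a grouped record like the one above is read, here is a sketch using Python's standard `urllib.robotparser`. The `/private/` path stands in for the "blahblah" placeholder and `is_allowed` is a made-up helper; this shows how one parser treats the construction, not how SemrushBot itself parses it.

```python
from urllib.robotparser import RobotFileParser

# Several User-Agent lines sharing one Disallow rule.
# "/private/" is a hypothetical stand-in for the real path.
ROBOTS_TXT = """\
User-Agent: something
User-Agent: SemrushBot
User-Agent: something-else
Disallow: /private/
"""


def is_allowed(agent_token, path, robots_txt=ROBOTS_TXT):
    """Return True if the named agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the short product token, not the full Mozilla/5.0 string:
    # the parser only looks at the text before the first "/".
    return parser.can_fetch(agent_token, path)
```

With this record, every agent named in the group is shut out of `/private/`, while an agent that is not listed is unaffected because there is no `*` record.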

keyplyr

11:26 pm on Dec 4, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This bot has no benefit for my interests, so it does not get through the filters, but it would be logged if it ever requested robots.txt.

The only reason I might not notice would be if it were stealthy in some manner. I do get a few robots.txt requests from browser spoofs every day, and I freely allow this file to be accessible to any/all.

However, if the UA/IP goes on to get caught in a filter, its history is logged, so I can see whether it requested robots.txt. That only works if the request came from the same UA/IP. An agent could come in under a spoofed UA, get robots.txt, then come back later under a different UA/IP, attempt a crawl and get caught, and I would not know it had come by earlier with different credentials to get robots.txt. Nor would I care (in most cases). Whew!

gesh

4:00 am on Dec 8, 2016 (gmt 0)

10+ Year Member



The bot does request robots.txt, but the request comes from a different IP in the same range. It uses the same user agent when fetching robots.txt and other files.
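
Since the robots.txt fetch and the crawl can come from different addresses, matching on exact IP will miss it. Here is a minimal sketch that matches on network range plus user-agent token instead. The `Hit` record, the `fetched_robots` helper and the log layout are all assumptions for illustration; adapt them to whatever your own logs contain.

```python
import ipaddress
from collections import namedtuple

# Hypothetical, already-parsed access-log record.
Hit = namedtuple("Hit", "ip user_agent path")


def fetched_robots(hits, network, agent_token):
    """Return True if any hit from the given network, carrying the
    given user-agent token, requested /robots.txt."""
    net = ipaddress.ip_network(network)
    return any(
        hit.path == "/robots.txt"
        and agent_token in hit.user_agent
        and ipaddress.ip_address(hit.ip) in net
        for hit in hits
    )
```

Calling `fetched_robots(hits, "46.229.160.0/20", "SemrushBot")` would then report a robots.txt request from anywhere in that range, whichever address later does the crawling.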

keyplyr

7:13 am on Dec 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good to know about requesting robots.txt. Thanks for the input. If this agent comes around my properties again, I'll give it extra attention.

Requesting robots.txt, and even more so following its directives, is fundamental to the rights of webmasters.

<added>

While I didn't find this UA's request for robots.txt, the disallow is there (updated 3 weeks ago). So if this agent did in fact request it (even from a different IP address), robots.txt was not obeyed, and the block stays up.