Amazonbot again

Amazonbot - block or allow?

         

jmccormac

10:33 pm on Dec 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even though Amazonbot is supposed to do wonderful things for Alexa and Amazon, it seems to have some issues with respecting a 403 result. There was an uptick in its activity on my site recently. Since December 12th, there have been 1,959,461 requests. (The site has hundreds of millions of webpages.) The hostnames check out. An e-mail to the e-mail address on the help page went unanswered. There was a switch to a 403 block prior to this and that didn't seem to work. The next step was an IP level block.
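
A hostname check of this kind is usually done as forward-confirmed reverse DNS: look up the PTR record for the IP, confirm the name sits under the crawler's domain, then resolve that name forward and confirm it maps back to the same IP. A minimal Python sketch follows. The domain suffix is an assumption that should be checked against Amazon's current help page, and the function name and the injectable `reverse`/`forward` parameters are hypothetical, added only so the logic can be exercised without live DNS.

```python
import socket

# Assumption: suffix for genuine Amazonbot hosts; verify against Amazon's docs.
ASSUMED_SUFFIX = ".crawl.amazonbot.amazon"


def is_verified_crawler(ip, suffix=ASSUMED_SUFFIX,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    try:
        hostname = reverse(ip)[0]            # PTR lookup: IP -> hostname
    except OSError:                          # covers socket.herror / gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(suffix):
        return False                         # name is outside the crawler's domain
    try:
        addresses = forward(hostname)[2]     # A lookup: hostname -> list of IPs
    except OSError:
        return False
    return ip in addresses                   # forward lookup must confirm the IP
```

A request that carries the Amazonbot user-agent but fails this check is most likely an impersonator, which matters when deciding whether the traffic is really Amazon's.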

Has anyone else had problems with this bot or is it just a vanity project for Amazon that will not drive any traffic to websites?

Regards...jmcc

lucy24

5:39 pm on Dec 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you mean that it’s been eating a steady diet of 403s without altering its behavior? Frankly that’s typical of any robot, good or bad; once they’ve got their shopping list, nothing will make them stop. But the fact that it’s getting 403s means it has been ignoring your robots.txt Disallow, which by itself should be reason to block any robot.

:: detour to htaccess to confirm that I’ve poked three holes (two for headers, one for bad_range), which means that at some time in the past they passed the robots.txt test ::

:: further detour to logs in case there has been a change in behavior ::

fwiw, it looks as if they decided to become robots.txt compliant somewhere in 2023. (Logs suggest I removed the Disallow in September, after many months of nothing but robots.txt requests.)

Uhm. That doesn’t actually answer your question, does it.

jmccormac

8:20 pm on Dec 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



403s before and then blocked at IP level since November 12th. Deep-sixing was the more effective option. It was the scale of the 403s that pointed to very bad design. Just checked the latest block count and it is 2,230,227. The IP block means no content and it is still hammering away. Taking time to become robots.txt compliant may have seemed like a good thing for the people who designed it. The problem for them is that there is no quid pro quo for downloading lots of content from websites. It might be worth unblocking it and trying a robots.txt test.

Regards...jmcc

lucy24

5:23 am on Dec 20, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, the point of robots.txt is that the only thing better than a blocked request is a request that isn’t made at all. So even if you have every reason to believe that a given entity will be non-compliant, you still put up the Disallow. Then, when you sit in your securely deadbolted office, you’re not even momentarily distracted by people rattling the doorknob, because they’ve seen and assimilated the No Admittance sign.
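
As an illustration of the "No Admittance sign", here is a small Python sketch that parses a sample robots.txt with the standard library and asks what a compliant crawler would conclude. The rules and the example.com URL are made up for the example; they are not anyone's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Amazonbot, allow everyone else.
ROBOTS_TXT = """\
User-agent: Amazonbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a compliant crawler would decide before making any request.
amazonbot_allowed = parser.can_fetch("Amazonbot", "https://example.com/some/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/some/page")
print(amazonbot_allowed, other_allowed)
```

This only models a crawler that chooses to obey the file; nothing here stops a non-compliant one, which is why the 403 or IP block stays in place behind it.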

:: further detour to raw logs because I forgot to check one thing ::

Any given Amazonbot visit can range from two or three requests (starting with robots.txt) to several hundred, including images. But they’ve never requested a script or stylesheet, where “never” = calendar years 2023-2024. This detail by itself probably conveys significant information if you know how to parse it. (I don’t.)

jmccormac

10:34 am on Dec 30, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Removed the IP blocks and the 403 this morning. Amazonbot grabbed robots.txt and will, according to its webpage, act on the robots.txt within 24 hours. So far, no more activity from Amazonbot.

Regards...jmcc

tangor

11:17 pm on Dec 30, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt can be handy --- IF the players are being compliant!

How often does that happen?

jmccormac

8:06 am on Dec 31, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Great when it works. A nightmare when it is ignored. It was the scale of the spidering and the number of hosts that made Amazonbot unusual. The website runs on an old Athlon II server but it was more than able to handle the load given that it is database backed and mainly read-only. I've seen some reports of websites being overwhelmed and it is often due to limited server resources (shared hosting) combined with unoptimised SQL and database schema. With scalability, one of the big bottlenecks is the number of queries per second that the server can handle and another is the number of queries needed to generate a page. Amazonbot could have been the perfect storm for some of the affected sites as the problems wouldn't show up until a website is put under load. Crawl rate limiting might be something that the Amazonbot developers should consider.
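
For reference, one common way to implement crawl rate limiting is a token bucket: requests spend tokens, and tokens refill at a fixed rate up to a burst cap. The Python sketch below is a generic illustration, not a description of how Amazonbot or any real crawler works; the class name, parameters, and injectable clock are all hypothetical.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1            # spend one token for this request
            return True
        return False                    # caller should wait or reject
```

A crawler would keep one bucket per target host and sleep when `allow()` returns False; a server could keep one per client IP and answer with a 429 instead.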

Regards...jmcc

lucy24

5:56 pm on Dec 31, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know if Amazonbot honors the Crawl-Delay directive--G*** doesn’t--but like everything else in robots.txt, it can’t hurt. Mine’s set to three seconds, i.e.
Crawl-Delay: 3
though I’ve never bothered to check if anyone, anywhere honors it. (Come to think of it, the w3c link checker doesn’t either, but they do space requests 1 second apart.)
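
For anyone curious how a crawler that does honor the directive might read it, Python's standard library parser exposes the value directly. This is a sketch with a made-up user-agent name; Crawl-Delay is not part of the core robots.txt standard, so whether any particular bot acts on it is up to that bot.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Same directive as above, applied to all user-agents.
parser.parse(["User-agent: *", "Crawl-Delay: 3", "Disallow:"])

# "ExampleBot" is a hypothetical crawler name used only for the lookup.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

A polite client would then sleep for `delay` seconds between requests; nothing on the server side enforces it.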

jmccormac

7:23 am on Jan 2, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Crawl-Delay seems to have been just a Microsoft/Bing option. Bing tends to be a bit strange in its crawling behaviour.

Regards...jmcc