User-agent: Scrapy

         

rb77

1:50 pm on Aug 6, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



I’ve just looked at my robots.txt file and for some reason it has the following in there.

User-agent: Scrapy
Allow: /

What the heck is that?

Everything I’ve found so far has been about web scrapers. So I’ve no idea how it got in there.

not2easy

4:04 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Who has the access and ability to edit and upload the robots.txt file? Have you ever seen a UA named Scrapy in your access logs? What is the file date of your robots.txt file that shows Scrapy is allowed?

rb77

4:36 pm on Aug 6, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



The only other people who had access to my site were Flywheel hosting whilst they migrated this site to their servers.

Nope, can’t see a UA named Scrapy

How do I see the file dates?

not2easy

4:51 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You should be able to see file dates on your computer when you open the folder the files are in. If the dates aren't showing, look through the folder's view settings first; it would be very difficult to keep things up to date with no way to know the date of a file. I do not know what OS or device you are using, so I can't give more detailed suggestions. You can also see the date in most FTP clients and in your control panel's file manager.
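
If you have a copy of the file on disk, or shell access to the server, the modification date can also be read with a few lines of Python. This is only a sketch: the path `robots.txt` is a placeholder, and a copy downloaded over FTP may carry the download time rather than the server's timestamp, depending on the client.

```python
import os
import time


def file_modified(path):
    """Return the file's last-modified time as a UTC string."""
    mtime = os.stat(path).st_mtime  # seconds since the epoch
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(mtime))


if __name__ == "__main__":
    # Placeholder path: point this at wherever your robots.txt lives.
    path = "robots.txt"
    if os.path.exists(path):
        print(path, "last modified (UTC):", file_modified(path))
    else:
        print(path, "not found")
```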

lucy24

5:34 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scrapy must have been much more active in years past: I have it denied in robots.txt and flagged as bad_agent, but I just checked and have only seen it four times--from three different IPs--this entire calendar year. It's one of those UA names that you just know is up to no good :)

:: detour to site listed in UA string ::
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
For “need”, read “want”.

:: deeper dive into archived logs ::

Ah. There was an absolute blizzard of activity in March 2021, requesting every page in sight (not literally true, so there must have been a pattern), and to a lesser extent in 2018. I don't know why I deny them in robots.txt when they have never once asked.
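
For anyone who wants to run the same kind of check on their own logs, here is a rough sketch. It assumes the Apache/nginx "combined" log format; the sample lines, IPs, and version numbers are made up for illustration, and the regex will need adjusting for other formats.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern otherwise.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<mon>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [14/Mar/2021:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Scrapy/2.4.1 (+https://scrapy.org)"',
    '203.0.113.9 - - [15/Mar/2021:11:30:00 +0000] "GET /b HTTP/1.1" 200 734 "-" "Scrapy/2.4.1 (+https://scrapy.org)"',
    '198.51.100.7 - - [15/Mar/2021:11:31:00 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0"',
]


def scrapy_hits(lines, needle="scrapy"):
    """Count matching requests per (year, month) and collect their IPs."""
    per_month = Counter()
    ips = set()
    for line in lines:
        m = LINE.match(line)
        if m and needle in m.group("ua").lower():
            per_month[(m.group("year"), m.group("mon"))] += 1
            ips.add(m.group("ip"))
    return per_month, ips


if __name__ == "__main__":
    counts, ips = scrapy_hits(SAMPLE)
    print(dict(counts), sorted(ips))
```

To scan a real log, pass an open file object instead of `SAMPLE`.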

Conclusion: There's no reason not to Disallow them in robots.txt, but don't expect it to make any difference. For that, you have to block them by User-Agent name on the server.
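
Blocking by name is normally done in the web server's own configuration. Purely to illustrate the idea, here is a minimal Python sketch as a WSGI wrapper; `demo_app` and the `BLOCKED` list are hypothetical, and the check only catches clients that send the name honestly.

```python
BLOCKED = ("scrapy",)  # lowercase substrings to refuse; illustrative list


def block_by_user_agent(app, blocked=BLOCKED):
    """Wrap a WSGI app so requests from matching User-Agents get a 403."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in ua for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper


def demo_app(environ, start_response):
    """Stand-in for the real site: always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


app = block_by_user_agent(demo_app)
```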

Edit: But what does this have to do with Google SEO?

rb77

6:01 pm on Aug 6, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



Edit: But what does this have to do with Google SEO?

I’ve no idea to be honest, maybe nothing, but this was written into my robots.txt file:

User-agent: Scrapy
Allow: /

But I don’t know when that was, yet.

I’m assuming it’s in there so someone can copy everything I’ve written?

Or is there another reason for it?

not2easy

6:06 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I would take it out. The suggestions were to try to assist you in finding out how it got there. The date of the file could help.
Edit: But what does this have to do with Google SEO?
I think lucy24's question was to tell me to wake up and move the thread to where it belongs. That has been done now. ;)

rb77

6:35 pm on Aug 6, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yep, appreciated. I just needed it to be seen by some experienced folks asap.

I took it out straight away.

The only use I can see for it is to scrape my content.

phranque

7:48 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I just needed it to be seen by some experienced folks asap.

please post your thread in the relevant subforum.
as you can see here, it makes less work for our volunteer moderators.
thank you!

phranque

8:00 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



User-agent: Scrapy
Allow: /

it is default behavior to allow a bot, so these directives would only make sense if they were there to poke a hole in a less specific directive, such as a blanket Disallow under User-agent: *.

if i had to guess, someone within your organization was using a Scrapy-based tool in the past and had to poke a hole in an existing bot-excluding directive.
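
The "poke a hole" effect can be checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the blanket `User-agent: *` block below is hypothetical (the original file's other rules are not shown in this thread), and the URL and bot names are examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a blanket block plus a hole for Scrapy.
WITH_BLOCK = """\
User-agent: *
Disallow: /

User-agent: Scrapy
Allow: /
"""

# The two lines from the thread on their own.
ALONE = """\
User-agent: Scrapy
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-page"
    for ua in ("Scrapy/2.6.2 (+https://scrapy.org)", "SomeOtherBot/1.0"):
        print(ua, "with block:", allowed(WITH_BLOCK, ua, url),
              "| alone:", allowed(ALONE, ua, url))
```

With the blanket block in place, only the Scrapy agent is let through. With the two lines alone, every agent is allowed anyway, which is why the directive looks redundant by itself.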

lucy24

9:50 pm on Aug 6, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



using a Scrapy-based tool in the past and had to poke a hole in an existing bot-excluding directive
Further poring over archived logs--now that we're in SSID--suggests that users can readily modify the script to make Scrapy disregard robots.txt. It comes from a random selection of IPs, presumably the IP of whoever is running the script. (I didn't check whether they are server ranges, human ranges or both.) Sometimes it’s compliant but more often not.

fwiw, the earliest appearance I find is in 2014. Logs go back a few years before that, so the robot may really have been born that year.
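
The mixed compliance seen in the logs fits how Scrapy is configured: robots.txt handling and the User-Agent string are both ordinary project settings that the operator controls. A sketch of a hypothetical project's `settings.py` (the values are illustrative, not taken from any real crawler):

```python
# settings.py of a hypothetical Scrapy project (values are illustrative).

# When True, Scrapy fetches robots.txt and skips disallowed URLs;
# when False, robots.txt is not consulted at all.
ROBOTSTXT_OBEY = False

# The User-Agent header is also a plain setting, so the operator can
# replace the default "Scrapy/VERSION (+https://scrapy.org)" with anything.
USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot)"
```

A crawler configured like this would neither honor a Disallow nor show up under the name Scrapy, so both robots.txt rules and name-based blocks only affect operators who leave the defaults visible.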

rb77

10:39 am on Aug 7, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



“if i had to guess, someone within your organization was using a Scrapy-based tool in the past and had to poke a hole in an existing bot-excluding directive”.

And here’s the thing. Apart from my dev, I’m the only person who has access to the WP admin.

The only other time anyone has had access is when Flywheel migrated my site.