GPTBot

OpenAI web crawler

         

ClosedForLunch

6:58 pm on Aug 6, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Just had over 1000 hits from this bot, hitting individual pages. As it happens my site automatically served a 403 for each hit because the bot is not in my whitelist, nor did it pass the 'human' test.

User agent :

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

From this microsoft IP :

40.83.2.64/28

robots.txt disallow :

User-agent: GPTBot
Disallow: /
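As a quick sanity check on the rule above, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule lines are taken from the post; the `example.com` URL and the `SomeOtherBot` name are made up for illustration. It only shows what a compliant parser would conclude and says nothing about whether a given bot actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the post above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    url = "https://example.com/some/page.html"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Note that the sketch passes the token `GPTBot` rather than the full user-agent string.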

tangor

3:27 am on Aug 7, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've seen one other "ai bot" attempting the same, but it was half-hearted and did not rise to the level of nuisance. I have a feeling, however, that many more are masquerading with human-like UAs, all in the quest to fill their coffers with "the latest information" to fuel the computerized babble.

brotherhood of LAN

6:51 am on Aug 7, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




[platform.openai.com...]


User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)





lucy24

6:30 pm on Aug 7, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: quick run to raw logs ::

Oh, willya look at that. 5 August (Saturday): 548 blocked requests from the described UA and IP. I don't generally notice blocked requests, and 548 (three http, the rest https) isn't enough to make logs noticeably plumper.

Oddly, every last request got a 429. (I do not understand this response; I remember it started showing up in logs last October or so, and tend to suspect host is using it in a non-standard way.) That includes three requests for robots.txt. Aren't they supposed to wait a reasonable amount of time before trying to crawl when there's a problem accessing robots.txt itself?
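For anyone wanting to repeat this kind of log check, a rough sketch follows. It assumes logs in the common Apache/nginx "combined" format; the sample lines are invented for illustration and are not taken from anyone's real logs. It tallies response codes for requests whose user agent contains the GPTBot token.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# Invented sample lines, for illustration only.
SAMPLE = [
    f'40.83.2.70 - - [05/Aug/2023:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 429 0 "-" "{GPTBOT_UA}"',
    f'40.83.2.70 - - [05/Aug/2023:10:00:10 +0000] "GET /page1.html HTTP/1.1" 429 0 "-" "{GPTBOT_UA}"',
    f'40.83.2.71 - - [05/Aug/2023:10:00:20 +0000] "GET /page2.html HTTP/1.1" 403 0 "-" "{GPTBOT_UA}"',
    '203.0.113.9 - - [05/Aug/2023:10:00:25 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]


def status_counts(lines, ua_token="GPTBot"):
    """Count response status codes for lines whose user agent contains ua_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and ua_token in match.group("ua"):
            counts[match.group("status")] += 1
    return counts


if __name__ == "__main__":
    for status, count in sorted(status_counts(SAMPLE).items()):
        print(status, count)
```

To run it against a real file, pass an open file object instead of `SAMPLE`; lines that do not fit the assumed format are simply skipped.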

jmccormac

12:59 am on Aug 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Seeing this pest hitting my website with tens of thousands of requests a day. Have 403ed it for a while to see if it was a well-written crawler and then moved to an IP level block across all published IP ranges. Basically, the site has the hosting history of over 800 million domain names back to 2000 and this is what the bot seems to be trying to get.

[openai.com...]

20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28

Regards...jmcc
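For checking whether a logged address falls inside the ranges listed above, a small sketch with Python's standard `ipaddress` module may help. The nine CIDR blocks are the ones quoted in the post as of August 2023; the published list may have changed since, so treat the hard-coded list as an assumption.

```python
import ipaddress

# The nine /28 blocks quoted in the post (as of August 2023; may be outdated).
GPTBOT_RANGES = [
    "20.15.240.64/28",
    "20.15.240.80/28",
    "20.15.240.96/28",
    "20.15.240.176/28",
    "20.15.241.0/28",
    "20.15.242.128/28",
    "20.15.242.144/28",
    "20.15.242.192/28",
    "40.83.2.64/28",
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in GPTBOT_RANGES]


def in_gptbot_ranges(ip: str) -> bool:
    """Return True if ip lies inside any of the listed /28 blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)


if __name__ == "__main__":
    for ip in ("20.15.240.70", "40.83.2.64", "20.15.240.112", "8.8.8.8"):
        print(ip, in_gptbot_ranges(ip))
```

A check like this can also flag requests that carry the GPTBot user agent but arrive from outside the listed blocks.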

jmccormac

1:16 am on Aug 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@tangor Have seen a few botnet scraper operations since late April. One of them included the scraper API token. Others were more subtle. There is an active (as recently as yesterday, 18/August/2023) botnet (possibly related) using a combination of ISP and mobile ISP IP ranges.

Regards...jmcc

jmccormac

3:36 am on Aug 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Reset the counters at approximately 02:05, and by 04:30 the block count for OpenAI GPTBot was up to 12,923 requests.

Regards...jmcc

lucy24

6:07 am on Aug 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: repeat visit to logs ::

Fancy that. Three days later, on the 8th, they tried my personal site. Same pattern--three http, all others https--this time from 20.14.240-242, with requests coming at roughly ten-second intervals. It looks as if the ones that happened to get 429 were attempted again later, this time getting the expected 403. (Ha!)

All requests were for pages that are accessible from the root, either directly or indirectly, but they didn't go in spidering order; that is, they started with some random interior pages, meaning that they already knew what to ask for. And--a truly unexpected point--they appear to have been robots.txt compliant. There were no requests for anything in roboted-out directories. (Since they were hitherto unknown to me, they did not find their own name, just the generic don't-go-here-no-matter-who-you-are.)

At some time when I wasn't looking, the whole chunk 20.0-15 was acquired by Microsoft, though I never noticed because they haven't used it for much of anything. Microsoft isn't going into the AWS or G*** Cloud business, is it?

jmccormac

6:26 am on Aug 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Microsoft Azure? Think I've seen a 20.b.c.d Bingbot and also a DuckDuckGo crawler. Block count is at 25,594 since 02:05. It is blocked at an IP level and was 403ed prior to that. The requests were spread over the ranges of IPs. The OpenAI site doesn't have any contact details, so deep-sixing the IP ranges is probably the best way to deal with it.

Regards...jmcc

blend27

3:40 pm on Sep 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This thing just ATE 1237 403's, came from 20.15.240.70.

My question is where did they get a list of URLs to try to GET?

@jmccormac: "...to see if it was a well-written crawler"

Not in my case. This thing tripped a 1-pixel bot trap that is the last link on a page and is disallowed in the robots.txt file.

blend27

4:45 pm on Sep 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just checked 3 other sites, more than 17000 requests. Disallowed directories are requested, all denied though...

Here is a piece of code for IIS Web.config that completely DROPS all requests originating from the nine /28s listed on their site:
<!-- Goes inside <system.webServer> in Web.config -->
<security>
  <!-- denyAction="AbortRequest" drops the connection instead of sending an error response -->
  <ipSecurity allowUnlisted="true" denyAction="AbortRequest">
    <!-- GPTBot openai.com/gptbot-ranges.txt -->
    <add ipAddress="20.15.240.64" subnetMask="255.255.255.240"/> <!-- 20.15.240.64/28 -->
    <add ipAddress="20.15.240.80" subnetMask="255.255.255.240"/> <!-- 20.15.240.80/28 -->
    <add ipAddress="20.15.240.96" subnetMask="255.255.255.240"/> <!-- 20.15.240.96/28 -->
    <add ipAddress="20.15.240.176" subnetMask="255.255.255.240"/> <!-- 20.15.240.176/28 -->
    <add ipAddress="20.15.241.0" subnetMask="255.255.255.240"/> <!-- 20.15.241.0/28 -->
    <add ipAddress="20.15.242.128" subnetMask="255.255.255.240"/> <!-- 20.15.242.128/28 -->
    <add ipAddress="20.15.242.144" subnetMask="255.255.255.240"/> <!-- 20.15.242.144/28 -->
    <add ipAddress="20.15.242.192" subnetMask="255.255.255.240"/> <!-- 20.15.242.192/28 -->
    <add ipAddress="40.83.2.64" subnetMask="255.255.255.240"/> <!-- 40.83.2.64/28 -->
  </ipSecurity>
</security>
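If the published list changes, the `<add>` lines above could be regenerated rather than typed by hand. The following is only a sketch: the function name `iis_add_entry` is made up for this example, and the CIDR list is the one quoted in this thread, which may no longer be current. It uses Python's standard `ipaddress` module to turn each CIDR block into the address and subnet mask pair that `ipSecurity` expects.

```python
import ipaddress

# CIDR blocks as quoted in this thread (may be outdated).
GPTBOT_RANGES = [
    "20.15.240.64/28",
    "20.15.240.80/28",
    "20.15.240.96/28",
    "20.15.240.176/28",
    "20.15.241.0/28",
    "20.15.242.128/28",
    "20.15.242.144/28",
    "20.15.242.192/28",
    "40.83.2.64/28",
]


def iis_add_entry(cidr: str) -> str:
    """Format one CIDR block as an <add> element for IIS ipSecurity."""
    net = ipaddress.ip_network(cidr)
    return (
        f'<add ipAddress="{net.network_address}" '
        f'subnetMask="{net.netmask}"/> <!-- {cidr} -->'
    )


if __name__ == "__main__":
    for cidr in GPTBOT_RANGES:
        print(iis_add_entry(cidr))
```

A /28 always maps to the mask 255.255.255.240, which is why every entry in the config above shares the same `subnetMask`.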

lucy24

5:31 am on Sep 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is this a robotic script that anyone can use? I just found it visiting a minor site whose logs I only process once a month, so “just found” = it was there several weeks ago. Again, random pattern of requests, again omitting roboted-out directories, again from 20.15.whatever-it-was, again at ten-second intervals. That does make me suspect it's got an option for ignoring/honoring robots.txt--and maybe another for spacing of requests--which would explain the widely different results seen in this thread.

So far I haven't seen it at all on two test sites that are wholly roboted-out--not even to check for robots.txt. Does it have a plainclothes sidekick to do preliminary scouting?