
My website was vacuumed up by OpenAI

         

SumGuy

4:14 am on Apr 8, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



A couple of days ago my website was vacuumed up by OpenAI: about 300 MB across 600 file requests, all within the space of about an hour. It fetched even accessory files like the small GIFs used to render page frames and graphics, and even requested "editdata.mso" files that may once have been listed in my filelist.xml but were deleted long ago.

It mostly happened from 74.7.241.31 and 74.7.242.174.

These are Microsoft IPs with no host names (no reverse DNS). Scamalytics provides a bit of extra info, naming the organization behind the IPs as "Cloud". Spur identifies the IPs as an OpenAI crawler.
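For anyone who wants to repeat the reverse-DNS check on their own log entries, here is a minimal sketch. The helper name reverse_dns and its resolver parameter are made up for illustration; the only real call is Python's socket.gethostbyaddr, and a missing PTR record simply comes back as None.

```python
import socket

def reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return the PTR host name for ip, or None when no reverse DNS exists.

    resolver is a hypothetical hook so the lookup can be swapped out
    (for example in tests); by default it is socket.gethostbyaddr.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None

if __name__ == "__main__":
    # The two addresses reported in this thread.
    for ip in ("74.7.241.31", "74.7.242.174"):
        print(ip, reverse_dns(ip))
```

A None result only says there is no PTR record. It does not by itself say who operates the address.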

I've seen hits from all over 74.7.x.y from OpenAI's GPTBot since the middle of last year, so these IPs are not out of the ordinary, but I've seen nothing like this behavior. The user-agent was:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
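To put numbers on a crawl like this from your own logs, something along these lines may help. It is only a sketch: it assumes the server writes the common "combined" access log format, and the function name summarize_gptbot and the access.log path are placeholders, not anything from the posts above.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize_gptbot(lines):
    """Return {ip: (request_count, total_bytes)} for GPTBot user-agents."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or "GPTBot" not in m.group("agent"):
            continue
        entry = totals[m.group("ip")]
        entry[0] += 1
        size = m.group("bytes")
        # A "-" in the size field means no body was sent.
        entry[1] += 0 if size == "-" else int(size)
    return {ip: tuple(counts) for ip, counts in totals.items()}

if __name__ == "__main__":
    # "access.log" is a placeholder path; point it at your own log file.
    with open("access.log", encoding="utf-8", errors="replace") as fh:
        for ip, (count, size) in sorted(summarize_gptbot(fh).items()):
            print(f"{ip}\t{count} requests\t{size / 1e6:.1f} MB")
```

Matching on the user-agent string alone trusts whatever the client claims to be, so it is worth comparing the resulting IPs against ranges you already know.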

Either -

OpenAI has changed strategy: instead of asking for a dozen of my (mostly PDF) files on any given day, with lots of duplicate requests over any given week, they're now downloading and caching entire sites, maybe according to some criteria?

Or - someone gave ChatGPT a specific directive to look at my site and perform an analysis of my site (or my company), which resulted in this download frenzy.

Who else sees massive entire-site downloading from OpenAI?

shawnb61

6:47 pm on Apr 8, 2026 (gmt 0)

Top Contributors Of The Month



I saw a significant increase in OpenAI & ChatGPT starting ~3/23. Crawling everything, over & over. It's starting to fall back to normal volume now, though.

Our attachments aren't available to guests/bots, but they were getting whatever they could.

jmccormac

6:27 pm on Apr 13, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



MSFT is using some RIPE ranges, and those have the netname "cloud". Perhaps that's where the identification on the range is coming from. A netname is not necessarily the range owner's name. GPTBot generally respects robots.txt and I don't think I've seen any problems with it. Did it request robots.txt first? If not, deep-six the ranges.
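One rough way to answer the "did it request robots.txt first?" question from an access log is sketched below. It assumes a combined-format log in chronological order, and it checks by user-agent token rather than by IP, since a crawler may fetch robots.txt from a different address than the one doing the downloading. The function name robots_requested_first is made up for this example.

```python
import re

# Pulls the request path and the final quoted field (the user-agent)
# out of a combined-format log line. GET and HEAD only, as a simplification.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*".*"(?P<agent>[^"]*)"$')

def robots_requested_first(lines, token="GPTBot"):
    """Check the first logged request whose user-agent contains token.

    Returns True if that request was for /robots.txt, False if it was
    for anything else, and None if the crawler never appears in lines.
    """
    for line in lines:
        m = REQUEST.search(line.rstrip())
        if m and token in m.group("agent"):
            # Ignore any query string when comparing the path.
            return m.group("path").split("?")[0] == "/robots.txt"
    return None
```

A False here only covers the window of log you feed in. The crawler may have fetched and cached robots.txt before the log starts, so check a longer stretch before blocking anything.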

Regards...jmcc