Forum Moderators: mack

Best method to stop Bing indexing old crawls

         

jehoshua

11:46 pm on Feb 6, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Bing keeps crawling our site using details from a (very) old sitemap. I have submitted the new sitemap through Bing Webmaster Tools and sent Bing an email about it, yet the crawling based on the old sitemap continues. Here is an example:

GET /index.php?xml_sitemap=params=pt-post-2014-10 HTTP/1.1
GET /index.php?xml_sitemap=params=pt-post-2013-11 HTTP/1.1
GET /?p=35 HTTP/1.1
GET /?m=201312 HTTP/1.1
GET /?page_id=2 HTTP/1.1
GET /?paged=5&cat=1 HTTP/1.1


That is just a sample; there are hundreds of these each day. The server responds with a "200", so any crawler will think the URL is still valid. I can see a number of options to stop this:

1. Modify .htaccess to return a 404 or similar on those links
2. Modify the file index.php and write some code to force a 404 or similar.
3. Modify robots.txt
4. Other methods?

The current robots.txt file is

User-agent: *
sitemap: http://example.com/Sitemap.xml

User-agent: MauiBot
Disallow: /


The file Sitemap.xml is the new one, and I have validated the robots.txt via Bing a number of times. I have also requested a reindex by Bing, all to no avail. What is the best method to address this, please?

lucy24

1:27 am on Feb 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't believe any power on earth will make bing forget an URL. Pages removed up to three years ago? Still requested by bingbot. Pages redirected up to seven years ago? Still requested by bingbot. Pages changed to HTTPS over a year ago? Still requested as HTTP by bingbot. (In each of these cases, bingbot is not unique, but it definitely mops up a disproportionate number of redirects.) Oh, and they persist in requesting half a dozen variants of the sitemap--other than sitemap.xml--even though I have never in my life done anything to let them think some other name is in use.

Has your site entirely stopped using parameters, or have only selected ones been dropped? Even if bingbot is the most visible annoyance, there are other approaches, such as redirecting any with-query request to the queryless version of the same URL, or returning a 410 for URLs that are genuinely gone and have no current equivalent. What happens to humans who have an outdated URL bookmarked?

Some crawlers will recognize robots.txt directives involving parameters; some won’t.

It's generally more efficient to return responses such as 404 straight from the server, rather than letting the request get all the way to /index.php and making it do the work. It also makes it easier to see what's going on if you ever have occasion to study your server access logs, since a 404 or 410 will then be logged as such, instead of as 200.

jehoshua

3:46 am on Feb 7, 2021 (gmt 0)




I don't believe any power on earth will make bing forget an URL. Pages removed up to three years ago? Still requested by bingbot. Pages redirected up to seven years ago? Still requested by bingbot. {snip}


Oh okay, it sounds like you have had a lot of similar problems over the years. As for the word "bing", I looked up what it means:

Definition 1 - "a heap, especially of metallic ore or of waste from a mine."
Definition 2 - "Prison solitary confinement, a term used by inmates."

Has your site entirely stopped using parameters, or have only selected ones been dropped?


It was a WordPress site, so it used to have lots of parameters. Currently it is being rebuilt and has only two PHP files and seven static HTML pages. No parameters at all.

Even if bingbot is the most visible annoyance, there are other approaches, such as redirecting any with-query request to the queryless version of the same URL, or returning a 410 for URLs that are genuinely gone and have no current equivalent.


I had never heard of a 410, and I see it means "Gone". Yes, I like that status code, thanks.

What happens to humans who have an outdated URL bookmarked?


Yes, good point. The actual source of a lot of these may be humans. The old site and its content have been gone for nearly 12 months, yet every day I see attempts to access the login area, admin areas, plugins that can be used for SQL injection, etc.

Some crawlers will recognize robots.txt directives involving parameters; some won’t.


Okay thanks.

It's generally more efficient to return responses such as 404 straight from the server, rather than letting the request get all the way to /index.php and making it do the work. It also makes it easier to see what's going on if you ever have occasion to study your server access logs, since a 404 or 410 will then be logged as such, instead of as 200.


Thanks, that has given me the start of a plan to try and address this. :)

jehoshua

4:37 am on Feb 7, 2021 (gmt 0)




I have done some minor testing. As most of those crawl requests have a "?" in them, this seems to work okay:

if (false !== strpos($_SERVER['REQUEST_URI'], '?')) {
    // There is a query string (including cases when it's empty)
    echo 'found question mark';
}


I just need to send the 410 instead of the echo. :)
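For illustration, a minimal sketch of that last step, assuming it runs at the very top of index.php before any other output. The function names request_has_question_mark() and send_gone() are made up for this example and are not from the thread; http_response_code() is the standard PHP function for setting the status.

```php
<?php
// Sketch only: both function names below are hypothetical.

// Same test as in the post: is there a "?" anywhere in the request URI?
function request_has_question_mark(string $requestUri): bool
{
    return strpos($requestUri, '?') !== false;
}

// Answer "410 Gone" with a short plain-text body.
function send_gone(): void
{
    http_response_code(410);
    header('Content-Type: text/plain; charset=utf-8');
    echo "410 Gone\n";
}

// Must run before any other output, so the status line can still be changed.
if (request_has_question_mark($_SERVER['REQUEST_URI'] ?? '')) {
    send_gone();
    exit;
}
```

This only makes sense while the site uses no parameters at all; once any page needs a query string again, the test would have to become more selective.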

lucy24

5:59 am on Feb 7, 2021 (gmt 0)




:: quick trip to php dot net to confirm a hunch, because I only speak about three words of php ::

Yup, there's a $_SERVER['QUERY_STRING'] so it ought to be possible simply to check whether it exists at all, rather than futzing around with strpos. (I thought there must be, since you can say %{QUERY_STRING} in a RewriteCond, and those are mostly standard server variables.)
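A rough sketch of that idea, with one caveat: on a normal web request with no query, $_SERVER['QUERY_STRING'] is usually present but empty, so the check below tests for a non-empty value rather than mere existence. The helper name has_query_string() is hypothetical, and it takes the server array as a parameter only so it can be tried with sample data.

```php
<?php
// Hypothetical helper, not from the thread.
function has_query_string(array $server): bool
{
    // Treat a missing key and an empty string the same way: no query.
    return isset($server['QUERY_STRING']) && $server['QUERY_STRING'] !== '';
}

// At the top of index.php, before any output:
if (has_query_string($_SERVER)) {
    http_response_code(410); // 410 Gone
    exit;
}
```

Unlike the strpos test on REQUEST_URI, this lets through a request that ends in a bare "?" with nothing after it.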

jehoshua

7:34 am on Feb 7, 2021 (gmt 0)




Yes, I feel a RewriteCond in .htaccess will be a better method than PHP. Thanks.
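For reference, a minimal .htaccess sketch of that approach. It assumes mod_rewrite is enabled, that the host allows rewrite directives in .htaccess, and that the rebuilt site uses no query strings at all, as described earlier in the thread. It has not been tested against this particular site.

```apache
# Assumption: no legitimate URL on the rebuilt site carries a query string.
RewriteEngine On

# Match any request whose query string is non-empty ...
RewriteCond %{QUERY_STRING} .
# ... and answer 410 Gone without touching index.php.
RewriteRule ^ - [G]
```

Because the response comes from the server itself, these requests should then show up in the access log as 410 rather than 200.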