Forum Moderators: phranque
2017-11-25:17:25:31
2017-11-25:17:25:31
URL: /xmlrpc.php
IP: 85.93.88.nnn
URL: /blog/xmlrpc.php
IP: 85.93.88.nnn
Cookie: wordpress_test_cookie=WP+Cookie+check
Cookie: wordpress_test_cookie=WP+Cookie+check
Accept-Language: en-US,en;q=0.8
Accept-Language: en-US,en;q=0.8
Content-Type: application/x-www-form-urlencoded
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
Cache-Control: max-age=0
Cache-Control: max-age=0
Content-Length: 217
Content-Length: 217
Connection: close
Connection: close
Host: example.com
Host: example.com
And here's a rare one involving two different legitimate robots:
2017-12-21:05:12:41
2017-12-21:05:12:41
URL: /
IP: 157.55.39.189
URL: /robots.txt
IP: 68.180.230.166
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Host: www.example.com
From: bingbot(at)microsoft.com
Accept-Encoding: gzip, deflate
Connection: close
Accept: */*
Accept: */*
Pragma: no-cache
Host: www.example.com
Connection: close
User-Agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Cache-Control: no-cache
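The two entries above appear to be interleaved line by line, which is what tends to happen when two requests append to the same file in the same second. One way to keep entries whole is to assemble each entry in memory and append it in a single write under an exclusive lock. The sketch below is in Python, not the PHP used on the site, and is only an illustration: `append_entry` and the file name are hypothetical, and `fcntl` is available on Unix-like systems only.

```python
import fcntl
import os


def append_entry(log_path, lines):
    """Append one complete log entry without interleaving with other writers."""
    # Build the whole entry first so it goes out in a single write call.
    entry = ("\n".join(lines) + "\n\n").encode("utf-8")
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # Exclusive advisory lock: other cooperating writers wait here.
        fcntl.flock(fd, fcntl.LOCK_EX)
        os.write(fd, entry)
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical file name; the values are taken from the entry above.
    append_entry("headers.log", [
        "2017-12-21:05:12:41",
        "URL: /robots.txt",
        "IP: 68.180.230.166",
    ])
```

The lock is advisory, so it only helps if every writer to the log goes through the same function.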
Is there any way to simply place the include necessary to write to the header log directly in the .htaccess file itself?
I suppose you could do it by rewriting all page requests, both external and internal (like requests for the custom 403 page), to a php file that first invokes the logheaders function and then includes the entire requested page. That's how I log headers on requests for robots.txt. But I tend to doubt this is the best way to achieve the intended result. You can't put php directly in htaccess, if that's what you meant.
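The "log first, then serve the page" idea described above can be sketched as a wrapper around the page handler. This is a Python (WSGI) illustration of the pattern only, not the PHP logheaders function from this thread; `header_logging` and `log_lines` are hypothetical names, and the entry layout simply imitates the log excerpts shown earlier.

```python
import time


def header_logging(app, log_lines):
    """Wrap a WSGI app so each request's headers are recorded before it runs."""
    def wrapped(environ, start_response):
        log_lines.append(time.strftime("%Y-%m-%d:%H:%M:%S"))
        log_lines.append("URL: " + environ.get("PATH_INFO", ""))
        log_lines.append("IP: " + environ.get("REMOTE_ADDR", ""))
        for key, value in environ.items():
            # WSGI exposes most request headers as HTTP_* keys.
            if key.startswith("HTTP_"):
                name = key[5:].replace("_", "-").title()
                log_lines.append(name + ": " + value)
        # Only now hand the request to the real page.
        return app(environ, start_response)
    return wrapped
```

Content-Type and Content-Length arrive under their own WSGI keys (`CONTENT_TYPE`, `CONTENT_LENGTH`), so a fuller version would copy those as well.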
Consider using crawl-rate and crawl-delay directives instead.
Does bingbot honor crawl-delay? That Other Search Engine, as we all know, explicitly doesn't.
I get more hits from DuckDuckGo, which crawls 1/100th as much as the one IP I still allow Bing.
I think you misunderstand the way indexing works. You cannot block all Bing IPs except one and expect to benefit from the Bing Search Index sending you good traffic. All you're doing is blocking your site from being indexed, which translates to blocking visitors. No wonder you aren't getting as much traffic from Bing as from DDG.
bingbot does not obey your every crawl rate whim.
Sorry, I don't understand your meaning.
[edited by: keyplyr at 3:50 am (utc) on Oct 21, 2018]
Did you have any thoughts about my access header logging?
It may be sufficient on its own for smaller sites; however, I feel that on medium to larger sites, blocking by header should be only one part of a comprehensive security approach.
It may very well be - I'll have to study this further before piping up again - that my act of blindly opening the only copy of my header log instead of copying it elsewhere and reading that immediately disrupts any attempt to write to it, thus leaving these otherwise inexplicable lacunae in the logging.
Oh, cripes. I don't think it would ever occur to me to open an in-progress log file. Well, in the case of access and error logs, I'm pretty sure I couldn't if I wanted to; in all cases I download them to my HD and take it from there. Come to think of it, Fetch interprets a double-click as Download, so I could never open a file by accident. (Plenty of accidental downloads, though, and then I have to remember where the computer puts them by default.)
The most apparent thing I've noticed so far logging headers is how few visitors actually get logged, whether successful 200s or, now, 403s (at least mine). By far the most regular entry in my header logs is the big G. Perhaps this follows logically from G spidering page after actual page, while others may somehow be hitting my site in ways that do not trigger the header.php includes.
ErrorDocument 403 /forbidden.php
<h1>message</h1>
and the identical includes code from up above, from my header.php theme/child theme file.
210.112.232.* [10/Oct/2018:17:15:51 POST /blog/xmlrpc.php HTTP/1.1 403 638-Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
2018-10-10:17:15:51
URL: /blog/xmlrpc.php
IP: 210.112.232.*
Content-Length: 217
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Cookie: wordpress_test_cookie=WP+Cookie+check
Host: example.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
[edited by: not2easy at 8:35 pm (utc) on Oct 21, 2018]
[edit reason] anonymized IP with * [/edit]
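Entries in the layout shown above (a timestamp line, then `Name: value` lines) are easy to load into a structure for comparison against the access log. A minimal Python sketch, assuming one entry per call and that layout; `parse_entry` is a hypothetical helper, not part of any logging code in this thread.

```python
def parse_entry(text):
    """Split one header-log entry into (timestamp, fields)."""
    lines = [line for line in text.splitlines() if line.strip()]
    timestamp = lines[0].strip()
    fields = {}
    for line in lines[1:]:
        # Split on the first ": " only; values such as User-Agent contain colons.
        name, _, value = line.partition(": ")
        fields[name.strip()] = value.strip()
    return timestamp, fields
```

It assumes entries are intact; an interleaved entry would need to be untangled first, since repeated names overwrite each other in the dictionary.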
It shows you just how few humans there are on your site, and just how many bots are scraping you. If a bot only accesses an image, you won't get a header log entry.
But the vast majority of robots request only pages, and every one of those--including the custom 403 page sent out on every 403, regardless of requested filetype--includes logheaders code. Currently, on my site the only requests whose headers don't get logged are the ones intercepted by mod_security on the server level. (I know they exist, because they're listed in access logs and error logs, but they never reach my userspace.)