How can Windows IIS 10 prevent scraping?


Kendo

12:14 am on Aug 9, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am confounded by the inability to scrape a web site on IIS 10 while another website on the same server scrapes ok.

All I am trying to do is get a simple "Hello World!" and it doesn't matter if that text is created by response.write or is a static html page.

I can find no difference in the config between the two sites, and have tried scraping from an outside Apache server using both cURL and simple_html_dom. But still get a blank.

What could possibly be preventing the scrape?

phranque

1:41 am on Aug 9, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



But still get a blank.

what HTTP Response do you get from your HTTP Request?
status code?
headers with useful information?
a document in the response?
do you ever get a connection to the web server?
does the DNS request resolve?
etc...
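
The checklist above can be answered in one pass from PHP. This is only a minimal sketch, assuming the cURL extension is loaded; `probe()` and `summarize()` are hypothetical helper names, and the URL in the usage comment is a placeholder rather than anything from this thread.

```php
<?php
// Hypothetical helper: fetch $url once and report what actually came back.
function probe(string $url): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumption: arbitrary timeout in seconds
    // Collect each non-empty raw response header line as it arrives.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, string $line) use (&$headers): int {
        if (trim($line) !== '') {
            $headers[] = trim($line);
        }
        return strlen($line); // cURL expects the number of bytes handled
    });
    $body = curl_exec($ch);

    return [
        'errno'   => curl_errno($ch),                               // non-zero: DNS/connect/TLS failure
        'error'   => curl_error($ch),
        'status'  => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),   // 0 if no HTTP response arrived
        'ip'      => (string) curl_getinfo($ch, CURLINFO_PRIMARY_IP), // address that was contacted
        'headers' => $headers,
        'body'    => is_string($body) ? $body : '',
    ];
}

// Turn a probe() result into a one-line answer to the questions above.
function summarize(array $r): string
{
    if ($r['errno'] !== 0) {
        return "transport failure ({$r['errno']}): {$r['error']}";
    }
    $len = strlen($r['body']);
    return "HTTP {$r['status']} from {$r['ip']}, body {$len} bytes" . ($len === 0 ? ' (blank)' : '');
}

// Usage (placeholder URL):
// $r = probe('https://www.example.com/hello.html');
// echo summarize($r), "\n", implode("\n", $r['headers']), "\n";
```

A "blank" with a non-zero `errno` points at DNS or the connection, while a blank with a status code points at the web server's handling of the request.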

Kendo

2:09 am on Aug 9, 2022 (gmt 0)

No connection problem. Originally I thought that my partner's test server might be blocked, but I also tried the scrapes from my own Apache server... same blank.

Visiting the pages in any browser I can see the response as text.

phranque

2:56 am on Aug 9, 2022 (gmt 0)

Visiting the pages in any browser I can see the response as text.

we can start with this information.
so you got a response?
200 OK status code?
the text you see is the document you expected?

... tried the scrapes from my own Apache server... same blank.

are you saying you cannot see a document when you use another user agent (not a browser) such as curl?
what status code do you get when you use curl?
have you tried specifying a user agent string (using the curl -A option) that looks more like a human-operated browser?
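
The same experiment can be run from PHP by fetching the page once with no User-Agent and once with a browser-like one, then comparing the results. A rough sketch, assuming the cURL extension is loaded; `fetchWithAgent()` and `compareAgents()` are hypothetical names and the URL in the usage comment is a placeholder. As far as I know, PHP's cURL binding sends no User-Agent header at all unless one is set.

```php
<?php
// Fetch $url with the given User-Agent (null = set none); return [status, body length].
function fetchWithAgent(string $url, ?string $agent): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumption: arbitrary timeout in seconds
    if ($agent !== null) {
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    }
    $body = curl_exec($ch);
    return [(int) curl_getinfo($ch, CURLINFO_HTTP_CODE), is_string($body) ? strlen($body) : 0];
}

// Run the same request once per agent and build a label => result report.
// $fetch is injectable so the comparison can be exercised without a network.
function compareAgents(string $url, array $agents, callable $fetch = 'fetchWithAgent'): array
{
    $report = [];
    foreach ($agents as $label => $agent) {
        [$status, $length] = $fetch($url, $agent);
        $report[$label] = sprintf('HTTP %d, %d bytes', $status, $length);
    }
    return $report;
}

// Usage (placeholder URL):
// print_r(compareAgents('https://www.example.com/hello.html', [
//     'none'    => null,
//     'browser' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
// ]));
```

If the two rows differ, something on the server side is filtering on the User-Agent header.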

Kendo

3:13 am on Aug 9, 2022 (gmt 0)

My own tests used simple_html_dom. But my partner used curl so I will ask him to try.

Kendo

4:27 pm on Aug 9, 2022 (gmt 0)

Problem solved. phranque was correct in pointing out that a user-agent might help. So I added these 3 lines to the cURL script:

$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';

curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['SERVER_NAME']);
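
For anyone landing here later, this is roughly how those lines might sit in a complete script. The rest of the original script was not posted, so everything other than the user-agent and referer settings is an assumption: `scrapeOptions()` and `scrape()` are hypothetical names, the timeout and redirect settings are arbitrary, and the URL in the usage comment is a placeholder.

```php
<?php
// Build the cURL options, including the user-agent and referer additions from the post above.
function scrapeOptions(string $referer): array
{
    return [
        CURLOPT_RETURNTRANSFER => true, // hand the body back as a string
        CURLOPT_FOLLOWLOCATION => true, // assumption: follow redirects
        CURLOPT_TIMEOUT        => 15,   // assumption: arbitrary timeout in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0',
        CURLOPT_REFERER        => $referer,
    ];
}

// Fetch $url and return the body, or throw if the transfer itself failed.
function scrape(string $url, string $referer): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, scrapeOptions($referer));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Usage (placeholder URL):
// echo scrape('https://www.example.com/hello.asp', $_SERVER['SERVER_NAME'] ?? 'localhost');
```

Note that `$_SERVER['SERVER_NAME']` is filled in by the web server, so it is normally missing when the script runs from the command line; the `??` fallback in the usage line covers that case.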