crawl and download all pdf links

Forum Moderators: coopster

jackvull

7:59 am on May 22, 2011 (gmt 0)

I have a webpage that is full of PDF links (a PDF library). I would like to crawl through the page and download each PDF in turn.
What is the best way to go about this? Or is there an alternative like a free PDF downloader?

coopster

3:14 pm on May 24, 2011 (gmt 0)

Just one page? In pseudocode, I would

Retrieve the page using PHP (file_get_contents or something would work)
Use a regular expression to find all href attributes with ".pdf" at the end of the value
Loop through the links found in the last step and use file_get_contents or something again to grab each document in turn
During the loop, write each file's contents to your filesystem or database table, whichever method of storage you have decided to employ
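
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested crawler: the function names findPdfLinks and downloadPdfs, the example URL and the target directory are all made up for the example, and it assumes the links on the page are absolute URLs, that allow_url_fopen is enabled, and PHP 7 or later.

```php
<?php
// Hypothetical helper: return every href value that ends in ".pdf".
function findPdfLinks(string $html): array
{
    // Match href="..." or href='...' whose value ends in .pdf (case-insensitive).
    $pattern = '/href\s*=\s*(["\'])([^"\']+\.pdf)\1/i';
    preg_match_all($pattern, $html, $matches);
    // Group 2 holds the link values; drop duplicates and reindex.
    return array_values(array_unique($matches[2]));
}

// Hypothetical helper: fetch the page, then save each PDF into $targetDir.
// Returns the number of files written.
function downloadPdfs(string $pageUrl, string $targetDir): int
{
    $html = file_get_contents($pageUrl);
    if ($html === false) {
        return 0; // the page itself could not be retrieved
    }
    $saved = 0;
    foreach (findPdfLinks($html) as $link) {
        // Assumes absolute links; relative ones would need the site URL prepended.
        $data = file_get_contents($link);
        if ($data === false) {
            continue; // skip documents that fail to download
        }
        $path = $targetDir . DIRECTORY_SEPARATOR . basename($link);
        if (file_put_contents($path, $data) !== false) {
            $saved++;
        }
    }
    return $saved;
}

// Example call (placeholder URL and directory):
// echo downloadPdfs("http://www.example.com/library.html", "pdfs");
```

The pattern only accepts links where ".pdf" sits right before the closing quote, so a link such as file.pdf?id=3 would be skipped and would need a looser expression.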

andrewsmd

5:13 pm on May 24, 2011 (gmt 0)

I actually did this a couple of years back. We had to check all of our PDFs on file against a huge list. I wrote a program that grabbed the HTML using curl, then went through the contents. If this is a server-generated page, then there should be some sort of pattern to it. If you can find a pattern, then you can find the PDF links. Using a regex like coopster said would probably work, but regexes can be challenging for newbies. Try this code out and post back here when you have questions.

//put your url here of where the pdfs are
$url = "http://www.example.com";

//open with curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
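$html=curl_exec($ch);
curl_close($ch);

//$html is now the entire html contents of the page as one large string
//(left unencoded: htmlentities would turn "&" in a link into "&amp;")
if($html === false){
die("could not retrieve ".$url);
}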

//now what I did was I split the contents of the html
//based on an a href because these were in links
//this part you would need to edit to fit a pattern
//you find. You could also use a regex instead
//(explode does a plain string split; split() was removed in PHP 7)
$arrayPDFs = explode("a href", $html);

//initialize the array once, before the loop,
//so it is not reset for every link
$urls = array();

//once we have an array split on the links,
//we loop through
foreach($arrayPDFs as $i){

    //you would need to find the text that leads up to the pdf
    //i.e. http://www.example.com/someFile.pdf
    $start = strpos($i, "http://www.example.com/");

    //if the beginning part is in it
    if($start !== false){

        //get the end of the link because it's a pdf
        $end = strpos($i, ".pdf", $start);

        //if pdf is in the link then we have a valid file
        if($end !== false){

            //take everything from the start of the url through ".pdf" (4 characters)
            $filename = substr($i, $start, $end + 4 - $start);

            //add these to our array
            $urls[] = $filename;

        }//if end

    }//if start

}//foreach

//now we are going to download the pdfs, all in one batch
$mh = curl_multi_init();
$conn = array();
$fp = array();
foreach ($urls as $i => $url) {

    //put the path where you want them to go here
    $g="thePathWhereYouWantThePdfs\\".basename($url);
    if(!is_file($g)){
        $conn[$i]=curl_init($url);
        //"wb" so the pdf is written as binary
        $fp[$i]=fopen ($g, "wb");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
    }
}
do {
    $n=curl_multi_exec($mh,$active);
    //wait for activity instead of spinning the cpu
    if($active){
        curl_multi_select($mh, 1.0);
    }
}
while ($active && $n == CURLM_OK);

//only clean up the handles that were actually created above
foreach ($conn as $i => $c) {
    curl_multi_remove_handle($mh,$c);
    curl_close($c);
    fclose ($fp[$i]);
}
curl_multi_close($mh);
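
If neither the string splitting above nor a regex feels comfortable, another option is to let PHP parse the HTML and read the href attributes directly. The sketch below is an illustration only: the function name pdfLinksFromDom is made up for the example, and it assumes PHP 7 or later with the DOM extension enabled. It returns the links exactly as written in the page, so relative links would still need the site address prepended before downloading.

```php
<?php
// Hypothetical helper: collect href values ending in ".pdf" using DOMDocument.
function pdfLinksFromDom(string $html): array
{
    if ($html === '') {
        return array(); // loadHTML rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep the parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only links whose last four characters are ".pdf" (any case).
        if (strcasecmp(substr($href, -4), '.pdf') === 0) {
            $links[] = $href;
        }
    }
    return $links;
}
```

The returned array can be passed to the same download loop as the $urls array in the code above.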