grep -oh 'http://[^"]*' *.html | sort | uniq
The regex may need modifications if your HTML uses a single quote in HREF's instead of a double quote, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
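If your pages mix both quote styles, something like the following might work. This is only a sketch: the function name extract_links is made up for illustration, and matching https:// as well as http:// is my own assumption, not part of the original one-liner.

```shell
# Print every unique http:// or https:// URL found in the given files.
# The bracket expression [^"']* stops the match at either quote character,
# so both href="..." and href='...' are handled.
extract_links() {
    grep -ohE "https?://[^\"']*" "$@" | sort -u
}
```

You would call it as: extract_links *.html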
If you have your HTML files in a bunch of subdirectories, not to fear: recursion is here!
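grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq
Note that I added an "r" to the grep options so it descends into subdirectories. The file argument also changes from *.html to --include='*.html' with a starting directory of ".", because the shell expands *.html only in the current directory, and grep would otherwise never look at the HTML files further down. (--include is supported by GNU grep.)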
For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
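For anyone who wants a head start on the homework, here is one rough way it could look. Treat it as a sketch under a few assumptions: the function names are invented for illustration, any 2xx or 3xx response is counted as "OK", and it sends HEAD requests (curl -I), which some servers refuse even though the page itself is fine.

```shell
# Classify an HTTP status code: 2xx and 3xx count as OK, anything else as BROKEN.
classify_status() {
    case "$1" in
        2??|3??) echo OK ;;
        *)       echo BROKEN ;;
    esac
}

# Request only the headers of one URL and print "<verdict> <code> <url>".
# curl reports 000 as the code when it got no HTTP response at all
# (DNS failure, refused connection, or the 10 second timeout).
check_link() {
    code=$(curl -s -o /dev/null -I -m 10 -w '%{http_code}' "$1") || true
    echo "$(classify_status "$code") $code $1"
}

# Read URLs from standard input, one per line, and check each of them.
check_links() {
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            check_link "$url"
        fi
    done
}
```

To record the results, pipe the list of links into it:

grep -oh 'http://[^"]*' *.html | sort | uniq | check_links > results.txt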
I never really got the hang of these sorts of command lines. I find scripts easier to understand, although I guess they are harder to adapt to changing needs.