As part of migrating this website from Hugo to Ghost, I wanted to quickly check what links are broken so that I can fix them.
wget isn't the best tool for the job, but it offers a pretty quick sanity check. The --spider option tells wget to crawl the target URL (or a local HTML file [1]) without saving the pages it fetches.
Crawl a target URL and save the output to a log file:
$ wget --spider -o wget-justyn.io.log -e robots=off -rp https://justyn.io/
-rp
-r crawls links recursively, and -p also downloads page requisites such as CSS files; the annotated breakdown below spells out each flag.
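For reference, here is the same command with every flag spelled out; the notes are my own reading of the wget man page:
# --spider        check that each page exists instead of saving it to disk
# -o FILE         write all of wget's log output to FILE
# -e robots=off   ignore robots.txt so the whole site gets crawled
# -r              follow links recursively
# -p              also fetch page requisites such as CSS and images
$ wget --spider -o wget-justyn.io.log -e robots=off -rp https://justyn.io/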
From that log file, do a quick check to see what non-200 status codes were returned:
$ grep 'HTTP request sent,' wget-justyn.io.log | grep -v '200'
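If you want a tally of every status code at once, a small variant works too; it leans on the assumption that the status code is the only three-digit number on wget's "HTTP request sent" lines:
$ grep 'HTTP request sent,' wget-justyn.io.log | grep -Eo '[0-9]{3}' | sort | uniq -c | sort -rn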
This should give you a quick idea of whether there are any major problems. You can investigate further with:
$ grep -A4 "\-\-" wget-justyn.io.log | less
Each request block in the log starts with a line of the form --timestamp--  URL, which is what the "\-\-" pattern matches; -A4 then shows the response lines that follow each URL. In less, use / to search for specific status codes, e.g. /404.
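To pull out just the URLs behind a particular status code, something along these lines can work as well. It assumes the status line appears within a few lines of the --timestamp--  URL line that opens each request block, so treat it as a rough filter rather than an exact report:
$ grep -B6 '404 Not Found' wget-justyn.io.log | grep -Eo 'https?://[^ ]+' | sort -u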
Alternatives
Google "broken link checker" or something similar to get lots of 3rd party services that offer tools to search for broken links. One that I tried is http://validator.w3.org/
References
wget's --spider option is also useful for checking an exported local HTML file containing bookmarks from a browser. ↩︎
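For that local-file case, something like the following should do it; bookmarks.html is a hypothetical browser export, and -i plus --force-html tell wget to read and parse the URLs out of a local HTML file instead of fetching a start page:
$ wget --spider -o wget-bookmarks.log --force-html -i bookmarks.html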