As part of migrating this website from Hugo to Ghost, I wanted to quickly check what links are broken so that I can fix them.
wget isn't the best tool for the job, but it offers a pretty quick sanity check. The --spider option tells wget to crawl the target URL (or a local HTML file [1]) without saving the pages it fetches.
Crawl a target URL and save the output to a log file:
$ wget --spider -o wget-justyn.io.log -e robots=off -rp https://justyn.io/
-rp
-r crawls links recursively, and -p also downloads page requisites such as CSS files; the annotated breakdown below spells out each flag.
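For reference, here is the same command with every flag spelled out; the notes are my own reading of the wget man page:
# --spider        check that each page exists instead of saving it to disk
# -o FILE         write all of wget's log output to FILE
# -e robots=off   ignore robots.txt so the whole site gets crawled
# -r              follow links recursively
# -p              also fetch page requisites such as CSS and images
$ wget --spider -o wget-justyn.io.log -e robots=off -rp https://justyn.io/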
From that log file, do a quick check to see what non-200 status codes were returned:
$ grep 'HTTP request sent,' wget-justyn.io.log | grep -v '200'
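If you want a tally of every status code at once, a small variant works too; it leans on the assumption that the status code is the only three-digit number on wget's "HTTP request sent" lines:
$ grep 'HTTP request sent,' wget-justyn.io.log | grep -Eo '[0-9]{3}' | sort | uniq -c | sort -rn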
This should give you a quick idea of whether there are any major problems. You can investigate further with:
$ grep -A4 "\-\-" wget-justyn.io.log | less
Each request block in the log starts with a line of the form --timestamp--  URL, which is what the "\-\-" pattern matches; -A4 then shows the response lines that follow each URL. In less, use / to search for specific status codes, e.g. /404.
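To pull out just the URLs behind a particular status code, something along these lines can work as well. It assumes the status line appears within a few lines of the --timestamp--  URL line that opens each request block, so treat it as a rough filter rather than an exact report:
$ grep -B6 '404 Not Found' wget-justyn.io.log | grep -Eo 'https?://[^ ]+' | sort -u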
Alternatives
Google "broken link checker" or something similar to get lots of 3rd party services that offer tools to search for broken links. One that I tried is http://validator.w3.org/
References
wget's --spider option is also useful for checking an exported local HTML file containing bookmarks from a browser. ↩︎
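For that local-file case, something like the following should do it; bookmarks.html is a hypothetical browser export, and -i plus --force-html tell wget to read and parse the URLs out of a local HTML file instead of fetching a start page:
$ wget --spider -o wget-bookmarks.log --force-html -i bookmarks.html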