wget/curl

Posted by Fabrice on Monday, July 25, 2022

wget or curl?

wget is a tool for downloading content from the command line. In its basic form, it lets you download a file just by typing wget <url> in your favorite terminal.

However, a quick look at the man page shows how powerful this tool is.

Similarly, curl is another tool for handling internet requests. A look at its man page shows that it supports more protocols than wget, which only handles http(s) and ftp requests.

On the other hand, wget can follow links (recursively), apply filters on your requests, transform relative links, and so on. Thus, they don't cover the same area of usage (even if the intersection is non-empty). In short, wget will prove useful whenever you have to download part of a website while exploring links, while curl is very handy for tweaking single requests in an atomic fashion. Moreover, if you want to analyze web traffic, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a curl command from the web inspector, which makes the job less painful than with netcat.

To conclude, I'm definitely not a wget/curl power user, so there may be very basic stuff here, since I'm not using those tools on a daily basis. Anyway, as I said, this section is here to help me remember these commands and reduce my Google searches.

wget

Download a full repository

Download a repository selecting specific files

wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>

Here, <n> is the number of leading directory components to strip from the saved paths. For instance, to download the cover images from this blog at the address “https://blog.epheme.re/images/covers/”, you can run:

wget -r -np -nH --cut-dirs=2 -A jpg https://blog.epheme.re/

Anyhow, a simpler method, if you don't need the directory structure (as in the above example), is to use the --no-directories/-nd option. However, --cut-dirs can be useful if you need some structural information (e.g., if the files are sorted in directories by date or category).

To reject some documents, you can use the --reject/-R option, which takes a comma-separated list of suffixes or wildcard patterns. To filter with regular expressions, use --accept-regex/--reject-regex instead (the regex flavor can be specified using --regex-type).
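As a minimal sketch of the flat variant, the helper below combines --no-directories with an accept list and a reject regex. The function name fetch_flat, the example.org URL in the usage note, and the /thumbnails/ pattern are placeholders chosen for illustration, not something taken from this blog.

```shell
# fetch_flat is a hypothetical helper name, not part of wget.
#   $1: comma-separated list of extensions to accept (e.g. "jpg,png")
#   $2: start URL
fetch_flat() {
    # --no-directories saves every file in the current directory.
    # --reject-regex skips any URL matching the pattern; '/thumbnails/'
    # is only an example pattern.
    wget --recursive --no-parent --no-directories \
        --accept "$1" \
        --reject-regex '/thumbnails/' \
        "$2"
}
```

It would be called as fetch_flat jpg,png https://example.org/images/ and should leave the matching files directly in the current directory.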

Mirror a website

Another handy use of wget is simply making a local copy of a website. To do this, the long version is:

wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>

The option names are quite self-explanatory, and the short version is: wget -mkEp -np -nH <url>

Ignoring robots.txt

Sometimes, robots.txt forbids access to some resources. You can easily bypass this with the option -e robots=off.

Number of tries

Occasionally, when the server is too busy to answer you, wget will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is long). You can lower this bound using the --tries/-t option.
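For instance, a mirror that gives up quickly could look like the sketch below. The helper name mirror_impatient and the values (3 tries, 10-second timeout) are arbitrary choices for illustration; --timeout is the wget option that sets the network timeout in seconds.

```shell
# mirror_impatient is a hypothetical helper name.
#   $1: URL of the site to mirror
mirror_impatient() {
    # 3 tries and a 10-second timeout are example values, tune them
    # to the server you are talking to.
    wget --mirror --no-parent --tries=3 --timeout=10 "$1"
}
```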

Finding 404 on a website

With the --spider option, wget does not actually download files, so you can use it as a debugger for your website, with --output-file/-o to log the result to a file.

wget --spider -r -nd -o <logfile> <url>

The list of broken links is then summarized at the end of the log file.
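To pull that summary out of a long log, something like the following sketch may help. It assumes the log is in English and that the summary starts with a line of the form "Found N broken links." and ends at the "FINISHED" line; that layout may differ between wget versions and locales. The helper name broken_links is made up.

```shell
# broken_links is a hypothetical helper name.
#   $1: log file written by wget --spider ... -o <logfile>
broken_links() {
    # Start printing after the "Found N broken link(s)." line,
    # stop at the "FINISHED" line, and skip empty lines in between.
    awk '/^Found [0-9]+ broken links?\.$/ { found = 1; next }
         /^FINISHED/                      { found = 0 }
         found && NF                      { print }' "$1"
}
```

It would be used as broken_links <logfile>, printing one URL per line.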

curl

Send a POST request

My most frequent use of curl is to send POST requests to different kinds of APIs. The syntax is quite simple using the --form/-F option:

curl -F <field1>=<content1> -F <field2>=<content2> <url>

Note that to send a file, precede the filename with an @:

curl -F picture=@face.jpg <url>
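Putting both forms together, here is a small sketch that sends a text field and a file in the same request. The helper name upload_picture, the field names, and the image/jpeg content type are assumptions made for the example; the ;type= suffix is curl's way of setting the content type of an uploaded file.

```shell
# upload_picture is a hypothetical helper name.
#   $1: path of the picture to upload
#   $2: URL of the endpoint that receives the form
upload_picture() {
    # --fail makes curl return a non-zero exit code on HTTP errors;
    # --silent --show-error hides the progress bar but keeps error messages.
    # "title" and "picture" are example field names.
    curl --fail --silent --show-error \
        -F "title=My face" \
        -F "picture=@$1;type=image/jpeg" \
        "$2"
}
```

A call would look like upload_picture face.jpg <url>.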

tags: wget, curl