wget or curl?
wget is a tool to download content from the command line. In its basic form, it downloads a file simply by typing wget <url>
in your favorite terminal.
However, a quick look at the man page shows how powerful this tool is.
Similarly, curl is another tool to handle internet requests; however, a look at its man page shows that it supports more protocols than wget,
which only handles HTTP(S) and FTP requests.
On the other hand, wget can follow links (recursively), apply filters to your requests, transform relative links,…
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
In short, wget proves useful whenever you have to download part of a website while following links, while curl is very handy to tweak single requests in an atomic fashion.
Moreover, if you want to analyze web traffic, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a curl command from the web inspector, which makes the job far less painful than with netcat.
To conclude, I'm definitely not a wget/curl power user, so there may be very basic stuff here, since I don't use these tools on a daily basis.
Anyway, as I said, this section is here to help me remember these commands and reduce my google requests.
wget
Download a full repository
Download a repository selecting specific files
wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>
Where <n>
denotes the number of directory levels to omit when saving. For instance, to download the cover images from this blog at the address “https://blog.epheme.re/images/covers/”, you can type:
wget -r -np -nH --cut-dirs=2 -A jpg https://blog.epheme.re/images/covers/
Anyhow, a simpler method, if you don't need the directory structure (as in the above example), is to use the --no-directories/-nd
option. However, --cut-dirs can be useful if you need to keep some of the hierarchy (e.g., if the files are sorted into directories by date or category).
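For instance, the same covers as above can be dropped directly into the current directory, without recreating any hierarchy (a sketch, reusing the blog URL from the example):

```shell
# Flattened variant of the download above: -nd (--no-directories)
# saves every matching jpg into the current directory.
wget -r -np -nd -A jpg https://blog.epheme.re/images/covers/
```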
To reject some documents, you can also use the option -R/--reject, which accepts comma-separated suffixes and wildcard patterns; for full regular expressions there is --reject-regex, whose flavor can be specified using --regex-type.
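As a sketch (URL and patterns are placeholders; pcre support depends on how your wget was built):

```shell
# Recursive download that skips PDF and ZIP files;
# -R takes comma-separated suffixes or wildcard patterns.
wget -r -np -R "*.pdf,*.zip" https://example.com/docs/

# Same idea with a regular expression; --regex-type selects
# the flavor (pcre needs a wget compiled with libpcre).
wget -r -np --regex-type=pcre --reject-regex '\.(pdf|zip)$' https://example.com/docs/
```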
Mirror a website
Another useful application of wget is making a local copy of a website. To do this, the long version is:
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
The option names are quite self-explanatory, and the shortened version is: wget -mkEp -nH -np <url>
Ignoring robots.txt
Sometimes, robots.txt forbids you access to some resources. You can easily bypass this with the option -e robots=off.
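Combined with the mirroring command above, this could look like (the URL is a placeholder):

```shell
# Mirror a site while telling wget to ignore robots.txt;
# -e executes a .wgetrc-style command before the download starts.
wget -mkEp -nH -np -e robots=off https://example.com/
```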
Number of tries
Occasionally, when the server fails to answer, wget
will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is large). You can lower this bound using the --tries/-t
option.
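For example (URL and values are placeholders), lowering both the retry count and the per-attempt timeout:

```shell
# Give up after 3 attempts instead of 20, and cap each
# attempt at 10 seconds.
wget --tries=3 --timeout=10 https://example.com/big-file.iso
```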
Finding 404 on a website
Using the --spider
option, which checks links without actually downloading the files, you can use wget as a debugger for your website, together with --output-file/-o
to log the results to a file.
wget --spider -r -nd -o <logfile> <url>
The list of broken links is then summarized at the end of the log file.
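You can also pull the 404s straight out of the log with grep. A minimal sketch, using a fabricated one-line log excerpt (the exact log format may vary across wget versions):

```shell
# A fabricated excerpt of a wget log, for illustration only.
printf 'HTTP request sent, awaiting response... 404 Not Found\n' > crawl.log

# Extract the lines reporting a 404 (add -B 2 for a couple
# of lines of context, which usually include the URL).
grep '404 Not Found' crawl.log
```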
Curl
Send a POST request
My most frequent use of curl is to send POST requests to different kinds of APIs; the syntax is quite simple using the --form/-F
option:
curl -F <field1>=<content1> -F <field2>=<content2> <url>
Note that to send a file, precede the filename with an @:
curl -F picture=@face.jpg <url>
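curl also distinguishes @ from <: the former attaches the file as an upload, while the latter only reads the file's contents into the field's value. A sketch (URL and filenames are placeholders):

```shell
# Upload face.jpg as a file attachment, and fill the "comment"
# field with the contents of comment.txt (not as an attachment).
curl -F picture=@face.jpg -F 'comment=<comment.txt' https://example.com/upload
```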