Archive a Million Pages With wget in Minutes
Introduction
Webrecorder, Heritrix, Nutch, Scrapy, Colly and Frontera are popular tools for large-scale web crawling and archiving.
These tools have a learning curve, and some of them lack built-in support for the WARC (Web ARChive) output format.
wget comes bundled with most *nix systems and has built-in support for WARC output. In this article, we will see how to quickly archive web pages with wget.
Archiving with wget
In a previous article, we extracted a superset of the top 1 million domains. We can use that list of URLs to archive. Save the list to a file called urls.txt.
This can be archived with the following command.
file=urls.txt
wget -i "$file" --warc-file="$file" -t 3 --timeout=4 -q -o /dev/null -O /dev/null
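As a quick sanity check after a run, the number of captured pages can be counted from the WARC file itself. The sketch below assumes that each captured page is stored as a record whose header contains the line `WARC-Type: response`, as the WARC format defines. `count_responses` is a hypothetical helper name, and the count is only approximate because a page body could contain the same text.

```shell
# count_responses is a hypothetical helper, not part of wget.
# It prints the number of "response" records in a WARC file,
# whether the file is plain (.warc) or gzip-compressed (.warc.gz).
count_responses() {
    # gzip -cdf decompresses gzip input and copies plain input unchanged (GNU gzip).
    # grep -a treats binary payloads as text; -c prints the number of matching lines.
    gzip -cdf "$1" | grep -a -c '^WARC-Type: response'
}
```

For example, `count_responses urls.txt.warc` (or `urls.txt.warc.gz` if the archive is compressed) prints how many responses were recorded for the list.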
wget can resume partially downloaded files, but that option does not work with WARC output. So it is better to split the list into small chunks and process them separately. An added advantage of this approach is that multiple chunks can be downloaded in parallel.
mkdir -p chunks
split -l 1000 urls.txt chunks/ -d --additional-suffix=.txt -a 3
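To see what the split produces before running it on the full list, it can be tried on a small generated file. The sketch below assumes GNU split (for `-d` and `--additional-suffix`). The `example.com` URLs and the temporary directory are placeholders, not part of the original setup.

```shell
# Work in a throwaway directory so the real urls.txt is untouched
workdir=$(mktemp -d)

# Generate 2500 placeholder URLs (example.com is an illustrative assumption)
seq 1 2500 | sed 's|^|http://example.com/page/|' > "$workdir/urls.txt"

mkdir -p "$workdir/chunks"
split -l 1000 -d -a 3 --additional-suffix=.txt "$workdir/urls.txt" "$workdir/chunks/"

# Three chunks are expected: 000.txt and 001.txt with 1000 lines, 002.txt with 500
chunk_count=$(ls "$workdir"/chunks/*.txt | wc -l | tr -d ' ')
total_lines=$(cat "$workdir"/chunks/*.txt | wc -l | tr -d ' ')
echo "chunks: $chunk_count, lines: $total_lines"
```

If the total line count matches the input, no URL was lost in the split.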
This splits the file into chunks of 1000 URLs each. wget does not have multithreading support, but we can write a for loop that starts a separate process for each chunk.
for file in `ls -r chunks/*.txt`
do
    wget -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
done
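Because every wget is started in the background, the loop itself returns immediately. A minimal sketch of wrapping it in a function that blocks until every chunk is finished, using the shell's `wait` builtin, could look like this. `archive_all` and the `FETCH` variable are hypothetical names introduced here. `FETCH` only exists so that the loop can be dry-run with a stand-in command instead of wget.

```shell
# archive_all DIR: start one background fetch per chunk in DIR, then wait for all.
# FETCH is a hypothetical override for dry runs; it defaults to wget.
archive_all() {
    for chunk in "$1"/*.txt; do
        [ -f "$chunk" ] || continue   # skip the literal pattern if DIR has no chunks
        "${FETCH:-wget}" -i "$chunk" --warc-file="$chunk" \
            -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
    done
    wait   # returns once every background job started by this shell has exited
}
```

Calling `archive_all chunks` then returns only after all chunk downloads have ended, which makes it safe to run a compression or upload step right after it.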
Archiving 1000 URLs takes ~15 minutes. Since all the chunks are processed in parallel, the entire million pages will be downloaded in less than 20 minutes.
Each process takes ~8MB of memory, so running 1000 processes needs a system with 8GB+ of memory. Otherwise, the number of parallel processes should be reduced, which increases the overall run time.
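The trade-off between memory and run time can be estimated with shell arithmetic. This is a rough sketch that takes the figures quoted above at face value (about 15 minutes per 1000-URL chunk and about 8MB per process) and assumes chunks are processed in full waves of `maxproc` processes. The value `maxproc=50` is an illustrative choice.

```shell
chunks=1000           # 1,000,000 URLs / 1000 URLs per chunk
maxproc=50            # illustrative cap on parallel wget processes
minutes_per_chunk=15  # figure quoted in the text
mb_per_proc=8         # figure quoted in the text

# Number of waves needed, rounded up
waves=$(( (chunks + maxproc - 1) / maxproc ))
total_minutes=$(( waves * minutes_per_chunk ))
memory_mb=$(( maxproc * mb_per_proc ))

echo "waves=$waves minutes=$total_minutes memory_mb=$memory_mb"
# prints: waves=20 minutes=300 memory_mb=400
```

Under these assumptions, 50 processes need about 400MB of memory but stretch the run to roughly five hours. Real runs will differ, since processes do not finish in lockstep.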
Each archive chunk will be ~150MB, so the archives consume a lot of storage. All downloaded archives can be compressed with gzip to reduce storage.
gzip chunks/*.warc
Here is an idempotent shell script to download and archive files in batches.
#! /bin/sh
set -x

batch=1000
# suffix length for chunk names (3 digits for a batch size of 1000)
size=`expr ${#batch} - 1`
maxproc=50
file=urls.txt
dir=$HOME'/projects/chunks'$batch

mkdir -p "$dir"
split -l $batch "$file" "$dir"'/' -d --additional-suffix=.txt -a $size
sleep 1

useragent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

for file in `ls -r "$dir"/*.txt`
do
    warcfile=$file'.warc'
    warczip=$warcfile'.gz'

    # skip chunks that were already archived in an earlier run
    if [ -f "$warczip" ] || [ -f "$warcfile" ]; then
        continue
    fi

    # at the process limit: wait, and compress archives untouched for 5+ minutes
    while [ "$(pgrep -c wget)" -ge $maxproc ]
    do
        sleep 300
        for filename in `find "$dir" -name '*.warc' -mmin +5`
        do
            gzip -9 "$filename"
        done
    done

    echo "$file"
    # -H means --span-hosts in wget, so the user agent is set with --user-agent
    wget --user-agent="$useragent" -i "$file" --warc-file="$file" -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
    sleep 2
done
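Since the script skips chunks that already have an archive, progress between runs can be checked with the same test. A small sketch, where `pending_chunks` is a hypothetical helper that lists chunk files with neither a `.warc` nor a `.warc.gz` file next to them:

```shell
# pending_chunks DIR: print every chunk file in DIR that has no archive yet.
# Mirrors the script's skip test (FILE.warc or FILE.warc.gz exists).
pending_chunks() {
    for chunk in "$1"/*.txt; do
        [ -f "$chunk" ] || continue   # skip the literal pattern if DIR has no chunks
        if [ ! -f "$chunk.warc" ] && [ ! -f "$chunk.warc.gz" ]; then
            echo "$chunk"
        fi
    done
}
```

For example, `pending_chunks "$HOME/projects/chunks1000" | wc -l` reports how many chunks the next run of the script would still pick up.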
Conclusion
In this article, we have seen how to archive a million pages with wget in a few minutes.
wget2 has multithreading support, and it might get WARC output soon. With that, archiving with wget will become much easier.