Archive Million Pages With wget In Minutes


Introduction

webrecorder, heritrix, nutch, scrapy, colly, frontera are popular tools for large scale web crawling and archiving.

These tools require some learning curve and some of them don't have inbuilt support for warc(Web ARChive) output format.

wget comes bundled with most *nix systems and has inbuilt support for warc output. In this article we will see how to quickly archive web pages with wget.

Archiving with wget

In previous article we have extracted a superset of top 1 million domains. We can use that list or urls to archive. Save this list to a file called urls.txt.

This can be archived with the following command.

file=urls.txt
wget -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null

wget has the ability to continue partially downloaded files. But this option won't work with warc output. So, it is better to split this list into small chunks and process them. One added advantage of this approach is we can parallely download multiple chunks with wget.

mkdir -p chunks
split -l 1000 urls.txt chunks/ -d --additional-suffix=.txt -a 3

This will split the file into several chunks each containing 1000 urls. wget doesn't have multithreading support. We can write a for loop to schedule a seperate process for each chunk.

for file in `ls -r chunks/*.txt`
do
   wget -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
done

To archive 1000 urls, it takes ~15 minutes. In less than 20 minutes, it will download entire million pages.

Also, each process takes ~8MB of memory. To run 1000 process, a system needs 8GB+ memory. Otherwise, number of parallel processes should be reduced which increases overall run time.

Each archive chunk will be ~150MB and consume lot of storage. All downloaded acrhives can be zipped to reduce storage.

gzip *.warc

Here is an idempotent shell script to download and archive files in batches.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#! /bin/sh

set -x

batch=1000
size=`expr ${#batch} - 1`
maxproc=50
file=urls.txt
dir=$HOME'/projects/chunks'$batch


mkdir -p $dir
split -l $batch $file $dir'/' -d --additional-suffix=.txt -a $size
sleep 1

useragent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'


for file in `ls -r $dir/*.txt`
do
    warcfile=$file'.warc'
    warczip=$warcfile'.gz'
    if [ -f $warczip ] || [ -f $warcfile ]; then
        continue
    fi

    if [ $(pgrep wget -c) -lt $maxproc ]; then
        echo $file
        wget -H "user-agent: $useragent" -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
        sleep 2
    else
        sleep 300
        for filename in `find $dir -name '*.warc' -mmin +5`
        do
            gzip $filename -9
        done
    fi
done

Conclusion

In this article, we have seen how to archive million pages with wget in few minutes.

wget2 has multithreading support and it might have warc output soon. With that, archiving with wget becomes much easier.

Comments

Comments powered by Disqus