Concurrent Downloads - Bash Vs Python
I just found one more free Telugu book, Graded Readings in Modern Literary Telugu by Golla Narayanaswami Reddy and Dan M. Matson, in the Digital South Asia Library. Unfortunately, they didn't provide it as an ebook but as a set of 221 tif images. So I wrote a simple for loop in shell to download all the images one by one using wget.
$ base_url="http://dsal.uchicago.edu"
$ url="$base_url/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_%03g.gif"
$ time -p sh -c "for i in \$(seq -f '$url' 1 221); do wget \$i; done"
That took 375 seconds, which was too slow, so I tried downloading them in parallel using xargs.
$ time seq -f "$url" 1 221 | xargs -n 1 -P 36 wget
My laptop has a quad core processor, so I tried with 20, 24, 28 and 32 processes at a time.
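A rough reconstruction of that sweep (my sketch, not the exact commands from the post) is just a loop over the -P values; each pass re-downloads the same 221 files, so only the timing output matters.

$ for procs in 20 24 28 32; do echo "processes: $procs"; time seq -f "$url" 1 221 | xargs -n 1 -P "$procs" wget -q; done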
With wget + xargs, the best timing was 13 seconds (CPU: 15%, processes: 28).
Then I tried downloading them in parallel again, this time with GNU parallel.
$ time seq -f "$url" 1 221 | parallel -j36 wget {}
With wget + parallel, the best timing was 12 seconds (CPU: 48%, processes: 24).
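As an aside (my addition, not something from the original runs), GNU parallel can also write a per-job log with its --joblog option, which makes it easy to see how long each individual wget took when comparing job counts; joblog.txt is just an assumed file name.

$ seq -f "$url" 1 221 | parallel -j24 --joblog joblog.txt wget -q {}

The JobRuntime column of joblog.txt then shows the per-download times.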
Here are the CPU usage and time taken at each step.
Once I was done with bash, I decided to try the same thing with Python and see how it went.
I wrote a simple script using requests to download the images.
import sys
from concurrent import futures

import requests


def download_image(url):
    # Fetch a single page image and save it under its original file name.
    r = requests.get(url)
    file_name = url.split('/')[-1]
    with open(file_name, 'wb') as fh:
        fh.write(r.content)


base_url = 'http://dsal.uchicago.edu'
book_url = base_url + '/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_{}.gif'
# 221 pages, numbered 001 to 221.
urls = [book_url.format(str(i).zfill(3)) for i in range(1, 222)]


def download_serially():
    for url in urls:
        download_image(url)


download_serially()
This took 244 seconds.
To download the images in parallel, I used ThreadPoolExecutor from the concurrent.futures module.
def download_parallely():
    # Number of worker threads is taken from the command line.
    workers = int(sys.argv[1])
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        executor.map(download_image, urls)


download_parallely()
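As a side note (my own sketch, not part of the original script), the same work can be dispatched with executor.submit and futures.as_completed, which makes it easy to spot pages that failed to download. download_parallely_checked is a hypothetical name, and it reuses download_image and urls from the script above.

def download_parallely_checked(workers):
    # Submit every URL, then report any download that raised an exception.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        todo = {executor.submit(download_image, url): url for url in urls}
        for future in futures.as_completed(todo):
            if future.exception() is not None:
                print('failed:', todo[future], future.exception())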
For the timings themselves, I used the previous script, just with the download_parallely function added to queue the tasks, and executed it with several worker counts.
$ time python down.py 28
The ThreadPoolExecutor documentation says it uses 5 times the number of processors as the default max_workers. I tried the same worker counts that I had used for bash.
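In-process, that sweep looks roughly like the sketch below (my own reconstruction; the actual runs simply invoked time python down.py N from the shell, and it reuses download_image and urls from the script above).

import time

# Time the thread-pool download for each worker count tried above.
for workers in (20, 24, 28, 32):
    start = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(download_image, urls))
    print(workers, 'workers:', round(time.perf_counter() - start, 1), 'seconds')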
With requests + ThreadPoolExecutor, the best timing was 12 seconds (CPU: 36%, workers: 28).
Here is the overall comparison.
For simple concurrent downloads, wget + xargs seems to be the best option.
Need further help with this? I am available for hire.