Concurrent Downloads - Bash vs Python

I just found one more free Telugu book, Graded Readings in Modern Literary Telugu by Golla Narayanaswami Reddy and Dan M. Matson, in the Digital South Asia Library.
Unfortunately they didn't provide it as an ebook but as a set of 221 GIF images.
I wrote a simple for loop in the shell that downloaded all the images one by one using wget.
$ base_url="http://dsal.uchicago.edu"
$ url="$base_url/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_%03g.gif"
$ time for i in $(seq -f "$url" 1 221); do wget "$i"; done
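seq -f treats its argument as a printf-style format string, and %03g zero-pads each number to three digits so it matches the image file names. For example, with a made-up name:
$ seq -f "page_%03g.gif" 1 3
page_001.gif
page_002.gif
page_003.gif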
It took 375 seconds. That was too slow, so I tried downloading them in parallel using xargs.
$ time seq -f "$url" 1 221 | xargs -n 1 -P 36 wget
My laptop has a quad-core processor, so I also tried with 20, 24, 28, and 32 processes at a time, as in the sketch below.
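A minimal sketch of how such a sweep can be timed (my reconstruction, not the exact commands I ran; the rm clears the previous run and wget -q silences the output):
$ for p in 20 24 28 32 36; do rm -f *.gif; echo "P=$p"; time seq -f "$url" 1 221 | xargs -n 1 -P "$p" wget -q; done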
With wget+xargs, the best timing is 13 seconds (CPU: 15%, Processes: 28).
Then I tried downloading them in parallel again, this time with GNU parallel.
$ time seq -f "$url" 1 221 | parallel -j36 wget {}
With wget+parallel, the best timing is 12 seconds (CPU: 48%, Processes: 24).
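As an aside (a variant I didn't time), parallel substitutes {} anywhere in the command, so seq -w can generate just the zero-padded page numbers:
$ seq -w 1 221 | parallel -j24 wget "$base_url/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_{}.gif"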
Here is the CPU consumption and time taken at each step:
[Chart: time taken and CPU usage for the bash runs]
Once I was done with bash, I decided to try the same thing with Python and see how it goes.
I wrote a simple script using requests to download images.
import sys
from concurrent import futures

import requests


def download_image(url):
    # Fetch the image and save it under its original file name.
    r = requests.get(url)
    file_name = url.split('/')[-1]
    with open(file_name, 'wb') as fh:
        fh.write(r.content)


base_url = 'http://dsal.uchicago.edu'
book_url = base_url + '/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_{}.gif'
# zfill(3) zero-pads the page number so it matches the file names (001 to 221).
urls = [book_url.format(str(i).zfill(3)) for i in range(1, 222)]


def download_serially():
    for url in urls:
        download_image(url)


download_serially()
This took 244 seconds.
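As an aside (not part of my timings), each requests.get above opens a new connection; a requests.Session reuses the underlying connection and could shave time off the serial run. A minimal sketch:

import requests

# A session keeps the TCP connection alive across requests to the same host.
session = requests.Session()


def download_image(url):
    r = session.get(url)
    file_name = url.split('/')[-1]
    with open(file_name, 'wb') as fh:
        fh.write(r.content)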
To download images in parallel, I used ThreadPoolExecutor from the concurrent.futures module.
def download_parallely():
    workers = int(sys.argv[1])

    # map() submits every URL to the pool; the with block waits for all of them.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        executor.map(download_image, urls)


download_parallely()
I used the previous script and just added one more function that queues the tasks. Then I executed the script with several worker counts.
$ time python down.py 28
The ThreadPoolExecutor documentation uses five times the number of processors as the default max_workers. I tried the same worker counts I had used for bash.
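For reference, a quick way to see what that default would be (the formula below matches the docs for Python 3.5-3.7; newer versions use min(32, os.cpu_count() + 4)):

import os

# Documented default for ThreadPoolExecutor in Python 3.5-3.7:
# five workers per processor, so 20 on a quad-core machine.
default_workers = (os.cpu_count() or 1) * 5
print(default_workers)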
With requests+ThreadPoolExecutor, the best timing is 12 seconds (CPU: 36%, Workers: 28).
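One caveat I should note: executor.map only re-raises a download's exception when its result is iterated, so failures can pass silently above. A sketch using futures.as_completed to report per-URL errors:

def download_with_error_reporting(workers):
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each future back to its URL so failures can be reported.
        future_to_url = {executor.submit(download_image, url): url
                         for url in urls}
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()  # re-raises any exception from download_image
            except Exception as exc:
                print('failed:', url, exc)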
Here is the overall comparison:
[Chart: overall comparison of time taken and CPU usage for bash and Python]
For a simple concurrent download, xargs+wget seems to be the best option.
