Concurrent Downloads - Bash Vs Python

Anand Reddy Pandikunta

2016-05-29

I just found one more free Telugu book, Graded readings in modern literary Telugu by Golla Narayanaswami Reddy and Dan M Matson, in the Digital South Asia Library.

Unfortunately they didn't provide it as an ebook but as a set of 221 tif images.

I wrote a simple for loop in shell which downloaded all images one by one using wget.

      $ base_url="http://dsal.uchicago.edu"
      $ url="$base_url/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_%03g.gif"
      $ time -p for i in $(seq -f "$url" 1 221); do wget "$i"; done

It took 375 seconds, which was too slow. So I tried downloading them in parallel using xargs.

      $ time echo $(seq -f $url 1 221) | xargs -n 1 -P 36 wget

My laptop has a quad-core processor, so I tried 20, 24, 28, and 32 processes at a time.

With wget+xargs, the best timing was 13 seconds (CPU: 15%, processes: 28).

Next, I tried downloading them in parallel with GNU parallel.

      $ time seq -f $url 1 221 | parallel -j36 wget {}

With wget+parallel, the best timing was 12 seconds (CPU: 48%, processes: 24).
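To put these timings in perspective, the speedup over the serial wget loop can be worked out from the numbers reported above. This is only a small arithmetic sketch; `speedup` is a hypothetical helper name, and the inputs are the timings quoted in this post.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial one."""
    return serial_seconds / parallel_seconds


SERIAL_WGET = 375  # seconds, the serial wget loop reported in this post
best_runs = {'wget+xargs': 13, 'wget+parallel': 12}  # seconds, best runs reported

for tool, seconds in best_runs.items():
    print('{}: {:.1f}x faster'.format(tool, speedup(SERIAL_WGET, seconds)))
```

Both parallel runs come out roughly 30 times faster than the serial loop, so the one-second gap between them is small by comparison.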

Here is the CPU consumption and time taken at each step.

Once I was done with bash, I decided to try the same thing with Python and see how it goes.

I wrote a simple script using requests to download images.

      import sys
      from concurrent import futures

      import requests


      def download_image(url):
          # Fetch one image and save it under its original file name.
          r = requests.get(url)
          file_name = url.split('/')[-1]
          with open(file_name, 'wb') as fh:
              fh.write(r.content)


      base_url = 'http://dsal.uchicago.edu'
      book_url = base_url + '/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_{}.gif'
      # range() excludes its upper bound, so 222 is needed to include image 221.
      urls = [book_url.format(str(i).zfill(3)) for i in range(1, 222)]


      def download_serially():
          for url in urls:
              download_image(url)


      download_serially()

This took 244 seconds.
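Before starting a long download, it can be worth checking that the generated URL list really covers all 221 images. The following is a minimal sketch rather than part of the original script: `build_urls` is a hypothetical helper, and `{:03d}` is used as the Python counterpart of the `%03g` format passed to seq.

```python
BASE_URL = 'http://dsal.uchicago.edu'
BOOK_URL = BASE_URL + '/digbooks/images/PL4775.R4_1967/PL4775.R4_1967_{:03d}.gif'


def build_urls(first, last):
    """Return the image URLs for pages first..last, inclusive."""
    # range() stops one short of its second argument, hence last + 1.
    return [BOOK_URL.format(i) for i in range(first, last + 1)]


urls = build_urls(1, 221)
print(len(urls))   # number of URLs generated
print(urls[0])     # first image URL
print(urls[-1])    # last image URL
```

Printing the first and last entries makes an off-by-one mistake in the range easy to spot.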

To download the images in parallel, I used ThreadPoolExecutor from the concurrent.futures module.

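      def download_parallely():
          # The number of worker threads comes from the command line.
          workers = int(sys.argv[1])

          with futures.ThreadPoolExecutor(max_workers=workers) as executor:
              # Leaving the with block waits for all downloads to finish.
              result = executor.map(download_image, urls)


      download_parallely()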

I reused the previous script and added one more function, which queues the downloads on a thread pool. Then I ran the script with several worker counts.
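One caveat with executor.map: it returns a lazy iterator, and an exception raised inside a worker is only re-raised when the corresponding result is retrieved. Since the script never consumes that iterator, a failed download would pass silently. Here is a minimal sketch of one way to surface failures, using futures.as_completed. Both `fake_download` and `download_all` are hypothetical names, and `fake_download` stands in for download_image so the example runs without network access.

```python
from concurrent import futures


def fake_download(url):
    """Stand-in for download_image: fails for one URL, succeeds otherwise."""
    if url.endswith('_002.gif'):
        raise IOError('simulated failure for ' + url)
    return url.split('/')[-1]


def download_all(urls, workers):
    """Run fake_download over urls; return (saved file names, failed URLs)."""
    saved, failed = [], []
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each future back to its URL so failures can be reported.
        future_to_url = {executor.submit(fake_download, u): u for u in urls}
        for future in futures.as_completed(future_to_url):
            try:
                saved.append(future.result())  # re-raises the worker's exception
            except IOError:
                failed.append(future_to_url[future])
    return sorted(saved), failed


sample_urls = ['http://example.invalid/img_{:03d}.gif'.format(i) for i in range(1, 5)]
saved, failed = download_all(sample_urls, workers=2)
print(saved)
print(failed)
```

With real downloads, the list of failed URLs could be fed back into a second pass instead of rerunning everything.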

      $ time python down.py 28

In Python 3.5, ThreadPoolExecutor defaults max_workers to 5 times the number of processors when it is not given. I tried the same worker counts that I used for bash.

With requests+ThreadPoolExecutor, the best timing was 12 seconds (CPU: 36%, workers: 28).
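The sweep over worker counts can also be scripted instead of rerunning the command by hand. This is a rough sketch, assuming a hypothetical `fake_download` that sleeps briefly in place of a real request. The worker counts (1, 4, 8) are chosen only to keep the demo fast, and the timings it prints illustrate the method and say nothing about real network performance.

```python
import time
from concurrent import futures


def fake_download(url):
    """Stand-in for a network request: just wait a little."""
    time.sleep(0.01)
    return url


def time_with_workers(urls, workers):
    """Return the wall-clock seconds needed to process urls with N threads."""
    start = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # list() drains the iterator so every task has finished.
        list(executor.map(fake_download, urls))
    return time.perf_counter() - start


urls = ['page-{:03d}'.format(i) for i in range(1, 41)]
timings = {n: time_with_workers(urls, n) for n in (1, 4, 8)}
for n, seconds in sorted(timings.items()):
    print('{} workers: {:.2f}s'.format(n, seconds))
```

Swapping `fake_download` for the real download function and the counts for 20, 24, 28, and 32 would reproduce the runs described here in a single invocation.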

Here is the overall comparison.

For a simple concurrent download, xargs+wget seems to be the best option: at 13 seconds it is within a second of the other two approaches, and it used the least CPU (15%).
