How To Install Private Python Packages With Pip


To distribute python code, we need to package it and host it somewhere, so that users can install and use it. If the code is public, it can be published to PyPi or any public repository, so that anyone can access it. If the code is private, we need to provide proper authentication mechanism before allowing users to access it.

In this article, we will see how to use pip to install Python packages hosted on GitLab, GitHub, Bitbucket or any other services.


To package python project, we need to create setup.py file which is build script for setuptools. Below is a sample setup file to create a package named library.

import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

    description="A simple python package",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",

Python provides detailed packaging documentation on structuring and building the package.


Once module(s) is packaged and pushed to hosting service, it can be installed with pip.

# using https
$ pip install git+https://github.com/chillaranand/library.git

# using ssh
pip install git+ssh://git@github.com/chillaranand/library.git

This usually requires authentication with usersname/password or ssh key. This setup works for developement machines. To use it in CI/CD pipelines or as a dependency, we can use tokens to simplify installation.

$ export GITHUB_TOKEN=foobar

$ pip install git+https://$GITHUB_TOKEN@github.com/chillaranand/library.git


In this article, we have seen how to package python code and install private packages with pip. This makes it easy to manage dependencies or install packages on multiple machines.

Django Tips & Tricks #11 - Finding High-impact Performance Bottlenecks


When optimizing performance of web application, a common mistake is to start with optimizing the slowest page(or API). In addition to considering response time, we should also consider the traffic it is receving to priorotize the order of optimization.

In this article we will profile a django webapp, find high-impact performance bottlenecks and then start optimization them to yield better performance.


django-silk is an open source profiling tool which intercepts and stores HTTP requests data. Install it with pip.

pip install django-silk

Add silk to installed apps and include silk middleware in django settings.



Run migrations so that Silk can create required database tables to store profile data.

$ python manage.py makemigrations
$ python manage.py migrate
$ python manage.py collectstatic

Include silk urls in root urlconf to view the profile data.

urlpatterns += [url(r'^silk/', include('silk.urls', namespace='silk'))]

On silk requests page(http://host/silk/requests/), we can see all requests and sort them by overall time or time spent in database.

High Impact Bottlenecks

Silk creates silk_request table which contains information about the requests processed by django.

$ pgcli

library> \d silk_request;

| Column             | Type                     | Modifiers   |
| id                 | character varying(36)    |  not null   |
| path               | character varying(190)   |  not null   |
| time_taken         | double precision         |  not null   |

We can group these requests data by path, calculate number of requests, average time taken and impact factor of each path. Since we are considering response time and traffic, impact factor will be product of average response time and number of requests for that path.

library> SELECT
     s.*, round((s.avg_time * s.count)/max(s.avg_time*s.count) over ()::NUMERIC,2) as impact
     (select path, round(avg(time_taken)::numeric,2) as avg_time, count(path) as count from silk_request group by PATH)
 ORDER BY impact DESC;

| path                    | avg_time   | count   | impact   |
| /point/book/book/       | 239.90     | 1400    | 1.00     |
| /point/book/data/       | 94.81      | 1900    | 0.54     |
| /point/                 | 152.49     | 900     | 0.41     |
| /point/login/           | 307.03     | 400     | 0.37     |
| /                       | 106.51     | 1000    | 0.32     |
| /point/auth/user/       | 494.11     | 200     | 0.29     |

We can see /point/book/book/ has highest impact even though it is neighter most visited nor slowest view. Optimizing this view first yields in overall better performance of webapp.


In this article, we learnt how to profile django webapp and identify bottlenecks to improve performance. In the next article, we wil learn how to optimize these bottlenecks by taking an in-depth look at them.

Archive Million Pages With wget In Minutes


webrecorder, heritrix, nutch, scrapy, colly, frontera are popular tools for large scale web crawling and archiving.

These tools require some learning curve and some of them don't have inbuilt support for warc(Web ARChive) output format.

wget comes bundled with most *nix systems and has inbuilt support for warc output. In this article we will see how to quickly archive web pages with wget.

Archiving with wget

In previous article we have extracted a superset of top 1 million domains. We can use that list or urls to archive. Save this list to a file called urls.txt.

This can be archived with the following command.

wget -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null

wget has the ability to continue partially downloaded files. But this option won't work with warc output. So, it is better to split this list into small chunks and process them. One added advantage of this approach is we can parallely download multiple chunks with wget.

mkdir -p chunks
split -l 1000 urls.txt chunks/ -d --additional-suffix=.txt -a 3

This will split the file into several chunks each containing 1000 urls. wget doesn't have multithreading support. We can write a for loop to schedule a seperate process for each chunk.

for file in `ls -r chunks/*.txt`
   wget -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null &

To archive 1000 urls, it takes ~15 minutes. In less than 20 minutes, it will download entire million pages.

Also, each process takes ~8MB of memory. To run 1000 process, a system needs 8GB+ memory. Otherwise, number of parallel processes should be reduced which increases overall run time.

Each archive chunk will be ~150MB and consume lot of storage. All downloaded acrhives can be zipped to reduce storage.

gzip *.warc

Here is an idempotent shell script to download and archive files in batches.

#! /bin/sh

set -x

size=`expr ${#batch} - 1`

mkdir -p $dir
split -l $batch $file $dir'/' -d --additional-suffix=.txt -a $size
sleep 1

useragent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

for file in `ls -r $dir/*.txt`
    if [ -f $warczip ] || [ -f $warcfile ]; then

    if [ $(pgrep wget -c) -lt $maxproc ]; then
        echo $file
        wget -H "user-agent: $useragent" -i $file --warc-file=$file -t 3 --timeout=4 -q -o /dev/null -O /dev/null &
        sleep 2
        sleep 300
        for filename in `find $dir -name '*.warc' -mmin +5`
            gzip $filename -9


In this article, we have seen how to archive million pages with wget in few minutes.

wget2 has multithreading support and it might have warc output soon. With that, archiving with wget becomes much easier.

Comparision Of Alexa, Majestic & Domcop Top Million Sites


Alexa, Majestic & Domcop(based on CommonCrawl data) provide top 1 million popular websites based on their analytics. In this article we will download this data and compare them using Linux command line tools.

Collecting data

Lets download data from above sources and extract domain names. The data format is different for each source. We can use awk tool to extract domains column from the source. After extracting data, sort it and save it to a file.

# alexa

$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

$ unzip top-1m.csv.zip

# data sorted by ranking
$ head -n 5 top-1m.csv

$ awk -F "," '{print $2}' top-1m.csv | sort > alexa

# domains after sorting alphabetically
$ head -n 5 alexa
# Domcop

$ wget https://www.domcop.com/files/top/top10milliondomains.csv.zip

$ unzip top10milliondomains.csv.zip

# data sorted by ranking
$ head -n 5 top10milliondomains.csv
"Rank","Domain","Open Page Rank"

$ awk -F "\"*,\"*" '{if(NR>1)print $2}' top10milliondomains.csv.zip | sort > domcop

# domains after sorting alphabetically
$ head -n 5 domcop
# Majestic

$ wget http://downloads.majestic.com/majestic_million.csv

# data sorted by ranking
$ head -n 5 majestic_million.csv

$ awk -F "\"*,\"*" '{if(NR>1)print $2}' majestic_million.csv | sort > majestic

# domains after sorting alphabetically
$ head -n 5 majestic

Comparing Data

We have collected and extracted domains from above sources. Lets compare the domains to see how similar they are using comm.

$ comm -123 alexa domcop --total
871851  871851  128149  total

$ comm -123 alexa majestic --total
788454  788454  211546  total

$ comm -123 domcop majestic --total
784388  784388  215612  total
$ comm -12 alexa domcop | comm -123 - majestic --total
31314   903165  96835   total

So, only 96,835(9.6%) domains are common between all the datasets and the overlap between any two sources is ~20%. Here is a venn diagram showing the overlap between them.


We have collected data from alexa, domcorp & majestic, extracted domains from it and observed that there is only a small overlap between them.

Setup Continous Deployment For Python Chalice


Chalice is a microframework developed by Amazon for quickly creating and deploying serverless applications in Python.

In this article, we will see how to setup continous deployment with GitHub and AWS CodePipeline.

CD Setup

Chalice provides cli command deploy to deploy from local system.

Chalice also provides cli command generate-pipeline command to generate CloudFormation template. This template is useful to automatically generate several resources required for AWS pipeline.

This by default uses CodeCommit repository for hosting code. We can use GitHub repo as a source instead of CodeCommit.

Chalice by default provides a build file to package code and push it to S3. In the deploy step, it uses this artifact to deploy the code.

We can use a custom buildpsec file to directly deploy the code from build step.

version: 0.1

      - echo Entering the install phase
      - echo Installing dependencies
      - sudo pip install --upgrade awscli
      - aws --version
      - sudo pip install chalice
      - sudo pip install -r requirements.txt

      - echo entered the build phase
      - echo Build started on `date`
      - chalice deploy --stage staging

This buildspec file install requirements and deploys chalice app to staging. We can add one more build step to deploy it production after manual intervention.


We have seen how to setup continous deployment for chalice application with GitHub and AWS CodePipeline.

Parsing & Transforming mitmproxy Request Flows

mitmproxy is a free and open source interactive HTTPS proxy. It provides command-line interface, web interface and Python API for interaction and customizing it for our needs.

mitmproxy provides an option to export web request flows to curl/httpie/raw formats. From mitmproxy, we can press e(export) and then we can select format for exporting.

Exporting multiple requests with this interface becomes tedious. Instead we can save all requests to a file and write a python script to export them.

Start mitmproxy with this command so that all request flows are appended to requests.mitm file for later use.

$ mitmproxy -w +requests.mitm

Here is a python script to parse this dump file and print request URLs.

from mitmproxy.io import FlowReader

filename = 'requests.mitm'

with open(filename, 'rb') as fp:
    reader = FlowReader(fp)

    for flow in reader.stream():

flow.request object has more attributes to provide information about the request.

In [31]: dir(flow.request)

We can use the mitmproxy export utilities to transform mitm flows to other formats.

In [32]: flow = next(reader.stream())

In [33]: from mitmproxy.addons import export

In [34]: export.curl_command(flow)
Out[34]: "curl -H 'Host:mitm.it' -H 'Proxy-Connection:keep-alive' -H 'User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' -H 'DNT:1' -H 'Accept:image/webp,image/apng,image/*,*/*;q=0.8' -H 'Referer:http://mitm.it/' -H 'Accept-Encoding:gzip, deflate' -H 'Accept-Language:en-US,en;q=0.9,ms;q=0.8,te;q=0.7' -H 'content-length:0' 'http://mitm.it/favicon.ico'"

In [35]: export.raw(flow)
Out[35]: b'GET /favicon.ico HTTP/1.1\r\nHost: mitm.it\r\nProxy-Connection: keep-alive\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36\r\nDNT: 1\r\nAccept: image/webp,image/apng,image/*,*/*;q=0.8\r\nReferer: http://mitm.it/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.9,ms;q=0.8,te;q=0.7\r\n\r\n'

In [36]: export.httpie_command(flow)
Out[36]: "http GET http://mitm.it/favicon.ico 'Host:mitm.it' 'Proxy-Connection:keep-alive' 'User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 'DNT:1' 'Accept:image/webp,image/apng,image/*,*/*;q=0.8' 'Referer:http://mitm.it/' 'Accept-Encoding:gzip, deflate' 'Accept-Language:en-US,en;q=0.9,ms;q=0.8,te;q=0.7' 'content-length:0'"

With these utilities we can transform mitmproxy request flow to curl command or any other custom form to fit our needs.

Linux Performance Analysis In Less Than 10 Seconds

If you are using a Linux System or managing a Linux server, you might come across a situation where a process is taking too long to complete. In this article we will see how to track down such performance issues in Linux.

Netflix TechBlog has an article on how to anlyze Linux performance in 60 seconds. This article provides 10+ tools to use in order to see the resource usage and pinpoint the bottleneck.

It is strenuous to remember all those tools/options and laborious to run all those commands when working on multiple systems.

Instead, we can use atop, a tool for one stop solution for performance analysis. Here is a comparision of atop with other tools from LWN.

atop shows live & historical data measurement at system level as well as process level. To get the glimpse of system resource(CPU, memory, network, disk) usage install and run atop with

$ sudo apt install --yes atop

$ atop

By default, atop shows resources used in the last interval only and sorts them by CPU usage. We can use

$ atop -A -f 4

-A sorts the processes automatically in the order of the most busy system resource.

-f shows both active as well as inactive system resources in the ouput.

4 sets refresh interval to 4 seconds.

Just by looking at the output of atop, we get a glimpse of overall system resource usage as well as individual processes resource usage.

Django Tips & Tricks #10 - Log SQL Queries To Console

Django ORM makes easy to interact with database. To understand what is happening behing the scenes or to see SQL performance, we can log all the SQL queries that be being executed. In this article, we will see various ways to achieve this.

Using debug-toolbar

Django debug toolbar provides panels to show debug information about requests. It has SQL panel which shows all executed SQL queries and time taken for them.

When building REST APIs or micro services where django templating engine is not used, this method won't work. In these situations, we have to log SQL queries to console.

Using django-extensions

Django-extensions provides lot of utilities for productive development. For runserver_plus and shell_plus commands, it accepts and optional --print-sql argument, which prints all the SQL queries that are being executed.

./manage.py runserver_plus --print-sql
./manage.py shell_plus --print-sql

Whenever an SQL query gets executed, it prints the query and time taken for it in console.

In [42]: User.objects.filter(is_staff=True)
Out[42]: SELECT "auth_user"."id",
  FROM "auth_user"
 WHERE "auth_user"."is_staff" = true

Execution time: 0.002107s [Database: default]

<QuerySet [<User: anand>, <User: chillar>]>

Using django-querycount

Django-querycount provides a middleware to show SQL query count and show duplicate queries on console.

| Type | Database  |   Reads  |  Writes  |  Totals  | Duplicates |
| RESP |  default  |    3     |    0     |    3     |     1      |
Total queries: 3 in 1.7738s

Repeated 1 times.
SELECT "django_session"."session_key",
"django_session"."session_data", "django_session"."expire_date" FROM
"django_session" WHERE ("django_session"."session_key" =
'dummy_key AND "django_session"."expire_date"
> '2018-05-31T09:38:56.369469+00:00'::timestamptz)

This package provides additional settings to customize output.

Django logging

Instead of using any 3rd party package, we can use django.db.backends logger to print all the SQL queries.

Add django.db.backends to loggers list and set log level and handlers.

    'loggers': {
        'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console', ],

In runserver console, we can see all SQL queries that are being executed.

(0.001) SELECT "django_admin_log"."id", "django_admin_log"."action_time", "django_admin_log"."user_id", "django_admin_log"."content_type_id", "django_admin_log"."object_id", "django_admin_log"."object_repr", "django_admin_log"."action_flag", "django_admin_log"."change_message", "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined", "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_admin_log" INNER JOIN "auth_user" ON ("django_admin_log"."user_id" = "auth_user"."id") LEFT OUTER JOIN "django_content_type" ON ("django_admin_log"."content_type_id" = "django_content_type"."id") WHERE "django_admin_log"."user_id" = 4 ORDER BY "django_admin_log"."action_time" DESC LIMIT 10; args=(4,)
[2018/06/03 15:06:59] HTTP GET /admin/ 200 [1.69,]

These are few ways to log all SQL queries to console. We can also write a custom middleware for better logging of these queries and get some insights.

How To Deploy Django Channels To Production

In this article, we will see how to deploy django channels to production and how we can scale it to handle more load. We will be using nginx as proxy server, daphne as ASGI server, gunicorn as WSGI server and redis for channel back-end.

Daphne can serve HTTP requests as well as WebSocket requests. For stability and performance, we will use uwsgi/gunicorn to serve HTTP requests and daphne to serve websocket requests.

We will be using systemd to create and manage processes instead of depending on third party process managers like supervisor or circus. We will be using ansible for managing deployments. If you don't want to use ansible, you can just replace template variables in the following files with actual values.

Nginx Setup

Nginx will be routing requests to WSGI server and ASGI server based on URL. Here is nginx configuration for server.

server {
    listen {{ server_name }}:80;
    server_name {{ server_name }} www.{{ server_name }};

    return 301 https://avilpage.com$request_uri;

server {
    listen {{ server_name }}:443 ssl;
    server_name {{ server_name }} www.{{ server_name }};

    ssl_certificate     /root/certs/avilpage.com.chain.crt;
    ssl_certificate_key /root/certs/avilpage.com.key;

    access_log /var/log/nginx/avilpage.com.access.log;
    error_log /var/log/nginx/avilpage.com.error.log;

    location / {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_redirect off;

    location /ws/ {
            proxy_http_version 1.1;

            proxy_read_timeout 86400;
            proxy_redirect     off;

            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Host $server_name;

    location /static {
        alias {{ project_root }}/static;

    location  /favicon.ico {
        alias {{ project_root }}//static/img/favicon.ico;

    location  /robots.txt {
        alias {{ project_root }}/static/txt/robots.txt;


WSGI Server Setup

We will use gunicorn for wsgi server. We can run gunicorn with

$ gunicorn avilpage.wsgi --bind --log-level error --log-file=- --settings avilpage.production_settings

We can create a systemd unit file to make it as a service.


WorkingDirectory={{ project_root }}
Environment="DJANGO_SETTINGS_MODULE={{ project_name }}.production_settings"
ExecStart={{ venv_bin }}/gunicorn {{ project_name}}.wsgi --bind --log-level error --log-file=- --workers 5 --preload

ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID


Whenever server restarts, systemd will automatically start gunicorn service. We can also restart gunicorn manually with

$ sudo service gunicorn restart

ASGI Server Setup

We will use daphne for ASGI server and it can be started with

$ daphne avilpage.asgi:application --bind --port 9000 --verbosity 1

We can create a systemd unit file like the previous one to create a service.

Description=daphne daemon

WorkingDirectory={{ project_root }}
Environment="DJANGO_SETTINGS_MODULE={{ project_name }}.production_settings"
ExecStart={{ venv_bin }}/daphne --bind --port 9000 --verbosity 0 {{project_name}}.asgi:application
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID



Here is an ansible playbook which is used to deploy these config files to our server. To run the playbook on server avilpage.com, execute

$ ansible-playbook -i avilpage.com, django_setup.yml


Now that we have deployed channels to production, we can do performance test to see how our server performs under load.

For WebSockets, we can use Thor to run performance test.

thor -C 100 -A 1000 wss://avilpage.com/ws/books/

Our server is able to handle 100 requests per second with a latency of 800ms. This is good enough for low traffic website.

To improve performance, we can use unix sockets instead of rip/port for gunicorn and daphne. Also, daphne has support for multiprocessing using shared file descriptors. Unfortunately, it doesn't work as expected. As mentioned here, we can use systemd templates and spawn multiple daphne process.

An alternate way is to use uvicorn to start multiple workers. Install uvicorn using pip

$ pip install uvicorn

Start uvicorn ASGI server with

$ uvicorn avilpage.asgi --log-level critical --workers 4

This will spin up 4 workers which should be able to handle more load. If this performance is not sufficient, we have to setup a load balancer and spin up multiple servers(just like scaling any other web application).

Reliable Way To Test External APIs Without Mocking

Let us write a function which retrieves user information from GitHub API.

import requests

def get_github_user_info(username):
    url = f'https://api.github.com/users/{username}'
    response = requests.get(url)
    if response.ok:
        return response.json()
        return None

To test this function, we can write a test case to call the external API and check if it is returning valid data.

def test_get_github_user_info():
    username = 'ChillarAnand'
    info = get_github_user_info(username)
    assert info is not None
    assert username == info['login']

Even though this test case is reliable, this won't be efficient when we have many APIs to test as it sends unwanted requests to external API and makes tests slower due to I/O.

A widely used solution to avoid external API calls is mocking. Instead of getting the response from external API, use a mock object which returns similar data.

from unittest import mock

def test_get_github_user_info_with_mock():
    with mock.patch('requests.get') as mock_get:
        username = 'ChillarAnand'

        mock_get.return_value.ok = True
        json_response = {"login": username}
        mock_get.return_value.json.return_value = json_response

        info = get_github_user_info(username)

        assert info is not None
        assert username == info['login']

This solves above problems but creates additional problems.

  • Unreliable. Even though test cases pass, we are not sure if API is up and is returning a valid response.
  • Maintenance. We need to ensure mock responses are up to date with API.

To avoid this, we can cache the responses using requests-cache.

import requests_cache


def test_get_github_user_info_without_mock():
    username = 'ChillarAnand'
    info = get_github_user_info(username)
    assert info is not None
    assert username == info['login']

When running tests from developer machine, it will call the API for the first time and uses the cached response for subsequent API calls. On CI pipeline, it will hit the external API as there won't be any cache.

When the response from external API changes, we need to invalidate the cache. Even if we miss cache invalidation, test cases will fail in CI pipeline before going into production.