Work From Home Tips For Non-remote Workers


Remote-first and remote-friendly companies have a different work culture and communication process compared to non-remote companies. Due to the COVID-19 worldwide pandemic, a majority of workers who didn't have prior remote experience are now working from home (WFH). This article provides some helpful tips for such people.

Work Desk

It is important to have a dedicated room, or at least a desk, for work. This creates a virtual boundary between your office work and personal life. Otherwise, you will end up working from the bed, the dining table, or the kitchen, which will result in body aches due to bad posture.

Get Ready

Image Credit: raywenderlich

Start your daily routine as if you are going to the office. It is easy to stop caring about personal grooming and attire when working from home, but your attire can influence your focus and productivity.

If you find getting ready hard, schedule video calls for all meetings with your colleagues. This might give you some additional motivation to get ready early in the morning. When work time starts, go to your work desk and start working.

Self Discipline

Schedule your work time. Whenever possible, try to stick to office working hours. Without a proper schedule, you will either end up underworking or overworking as your personal work and office work get mixed up.

Take regular breaks during work hours. Without any distractions, it is easy to get lost in the pixels for long durations. Taking short breaks for a quick walk and some fresh air outside will freshen you up.

With unlimited access to the kitchen and snacks, it is hard to avoid binge eating at home. But at least avoid binge eating during office hours.

Exercise. Since WFH involves sitting in a chair throughout the day, staying physically active is challenging, especially during this pandemic. Exercising a few minutes every morning, helping in the kitchen by making a meal or doing the dishes, or cleaning the house should help you stay physically active.

Seek Help

WFH can be lonely at times, as social interactions are quite limited. Schedule 1-to-1 meetings or virtual coffee meetings with your colleagues to increase social interaction. Discuss WFH problems with your colleagues, friends, and remote communities to see how they are tackling those problems.

How To Reduce Python Package Footprint?


PyPI hosts over 210K+ projects, and the average size of a Python package is less than 1MB. However, some of the most used packages in scientific computing, like NumPy and SciPy, have a large footprint, as they bundle shared libraries along with the package.

Build From Source

If a project needs to be deployed on AWS Lambda, the total size of the deployment package should be less than 250MB.

$ pip install numpy

$ du -hs ~/.virtualenvs/py37/lib/python3.7/site-packages/numpy/
 85M    /Users/avilpage/.virtualenvs/py37/lib/python3.7/site-packages/numpy/

numpy alone occupies 85MB of space on a Mac machine. If we include a couple of other packages like scipy and pandas, the overall size of the deployment package crosses 250MB.

An easy way to reduce the size of Python packages is to build them from source instead of using pre-compiled wheels.

$ CFLAGS='-g0 -Wl -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib' pip install numpy --global-option=build_ext

$ du -hs ~/.virtualenvs/py37/lib/python3.7/site-packages/numpy/
 23M    /Users/avilpage/.virtualenvs/py37/lib/python3.7/site-packages/numpy/

We can see the footprint has been reduced by ~70% when using an sdist instead of a wheel. This article provides more details about these CFLAGS optimizations when installing a package from source.

Shared Packages

When using a laptop with low storage for multiple projects with conflicting dependencies, a separate virtual environment is needed for each project. This leads to installing the same version of a package in multiple places, which increases the footprint.

To avoid this, we can create a shared virtual environment containing the most commonly used packages and share it across all environments. For example, we can create a shared virtual environment with all the packages required for scientific computing.

For each project, we can create a virtual environment and share all the packages of the common environment. If any project requires a specific version of a package, that version can be installed in the project environment.

$ cat common-requirements.txt  # shared across all environments
numpy==1.18.1
pandas==1.0.1
scipy==1.4.1

$ cat project1-requirements.txt  # project1 requirements
numpy==1.18.1
pandas==1.0.0
scipy==1.4.1

$ cat project2-requirements.txt  # project2 requirements
numpy==1.17
pandas==1.0.0
scipy==1.4.1

After creating a virtual environment for a project, we can create a .pth file with the path of the common virtual environment's site-packages so that all those packages are readily available in the new project.

$ echo '/users/avilpage/.virtualenvs/common/lib/python3.7/site-packages' > \
    ~/.virtualenvs/project1/lib/python3.7/site-packages/common.pth

Then we can install the project requirements which will install only missing packages.

$ pip install -r project1-requirements.txt
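The .pth mechanism can be verified end to end without installing anything heavy. Here is a minimal sketch; the /tmp/envs paths and the sharedmod module are made up for illustration (a real common env would hold numpy, pandas, etc.), and sysconfig is used to locate each env's site-packages.

```shell
# create a "common" env and a "project1" env (--without-pip keeps this fast)
python3 -m venv --without-pip /tmp/envs/common
python3 -m venv --without-pip /tmp/envs/project1

# locate each env's site-packages directory
COMMON=$(/tmp/envs/common/bin/python -c 'import sysconfig; print(sysconfig.get_path("purelib"))')
PROJECT=$(/tmp/envs/project1/bin/python -c 'import sysconfig; print(sysconfig.get_path("purelib"))')

# a stand-in module playing the role of a package installed in the common env
echo 'VALUE = 42' > "$COMMON/sharedmod.py"

# the .pth file makes the common site-packages visible to project1
echo "$COMMON" > "$PROJECT/common.pth"

# project1 now resolves the module from the common env
/tmp/envs/project1/bin/python -c 'import sharedmod; print(sharedmod.VALUE)'
```
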

Global Store

The above shared-packages solution has a couple of issues.

  1. The user has to manually create and track shared packages for each Python version and needs to bootstrap them in every project.
  2. When there is an incompatible version of a package across multiple projects, the user will end up with duplicate installations of the same version.

To solve this, we can have a global store of packages in a single location, segregated by Python and package version. Whenever a user tries to install a package, check if the package is in the global store. If not, install it into the global store. If present, just link the package to the virtualenv.

For example, numpy 1.17 for Python 3.7 and numpy 1.18 for Python 3.6 can be stored in the global store as follows.

$ python3.6 -m pip install --target ~/.mpip/numpy/3.6_1.18 numpy

$ python3.7 -m pip install --target ~/.mpip/numpy/3.7_1.17 numpy

# in project venv - use $HOME instead of '~', as paths in .pth files are not tilde-expanded
echo "$HOME/.mpip/numpy/3.7_1.17" > PATH_TO_ENV/lib/python3.7/site-packages/numpy.pth

With this, we can ensure each version of a package is stored on disk only once. I have created a simple package manager called mpip as a POC to test this, and it seems to work as expected.
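The install-or-link flow described above can be sketched in a few lines of Python. The ~/.mpip layout follows the example above; ensure_package and link_into_env are hypothetical helpers for illustration, not mpip's actual API.

```python
import os
import subprocess

STORE = os.path.expanduser("~/.mpip")


def ensure_package(name, version, py="3.7"):
    """Install into the global store only if this exact version is missing."""
    target = os.path.join(STORE, name, "{}_{}".format(py, version))
    if not os.path.isdir(target):
        subprocess.check_call([
            "python{}".format(py), "-m", "pip", "install",
            "--target", target, "{}=={}".format(name, version),
        ])
    return target


def link_into_env(env_site_packages, name, store_path):
    """Expose a stored package to a virtualenv via a .pth file."""
    pth_file = os.path.join(env_site_packages, name + ".pth")
    with open(pth_file, "w") as fh:
        fh.write(store_path + "\n")
```
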

These are a couple of ways to reduce the footprint of Python packages in a single environment as well as in multiple environments.

Disabling Atomic Transactions In Django Test Cases


TestCase is the most used class for writing tests in Django. To make tests faster, it wraps all the tests in two nested atomic blocks.

In this article, we will see where these atomic blocks create problems and find ways to disable them.

Select for update

Django provides a select_for_update() method on model managers, which returns a queryset that will lock rows until the end of the transaction.

from django.db import transaction


def update_book(pk, name):
    with transaction.atomic():
        book = Book.objects.select_for_update().get(pk=pk)
        book.name = name
        book.save()

When writing a test case for a piece of code that uses select_for_update, Django recommends not to use TestCase, as it might not raise TransactionManagementError.

Threads

Let us take a view which uses threads to get data from the database.

from concurrent.futures import ThreadPoolExecutor

from rest_framework.response import Response
from rest_framework.viewsets import ViewSet


def get_books(*args):
    queryset = Book.objects.all()
    serializer = BookSerializer(queryset, many=True)
    response = serializer.data
    return response


class BookViewSet(ViewSet):

    def list(self, request):
        with ThreadPoolExecutor() as executor:
            future = executor.submit(get_books, ())
            return_value = future.result()
        return Response(return_value)

A test which writes some data to db and then calls this API will fail to fetch the data.

from django.urls import reverse
from rest_framework.test import APIClient


class LibraryPaidUserTestCase(TestCase):
    def test_get_books(self):
        Book.objects.create(name='test book')

        self.client = APIClient()
        url = reverse('books-list')
        response = self.client.get(url)
        assert response.json()

Threads in the view create a new connection to the database, and they don't see the created test data, as the transaction is not yet committed.

Transaction Test Case

To handle the above two scenarios, or other scenarios where database transaction behaviour needs to be tested, Django recommends using TransactionTestCase instead of TestCase.

from django.test import TransactionTestCase

class LibraryPaidUserTestCase(TransactionTestCase):
    def test_get_books(self):
        ...

With TransactionTestCase, the database will be in autocommit mode, and threads will be able to fetch the data committed earlier.

Consider a scenario where there are other utility classes which are subclassed from TestCase.

class LibraryTestCase(TestCase):
    ...

class LibraryUserTestCase(LibraryTestCase):
    ...

class LibraryPaidUserTestCase(LibraryTestCase):
    ...

If we make LibraryTestCase subclass TransactionTestCase, it will slow down the entire test suite, as all the tests will run in autocommit mode.

If we subclass LibraryUserTestCase from TransactionTestCase instead, we will miss the functionality in LibraryTestCase. To prevent this, we can override the methods that handle atomic transactions.

If we look at the source code of TestCase, it has four methods to handle atomic transactions. We can override them to prevent the creation of atomic transactions.

class LibraryPaidUserTestCase(LibraryTestCase):
    # super(TestCase, ...) skips TestCase in the MRO, so its
    # atomic-transaction setup and teardown never run for this class
    @classmethod
    def setUpClass(cls):
        super(TestCase, cls).setUpClass()

    @classmethod
    def tearDownClass(cls):
        super(TestCase, cls).tearDownClass()

    def _fixture_setup(self):
        return super(TestCase, self)._fixture_setup()

    def _fixture_teardown(self):
        return super(TestCase, self)._fixture_teardown()

We can also create a mixin with the above methods and subclass it wherever this functionality is needed.
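The super(TestCase, cls) calls work because super() with an explicit class starts its MRO lookup after the named class, so that class's overrides are skipped. Here is a Django-free sketch of the same trick, with Middle standing in for TestCase and its atomic-block handling:

```python
class Base:
    @classmethod
    def setup(cls):
        return ["base"]


class Middle(Base):
    @classmethod
    def setup(cls):
        # stands in for TestCase's atomic-transaction setup
        return super().setup() + ["middle (atomic blocks)"]


class Leaf(Middle):
    @classmethod
    def setup(cls):
        # skip Middle's behaviour, exactly like super(TestCase, cls).setUpClass()
        return super(Middle, cls).setup()


print(Leaf.setup())  # ['base']
```

Leaf still inherits everything else from Middle; only the explicitly skipped method loses Middle's behaviour.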

Conclusion

Django wraps tests in TestCase inside atomic transactions to speed up the run time. When we are testing database transaction behaviours, we have to disable this using the appropriate methods.

Mastering DICOM - Part #1


Introduction

In hospitals, PACS (Picture Archiving and Communication System) simplifies the clinical workflow by reducing physical and time barriers. A typical radiology workflow looks like this.

Credit: Wikipedia

As per a doctor's request, a patient will visit a radiology center to undergo a CT/MRI/X-ray scan. Data captured from the modality (medical imaging equipment like a CT/MRI machine) will be sent to QA for verification and then sent to PACS for archiving.

Later, when the patient visits the doctor, the doctor can see this study on a workstation (which has a DICOM viewer) by entering the patient's details.

In this series of articles, we will see how to achieve this seamless digital transfer of medical data with DICOM.

DICOM standard

DICOM modalities create files in the DICOM format. A DICOM file has a header, which contains metadata, and a data set, which has modality info (equipment information, equipment configuration, etc.), patient information (name, sex, etc.), and the image data.

Storing and retrieving DICOM files from PACS servers is generally achieved through DIMSE DICOM for desktop applications and DICOMweb for web applications.

All the machines which transfer or receive DICOM data must follow the DICOM standard. With this, all the DICOM machines in a network can store and retrieve DICOM files from PACS.

When writing software to handle DICOM data, there are third-party packages that handle most of these things for us.

Conclusion

In this article, we have learnt the clinical radiology workflow and how the DICOM standard is useful in digitally transferring data between DICOM modalities.

In the next article, we will dig into DICOM file formats and learn about the structure of DICOM data.

Verifying TLS Certificate Chain With OpenSSL


Introduction

To communicate securely over the internet, HTTPS (HTTP over TLS) is used. A key component of HTTPS is the Certificate Authority (CA), which, by issuing digital certificates, acts as a trusted third party between a server (e.g., google.com) and clients (e.g., mobiles, laptops).

In this article, we will learn how to obtain certificates from a server and manually verify them on a laptop to establish a chain of trust.

Chain of Trust

A TLS certificate chain typically consists of a server certificate, which is signed by an intermediate certificate of a CA, which is in turn signed with the CA's root certificate.

Using OpenSSL, we can gather the server and intermediate certificates sent by a server using the following command.

$ openssl s_client -showcerts -connect www.github.com:443

CONNECTED(00000006)
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert High Assurance EV Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert SHA2 High Assurance Server CA
verify return:1
depth=0 C = US, ST = California, L = San Francisco, O = "GitHub, Inc.", CN = www.github.com
verify return:1
---
Certificate chain
 0 s:/C=US/ST=California/L=San Francisco/O=GitHub, Inc./CN=www.github.com
   i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High Assurance Server CA
-----BEGIN CERTIFICATE-----
MIIHMTCCBhmgAwIBAgIQDf56dauo4GsS0tOc8
MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlna
0wGjIChBWUMo0oHjqvbsezt3tkBigAVBRQHvF
aTrrJ67dru040my
-----END CERTIFICATE-----
 1 s:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High Assurance Server CA
   i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert High Assurance EV Root CA
-----BEGIN CERTIFICATE-----
MIIEsTCCA5mgAwIBAgIQBOHnpNxc8vNtwCtC
MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGln
0wGjIChBWUMo0oHjqvbsezt3tkBigAVBRQHv
cPUeybQ=
-----END CERTIFICATE-----

    Verify return code: 0 (ok)

This command internally verifies whether the certificate chain is valid. The output contains the server certificate and the intermediate certificate, along with their issuers and subjects. Copy the two certificates into server.pem and intermediate.pem files respectively.

We can decode these pem files and see the information in these certificates using

$ openssl x509 -noout -text -in server.pem

Certificate:
    Data:
        Version: 3 (0x2)
    Signature Algorithm: sha256WithRSAEncryption
    ----

We can also get only the subject and issuer of a certificate with

$ openssl x509 -noout -subject -issuer -in server.pem

subject= CN=www.github.com
issuer= CN=DigiCert SHA2 High Assurance Server CA

$ openssl x509 -noout -subject -issuer -in intermediate.pem

subject= CN=DigiCert SHA2 High Assurance Server CA
issuer= CN=DigiCert High Assurance EV Root CA

Now that we have both the server and intermediate certificates at hand, we need to look for the relevant root certificate (in this case, DigiCert High Assurance EV Root CA) on our system to verify them.

If you are using a Linux machine, all the root certificates are readily available in .pem format in the /etc/ssl/certs directory.

If you are using a Mac, open Keychain Access, then search for and export the relevant root certificate in .pem format.

We now have all 3 certificates in the chain of trust, and we can validate them with

$ openssl verify -verbose -CAfile root.pem -untrusted intermediate.pem server.pem
server.pem: OK

If there is some issue with validation, OpenSSL will throw an error with the relevant information.
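The whole verification flow can also be reproduced offline by generating a throwaway chain of our own. The subject names below are made up for the demo; the final command is the same verification we ran against the real certificates.

```shell
cd "$(mktemp -d)"

# root CA (self-signed)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Demo Root CA" \
    -keyout root.key -out root.pem -days 1

# intermediate CA, signed by the root (needs CA:TRUE to sign further certs)
openssl req -newkey rsa:2048 -nodes -subj "/CN=Demo Intermediate CA" \
    -keyout intermediate.key -out intermediate.csr
printf 'basicConstraints=CA:TRUE\n' > ca.ext
openssl x509 -req -in intermediate.csr -CA root.pem -CAkey root.key \
    -CAcreateserial -extfile ca.ext -out intermediate.pem -days 1

# leaf (server) certificate, signed by the intermediate
openssl req -newkey rsa:2048 -nodes -subj "/CN=demo.test" \
    -keyout server.key -out server.csr
openssl x509 -req -in server.csr -CA intermediate.pem -CAkey intermediate.key \
    -CAcreateserial -out server.pem -days 1

# verify the chain against our own root
openssl verify -CAfile root.pem -untrusted intermediate.pem server.pem
```
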

Conclusion

In this article, we learnt how to get certificates from a server and validate them against the root certificate using OpenSSL.

Writing & Editing Code With Code - Part 1


In the Python community, metaprogramming is often mentioned in conjunction with metaclasses. In this article, we will learn about metaprogramming, where programs have the ability to treat other programs as data.

Metaprogramming

When we start writing programs that write programs, it opens up a lot of possibilities. For example, here is a metaprogram that generates a program to print the numbers from 1 to 100.

with open('num.py', 'w') as fh:
    for i in range(1, 101):
        fh.write('print({})\n'.format(i))

These three lines generate a hundred-line program which produces the desired output when executed.

This is a trivial example and is not of much use. Let us see practical examples of where metaprogramming is used in Django: the admin, the ORM, inspectdb, and other places.

Metaprogramming In Django

Django provides a management command called inspectdb which generates Python code based on the SQL schema of the database.

$ ./manage.py inspectdb

from django.db import models

class Book(models.Model):
    name = models.CharField(max_length=100)
    slug = models.SlugField(max_length=100)
    ...

In the Django admin, models can be registered like this.

from django.contrib import admin

from book.models import Book


admin.site.register(Book)

Even though we have not written any HTML, Django will generate an entire CRUD interface for the model in the admin. The Django admin is a kind of metaprogram which inspects a model and generates a CRUD interface.

The Django ORM generates SQL statements for ORM statements written in Python.

In [1]: User.objects.last()
SELECT "auth_user"."id",
       "auth_user"."password",
       "auth_user"."last_login",
       "auth_user"."is_superuser",
       "auth_user"."username",
       "auth_user"."first_name",
       "auth_user"."last_name",
       "auth_user"."email",
       "auth_user"."is_staff",
       "auth_user"."is_active",
       "auth_user"."date_joined"
  FROM "auth_user"
 ORDER BY "auth_user"."id" DESC
 LIMIT 1


Execution time: 0.050304s [Database: default]

Out[1]: <User: anand>

Some frameworks/libraries use metaprogramming to solve problems related to generating, modifying, and transforming code.

We can also use these techniques in everyday programming. Here are some use cases.

  1. Generate REST API automatically.
  2. Automatically generate unit test cases based on a template.
  3. Generate integration tests automatically from the network traffic.
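As an illustration of the second use case, here is a sketch that generates unittest methods from a string template. The add function and the test cases are invented for the demo; a real version would render the template from a spec file or fixtures.

```python
import unittest

# template for one test method; filled in per test case
TEMPLATE = """
def test_add_{name}(self):
    self.assertEqual(add({a}, {b}), {expected})
"""


def add(a, b):
    return a + b


class GeneratedTests(unittest.TestCase):
    pass


cases = [("ints", 1, 2, 3), ("zeros", 0, 0, 0)]
for name, a, b, expected in cases:
    namespace = {"add": add}
    # exec turns the generated source into a real function object
    exec(TEMPLATE.format(name=name, a=a, b=b, expected=expected), namespace)
    setattr(GeneratedTests, "test_add_{}".format(name), namespace["test_add_{}".format(name)])

# run the generated tests programmatically
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GeneratedTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)  # 2
```
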

These are some of the things related to web development where we can use metaprogramming techniques to generate/modify code. We will learn more about this in the next part of the article.

Tips On Writing Data Migrations in Django Application


Introduction

In a Django application, when the schema changes, Django automatically generates a migration file for the schema changes. We can write additional migrations to change data.

In this article, we will learn some tips on writing data migrations in Django applications.

Use Management Commands

Applications can register custom actions with manage.py by creating a file in the management/commands directory of the application. This makes it easy to (re)run and test data migrations.

Here is a management command which migrates the status column of a Task model.

from django.core.management.base import BaseCommand
from library.tasks import Task


class Command(BaseCommand):

    def handle(self, *args, **options):
        status_map = {
            'valid': 'ACTIVE',
            'invalid': 'ERROR',
            'unknown': 'UNKNOWN',
        }
        tasks = Task.objects.all()
        for task in tasks:
            task.status = status_map[task.status]
            task.save()

If the migration is included directly in Django migration files, we have to roll back and re-apply the entire migration, which becomes cumbersome.

Link Data Migrations & Schema Migrations

If a data migration needs to happen before/after a specific schema migration, include the migration command using RunPython in the same schema migration, or create a separate migration file and add the schema migration as a dependency.

from django.db import migrations


def run_migrate_task_status(apps, schema_editor):
    from library.core.management.commands import migrate_task_status
    cmd = migrate_task_status.Command()
    cmd.handle()


class Migration(migrations.Migration):

    dependencies = [
    ]

    operations = [
        migrations.RunPython(run_migrate_task_status, migrations.RunPython.noop),
    ]

Watch Out For DB Queries

When working on a major feature that involves a series of migrations, we have to be careful with data migrations (which use the ORM) coming in between schema migrations.

For example, if we write a data migration script and then make schema changes to the same table in one go, the migration script fails, as the Django ORM will be in an invalid state for that data migration.

To overcome this, we can explicitly select only the required fields and process them, while ignoring all other fields.

# instead of
User.objects.all()

# use
User.objects.only('id', 'is_active')

As an alternative, we can use raw SQL queries for data migrations.
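For instance, the status rewrite from the management command above could be done with plain UPDATE statements. It is sketched here against an in-memory SQLite table; the library_task table name and columns are assumed for the demo, and in a Django migration you would run the same statements through connection.cursor().

```python
import sqlite3

# throwaway table standing in for the Task model's backing table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library_task (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO library_task (status) VALUES (?)",
    [("valid",), ("invalid",), ("unknown",)],
)

# one UPDATE per status value, instead of loading every row through the ORM
status_map = {"valid": "ACTIVE", "invalid": "ERROR", "unknown": "UNKNOWN"}
for old, new in status_map.items():
    conn.execute("UPDATE library_task SET status = ? WHERE status = ?", (new, old))

rows = [r[0] for r in conn.execute("SELECT status FROM library_task ORDER BY id")]
print(rows)  # ['ACTIVE', 'ERROR', 'UNKNOWN']
```

Raw SQL sidesteps the ORM-state problem entirely, since it never touches the model definitions.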

Conclusion

In this article, we have seen some of the problems which occur during data migrations in Django applications and tips to alleviate them.

Profiling & Optimizing Bottlenecks In Django


In the previous article, we learnt where to start with performance optimization in a Django application and found out which APIs to optimize first. In this article, we will learn how to optimize those selected APIs.

Profiling APIs With django-silk

django-silk provides a silk_profile decorator which can be used to profile a selected view or a snippet of code. Let's take a slow view to profile and see the results.

from silk.profiling.profiler import silk_profile


@silk_profile()
def slow_api(request):
    time.sleep(2)
    return JsonResponse({'data': 'slow_response'})

We need to add the relevant silk settings to the Django settings so that the required profile data files are generated and stored in the specified location.

SILKY_PYTHON_PROFILER = True
SILKY_PYTHON_PROFILER_BINARY = True
SILKY_PYTHON_PROFILER_RESULT_PATH = '/tmp/'

Once the above view is loaded, we can see the profile information on silk's profiling page.

On the profile page, silk shows a profile graph and highlights the path where the most time is taken.

It also shows cProfile stats on the same page. This profile data file can be downloaded and used with other visualization tools like snakeviz.

By looking at the above data, we can see most of the time is spent in time.sleep in our view.

Profiling APIs With django-extensions

If you don't want to use silk, an alternative way to profile Django views is to use the runprofileserver command provided by the django-extensions package. Install django-extensions and then start the server with the following command.

$ ./manage.py runprofileserver --use-cprofile --nostatic --prof-path /tmp/prof/

This command starts runserver with profiling tools enabled. For each request made to the server, it will save a corresponding .prof profile data file in the /tmp/prof/ folder.

After the profile data is generated, we can use profile data viewing tools like snakeviz or cprofilev to visualize or browse the profile data.

Install snakeviz using pip.

$ pip install snakeviz

Open a profile data file using snakeviz.

$ snakeviz /tmp/prof/api.book.list.4212ms.1566922008.prof

It shows an icicle graph view and a table view of the profile data of that view.
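Besides GUI tools, the standard library's pstats module can read the same files. Here is a small sketch; the slow_view function and the /tmp/demo.prof path are stand-ins for illustration, and you can point pstats at any file written by runprofileserver instead.

```python
import cProfile
import pstats


# stand-in for a slow Django view
def slow_view():
    return sum(i * i for i in range(200000))


# runprofileserver writes files in the same format cProfile produces
cProfile.runctx("slow_view()", globals(), locals(), "/tmp/demo.prof")

stats = pstats.Stats("/tmp/demo.prof")
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```
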

These will help pinpoint which line of code is slowing down the view. Once it is identified, we can take appropriate action, like optimizing that code, setting up a cache, or moving it to a task queue if it need not be performed in the request-response cycle.

Versioning & Retrieving All Files From AWS S3 With Boto


Introduction

Amazon S3 (Amazon Simple Storage Service) is an object storage service offered by Amazon Web Services. For S3 buckets, if versioning is enabled, users can preserve, retrieve, and restore every version of every object stored in the bucket.

In this article, we will understand how to enable versioning for a bucket and retrieve all versions of an object from the AWS web interface as well as the Python boto3 library.

Versioning of Bucket

Bucket versioning can be changed with a toggle button from the AWS web console in the bucket properties.

We can do the same with the Python boto3 library.

import boto3


bucket_name = 'avilpage'

s3 = boto3.resource('s3')
versioning = s3.BucketVersioning(bucket_name)

# check status
print(versioning.status)

# enable versioning
versioning.enable()

# disable versioning
versioning.suspend()

Retrieving Objects

Once versioning is enabled, we can store multiple versions of an object by uploading an object multiple times with the same key.

We can write a simple script to generate a text file with random text and upload it to S3.

import random
import string

import boto3


bucket_name = 'avilpage'
file_name = 'test.txt'
key = file_name
s3 = boto3.client('s3')

with open(file_name, 'w') as fh:
    data = ''.join(random.choice(string.ascii_letters) for _ in range(10))
    fh.write(data)

# upload_file takes (Filename, Bucket, Key)
s3.upload_file(file_name, bucket_name, key)

If this script is executed multiple times, the same file gets overwritten, and each upload is stored with a different version id under the same key in the bucket.

We can see all the versions of the file in the bucket by selecting the file and then clicking the drop-down at Latest version.

We can retrieve and show the contents of all the versions of the test.txt file with the following script.

import boto3


bucket_name = 'avilpage'
s3_client = boto3.client('s3')

versions = s3_client.list_object_versions(Bucket=bucket_name)

for version in versions['Versions']:
    version_id = version['VersionId']
    file_key = version['Key']

    response = s3_client.get_object(
        Bucket=bucket_name,
        Key=file_key,
        VersionId=version_id,
    )
    data = response['Body'].read()
    print(data)

Conclusion

Object versioning is useful to protect data from unintended overwrites. In this article, we learnt how to change bucket versioning, upload multiple versions of the same file, and retrieve all versions of a file using the AWS web console as well as boto3.

Why My Grandma Can Recall 100+ Phone Numbers, But We Can't


On a leisurely evening, as I was chit-chatting with my grandma, my phone started ringing. Someone who is not in my contacts was calling me. As I was wondering who on earth was calling, my grandma just glanced at my screen and said, "It's your uncle Somu, pick up the phone". I was dumbstruck.

Later that evening, I asked my grandma to recall the phone numbers she remembers. She recalled 30+ phone numbers, and she was able to recognize 100+ phone numbers based on the last 4 digits alone.

That came as a surprise to me, as I don't even remember 10 phone numbers now. Most smartphone users don't remember family and friends' phone numbers anymore.

A decade back, I used to remember most of my relatives' and friends' phone numbers even though I didn't have a phone. My grandma used a mini notebook to write down all the phone numbers. I was worried about this mini notebook, as it could get lost easily, and it was always hard to find when required.

Since my grandma doesn't have any contacts saved in her phone, she gets a glimpse of the number every time someone calls her. She also dials the number every time she has to call someone. With this habit, she is able to memorize all the numbers.

I, on the other hand, started using a smartphone which has all my contacts. I search my contacts by name when I have to dial someone, and there is no need to dial the number. Also, whenever someone calls me, their name gets displayed in large letters, and I never get to focus on the number. Due to this, I don't remember any of the phone numbers.

After this revelation, I started an experiment by disabling contact permissions for the dialer app. With this, I am forced to type the number or select the appropriate number from the call history and dial it. This was a bit uncomfortable at first. Soon I got used to it as I recognized more and more numbers.

This might seem unnecessary in the smartphone age. But when you are traveling, or when your phone gets switched off, it's hard to contact people. Even if someone lends you their phone, it is of no use if you don't remember any numbers.

Also, it is important to remember the phone numbers of family and friends, which might be needed in case of emergencies.