Most Indian languages have a strong consonant-vowel structure in which consonants and vowels combine to form syllables. These syllables are written as one continuous ligature, and they require complex text layout (CTL) for typesetting.
Writing OCR (Optical Character Recognition) software for CTL scripts is challenging because segmenting the text into individual characters is hard. As a result, overall accuracy drops drastically.
A better approach is to use Connectionist Temporal Classification (CTC) [1], which can label an unsegmented sequence directly because it does not need a pre-segmented, one-to-one alignment between input samples and output labels.
Here is a sample input and output of an RNN-CTC network, which takes an unsegmented sequence and outputs labels.
The open source OCR software ocropy uses BLSTM-CTC for text recognition. Tesseract started using the same approach in version 4.0.
I have trained a model to recognize Telugu script using ocropy, and its accuracy is ~99%, far better than OCR software without CTC, which is accurate to ~70%.
In the registry pattern, a registry maintains a global association from keys to objects, so that objects can be reached from anywhere by a simple identifier. This is useful for doing reverse lookups.
When building a registry, programmers have to explicitly register each object with it. Building a registry manually is error-prone, and it gets tedious when there are many objects to register. It is better to register objects automatically where possible.
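One way to auto-register, sketched here under a few assumptions: the things being registered are classes, and each declares its own key. The __init_subclass__ hook (Python 3.6+) then does the bookkeeping. The names REGISTRY, Handler, key and lookup are all made up for illustration.

```python
# Hypothetical names throughout: REGISTRY, Handler, `key` and lookup
# are illustrative, not taken from any library.
REGISTRY = {}


class Handler:
    """Base class whose subclasses register themselves when defined."""

    key = None  # each subclass sets a unique identifier

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.key is None:
            raise TypeError(f"{cls.__name__} must define a key")
        if cls.key in REGISTRY:
            raise ValueError(f"duplicate registry key: {cls.key!r}")
        REGISTRY[cls.key] = cls


class JsonHandler(Handler):
    key = "json"


class CsvHandler(Handler):
    key = "csv"


def lookup(key):
    """Reverse lookup: identifier -> registered class."""
    return REGISTRY[key]
```

Defining a new subclass is enough to make it reachable through lookup, so there is no separate registration step to forget.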
During development, the code base keeps changing, and manually restarting the celery worker every time is tedious. It would be handy if workers were reloaded automatically whenever the code changes.
Watchdog provides a Python API and shell utilities to monitor file system events. We can install it with
pip install watchdog
Watchdog provides watchmedo, a shell utility to perform actions based on file events. Its auto-restart subcommand starts a long-running subprocess and restarts it when files change. So, celery workers can be restarted automatically with it.
watchmedo auto-restart -- celery worker -l info -A foo
By default, it watches all files in the current directory. This can be changed by passing the directory (-d) and file pattern (-p) options.
watchmedo auto-restart -d . -p '*.py' -- celery worker -l info -A foo
If you are using Django and don't want to depend on watchdog, there is a simple trick to achieve this. Django has an autoreload utility, which runserver uses to restart the WSGI server when code changes.
The same functionality can be used to reload celery workers. Create a separate management command called celery. Write a function that kills the existing worker and starts a new one. Now hook this function into autoreload as follows.
import shlex
import subprocess

from django.core.management.base import BaseCommand
from django.utils import autoreload


def restart_celery():
    cmd = 'pkill -9 celery'
    subprocess.call(shlex.split(cmd))
    cmd = 'celery worker -l info -A foo'
    subprocess.call(shlex.split(cmd))


class Command(BaseCommand):

    def handle(self, *args, **options):
        print('Starting celery worker with autoreload...')
        autoreload.main(restart_celery)
Now you can run the celery worker with python manage.py celery, which starts a celery worker and reloads it whenever the code base changes.
HTTPie is an alternative to curl for interacting with web services from the CLI. It provides a simple and intuitive interface, and it is handy for sending arbitrary HTTP requests while testing or debugging APIs.
When working with web applications that require authentication, using httpie is difficult because the authentication mechanism differs from one application to another. httpie has built-in support for basic & digest authentication.
To authenticate with Django apps, a user needs to make a GET request to the login page. Django sends back a login form with a CSRF token. The user can submit this form with valid credentials, and a session will be initiated.
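To make the first half of that flow concrete, here is a minimal sketch of pulling the CSRF token out of a login form using only the standard library. Django names the hidden field csrfmiddlewaretoken. The sample markup, the token value and the helper names (CsrfTokenParser, extract_csrf_token) are made up for illustration, and this is not how any particular plugin implements it.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the value of the hidden csrfmiddlewaretoken input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrfmiddlewaretoken":
            self.token = attributes.get("value")


def extract_csrf_token(html):
    """Return the CSRF token found in the page, or None if there is none."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


# Made-up markup standing in for the response to the GET on the login page.
SAMPLE_LOGIN_PAGE = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

token = extract_csrf_token(SAMPLE_LOGIN_PAGE)
# The login POST would then send username, password and this token,
# together with the cookies received from the GET request.
```

With a real app, the HTML would come from the GET response, and the same HTTP session has to be reused for the POST so that the CSRF cookie matches the token.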
Establishing a session manually is boring, and it gets tedious when working with multiple apps in multiple environments (development, staging, production).
I have written a plugin called httpie-django-auth which automates Django authentication. It can be installed with pip
pip install httpie-django-auth
By default, it uses /admin/login to log in. If you need to use some other URL for logging in, set the HTTPIE_DJANGO_AUTH_URL environment variable.
Now you can send authenticated requests to any URL as
When working on multiple projects, it becomes necessary to use virtualenvs so that multiple versions of the same package can be used. In addition, it may be necessary to set environment variables on a per-project basis.
To automate all of this, autoenv provides directory-based environments. Whenever the user changes directory, it can automatically activate the environment and set environment variables.
If a directory contains a file named .env, autoenv will automatically source that file whenever the user enters the directory.
autoenv is a Python package. It can be installed with
pip install autoenv
It provides a shell script which needs to be sourced.
echo "source `which activate.sh`" >> ~/.bashrc
You can create a .env file like this in the project root.
Setting up a new laptop manually takes a lot of time, and there is a good chance of forgetting tweaks made to configuration files. It is a good idea to automate the setup with a shell script or a configuration management tool like Ansible. Automation also makes it easy to sync configuration across multiple systems.
Ansible is lightweight and provides only a thin layer of abstraction. It connects to hosts via SSH and pushes changes. So, there is no need to set up anything on remote hosts.
Writing A Playbook
You should check out the Ansible documentation to get familiar with Ansible and writing playbooks. Ansible uses the YAML format for playbooks, which is human readable. Here is a simple playbook to install Redis on an Ubuntu server.
BMTC (Bengaluru Metropolitan Transport Corporation) is a government agency which operates the public bus service in Bangalore, India. It holds a monopoly, as it is the only choice for public transportation.
Let's say you want to travel from Indira Nagar to BTM Layout. The trip covers 5 stages, so BMTC charges 19₹ for it.
If you travel from Indira Nagar KFC signal to Doopanahalli Arch (1.2 km), which comes under 1 stage, you have to pay 5₹. On the other hand, if you travel from Indira Nagar KFC signal to Doopanahalli bus stop (1.4 km), you have to pay 12₹.
How can BMTC charge 5₹ for the first 1.2 km and 7₹ for the subsequent 0.2 km? If BMTC charges 5₹ for the 1st stage, then it should charge 5₹ plus something extra for the next stage. But the total shouldn't be more than 10₹.
You can just take 1 ticket (5₹) for the 1st stage (1.2 km) and one more ticket (5₹) for the next stage (0.2 km). That way, you can travel 2 stages with 2 tickets for 10₹.
It turns out BMTC is charging 3₹ extra on every ticket which covers at least 2 stages. As lakhs of people travel in BMTC buses daily, in a month this 3₹ turns into crores of rupees.
I am not sure when BMTC started charging like this. A month back, I sent them an email asking for an explanation of the unfair bus fares, and they haven't replied yet.
In Python 3, super can be invoked without arguments, and it will choose the right class and instance automatically.
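A small illustration of the two spellings side by side. The class names here are made up, and both subclasses end up calling the same parent method.

```python
class Base:
    def greet(self):
        return "hello from Base"


class OldStyle(Base):
    def greet(self):
        # Explicit form: the class and the instance are passed by hand.
        return super(OldStyle, self).greet() + " via OldStyle"


class NewStyle(Base):
    def greet(self):
        # Python 3 form: class and instance are picked up automatically.
        return super().greet() + " via NewStyle"
```

The zero-argument form also survives a class rename, since the class name is no longer repeated inside the method body.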
For this refactoring, a simple sed search/replace should suffice. But there are several hacks in the code base where super calls the grandparent instead of the parent, and sed won't work in such cases. It is hard to refactor them manually and much harder for reviewers, as there are 1364 super calls in the code base.
→ grep -rI "super(" | wc -l
So the changes have to be scripted. A simple Python script to replace super calls by class names will fail to capture classes with on top of them, classes with decorators and nested classes.
To handle all these cases, the Python script gets more complicated, and there is no guarantee that it can handle all edge cases. So, a better choice is to use syntax trees.
Python has an ast module to convert code to an AST. Before Python 3.9 added ast.unparse, it could not convert an AST back to code, so third-party packages like astor were needed for that.
# this is a comment
def foo():
    print("hello world")
Converting the above code to an AST and then converting it back gives this
Converting code to an AST is a lossy transformation, as an AST cannot preserve empty lines, comments and code formatting.
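The loss is easy to check with a round trip. This sketch uses ast.unparse from the standard library (Python 3.9+) in place of astor, so the exact output formatting may differ from what astor produces.

```python
import ast

source = '''# this is a comment
def foo():
    print("hello world")
'''

tree = ast.parse(source)
# ast.unparse is available from Python 3.9 onward.
round_tripped = ast.unparse(tree)

# The function survives the round trip, the comment does not.
comment_survived = "# this is a comment" in round_tripped
```

Here comment_survived ends up False, because comments never make it into the tree in the first place.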
The RedBaron package provides an FST (full syntax tree) for a given piece of code, which keeps comments and formatting intact. With this, just locate super calls, find the nearest class node, check the class name against the one passed to super and replace accordingly. With RedBaron, this refactoring can be done in fewer than 10 lines of code.
RedBaron has good documentation with relevant examples, and its API is similar to BeautifulSoup. For writing code that modifies code, RedBaron seems to be an apt choice.
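The detection half of that idea can be sketched with the standard library ast module instead of RedBaron. It only finds the calls and does not rewrite anything, since rewriting through ast would lose comments and formatting. SuperCallFinder and the sample classes are made up for illustration.

```python
import ast


class SuperCallFinder(ast.NodeVisitor):
    """Records (line, enclosing class, class named in super) for every
    super call that passes a class name explicitly."""

    def __init__(self):
        self.class_stack = []
        self.calls = []

    def visit_ClassDef(self, node):
        # Track the nearest enclosing class, which also handles nesting.
        self.class_stack.append(node.name)
        self.generic_visit(node)
        self.class_stack.pop()

    def visit_Call(self, node):
        is_super = isinstance(node.func, ast.Name) and node.func.id == "super"
        if is_super and node.args and isinstance(node.args[0], ast.Name):
            enclosing = self.class_stack[-1] if self.class_stack else None
            self.calls.append((node.lineno, enclosing, node.args[0].id))
        self.generic_visit(node)


SAMPLE = '''
class A:
    def run(self):
        return "A"

class B(A):
    def run(self):
        return super(B, self).run()

class C(B):
    def run(self):
        return super(B, self).run()  # skips B: the grandparent hack
'''

finder = SuperCallFinder()
finder.visit(ast.parse(SAMPLE))

# Calls naming their own class can become super(); the rest need review.
safe = [call for call in finder.calls if call[1] == call[2]]
hacks = [call for call in finder.calls if call[1] != call[2]]
```

Only the calls in safe are candidates for the argument-free form. The ones in hacks deliberately name a different class, so replacing them would change behaviour.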
Publishing blog posts is lightning fast (how fast can you type?). You can publish a post with a click after writing it. With scientific papers, however, you have to wait months to get your work published.
Blog posts are easy to edit. If you are using a version control system like git, anyone can easily collaborate. You can also leave comments on (most) blog posts and have a discussion with the author and/or others. Scientific papers, once published, are hard to change.
To publish blog posts, you can set up your own blog or use services like Medium. There are no gatekeepers who say that you can't publish a blog post because they think it's not worth it.
Blog posts tend to be casual and organic. You can crack a joke and readers will enjoy it. Scientific papers are formal. Spray some sarcasm and your paper never gets published.