Mastering Kraken2 - Part 3 - Build Custom Database
Mastering Kraken2
Part 2 - Classification Performance Optimisation
Part 3 - Build custom database indices (this post)
Part 4 - Build FDA-ARGOS index
Part 5 - Regular vs Fast Builds (upcoming)
Part 6 - Benchmarking (upcoming)
Introduction
In the previous post, we learned how to improve kraken2 classification performance. So far, we have downloaded and used pre-built genome indices (databases).
In this post, let's build a custom database for kraken2. For simplicity, we'll use only the RefSeq archaea genomes to build the index.
Building Custom Database
First, we need to download the taxonomy files. We can use the k2 script provided by kraken2.
$ k2 download-taxonomy --db custom_db
This takes ~30 minutes depending on the network speed. The taxonomy files are downloaded to the custom_db/taxonomy directory.
$ ls custom_db/taxonomy
citations.dmp  division.dmp  gencode.dmp  merged.dmp  nodes.dmp  nucl_wgs.accession2taxid
delnodes.dmp   gc.prt        images.dmp   names.dmp   nucl_gb.accession2taxid  readme.txt
$ du -hs custom_db/taxonomy
43G custom_db/taxonomy
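Most of that 43G comes from the accession-to-taxid map files. A quick check with plain du (nothing kraken2-specific) shows where the space goes:

# Show the largest files in the taxonomy directory
$ du -h custom_db/taxonomy/* | sort -rh | head -3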
Next, let's download the archaea RefSeq genomes. We can use the k2 download-library command:
$ k2 download-library --library archaea --db custom_db
This runs on a single thread. Instead of kraken2-build, we can use the ncbi-genome-download tool to download the genomes. It provides much more granular control over the download process; for example, we can download only --assembly-levels complete genomes, and we can download multiple genomes in parallel.
$ pip install ncbi-genome-download
$ conda install -c bioconda ncbi-genome-download
$ ncbi-genome-download -s refseq -F fasta --parallel 40 -P archaea
Checking assemblies: 100%|███| 2184/2184 [00:19<00:00, 111.60entries/s]
Downloading assemblies: 100%|███| 2184/2184 [02:04<00:00, 4.54s/files]
Downloading assemblies: 2184files [02:23, 2184files/s]
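For instance, if we only wanted complete assemblies, we could add the --assembly-levels flag; a minimal sketch of that variant:

# Sketch: restrict the download to complete assemblies only
$ ncbi-genome-download -s refseq -F fasta --assembly-levels complete --parallel 40 -P archaea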
In just 2 minutes, it downloaded all the files. Let's gunzip them:
$ find refseq -name "*.gz" -print0 | parallel -0 gunzip
$ du -hs refseq
5.9G refseq
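If GNU parallel isn't available, xargs can do the same fan-out; a sketch assuming GNU xargs, where -P sets the number of concurrent gunzip processes:

# Decompress in parallel with xargs instead of GNU parallel
$ find refseq -name "*.gz" -print0 | xargs -0 -n 32 -P 8 gunzip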
Let's add all the FASTA genome files to the custom database:
$ time find refseq -name "*.fna" -exec kraken2-build --add-to-library {} --db custom_db \;
667.46s user 90.78s system 106% cpu 12:54.80 total
kraken2-build doesn't use multiple threads for adding genomes to the database. It also doesn't check whether a genome is already present in the database.
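Since nothing deduplicates for us, it's worth scanning the downloaded FASTAs for repeated sequence IDs before adding them; a quick sketch using only coreutils (nothing kraken2-specific):

# List sequence IDs that appear more than once across all FASTA files
$ find refseq -name "*.fna" -exec grep -h "^>" {} + | awk '{print $1}' | sort | uniq -d | head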
Let's use k2 for adding genomes to the database.
$ export KRAKEN_NUM_THREADS=40
$ find . -name "*.fna" -exec k2 add-to-library --files {} --db custom_db \;
668.37s user 88.44s system 159% cpu 7:54.40 total
This took only half the time compared to kraken2-build.
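We could likely trim this further by batching many files into each k2 invocation using find's -exec ... + form. This is a sketch, assuming --files accepts multiple paths (its plural name suggests it does, but check k2 add-to-library --help first):

# Sketch: pass many FASTA files per k2 process instead of one each
$ find refseq -name "*.fna" -exec k2 add-to-library --db custom_db --files {} +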
Let's build the index from the library.
$ time kraken2-build --db custom_db --build --threads 36
Creating sequence ID to taxonomy ID map (step 1)...
Found 0/125783 targets, searched through 60000000 accession IDs...
Found 59923/125783 targets, searched through 822105735 accession IDs, search complete.
lookup_accession_numbers: 65860/125783 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [2m1.950s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 5340021028 bytes
Capacity estimation complete. [23.875s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 59911 sequences, 3572145823 bp
Writing data to disk... complete.
Database files completed. [12m3.368s]
Database construction complete. [Total: 14m29.666s]
kraken2-build --db custom_db --build --threads 36  24534.98s user 90.50s system 2831% cpu 14:29.75 total

$ ls -ll
.rw-rw-r-- 5.3G anand  1 Aug 16:35 hash.k2d
drwxrwxr-x    - anand  1 Aug 12:32 library
.rw-rw-r--   64 anand  1 Aug 16:35 opts.k2d
.rw-rw-r-- 1.5M anand  1 Aug 16:22 seqid2taxid.map
.rw-rw-r-- 115k anand  1 Aug 16:23 taxo.k2d
lrwxrwxrwx   20 anand  1 Aug 12:31 taxonomy
.rw-rw-r-- 1.2M anand  1 Aug 16:22 unmapped.txt
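Note that the lookup step left 65860 accessions unmapped. Before trusting the index, it's worth a quick look at the unmapped.txt file the log points to:

# Count and preview the accessions that got no taxid mapping
$ wc -l custom_db/unmapped.txt
$ head -3 custom_db/unmapped.txt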
We were able to build an index from ~6 GB of input files in ~15 minutes.
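As a smoke test, we can classify some reads against the new index; a sketch where reads.fq is a placeholder for your own input:

# Hypothetical smoke test against the freshly built index
$ kraken2 --db custom_db --threads 36 --report report.txt reads.fq > classified.out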
Conclusion
We learnt some useful tips to speed up the custom database creation process. In the next post, we will learn about regular vs. fast builds.
Need further help with this? Feel free to send a message.