Mastering Kraken2 - Part 4 - Build FDA-ARGOS Index

Mastering Kraken2

Part 1 - Initial Runs

Part 2 - Classification Performance Optimisation

Part 3 - Build custom database indices

Part 4 - Build FDA-ARGOS index (this post)

Part 5 - Regular vs Fast Builds (upcoming)

Part 6 - Benchmarking (upcoming)

Introduction

In the previous post, we learnt how to build a custom index for Kraken2.

FDA-ARGOS1 is a popular database with quality reference genomes for diagnostic usage. Let's build an index for FDA-ARGOS.

FDA-ARGOS Kraken2 Index

FDA-ARGOS db is available at NCBI2 from which we can download the assembly file.

FDA-ARGOS NCBI

We can extract accession numbers from the assembly file and then download the genomes from these accession ids.

$ grep -e "^#" -v PRJNA231221_AssemblyDetails.txt | cut -d$'\t' -f1 > accessions.txt

$ wc accessions.txt
 1428  1428 22848 accessions.txt

$ ncbi-genome-download --section genbank --assembly-accessions accessions.txt --progress-bar bacteria --parallel 40

It took ~8 minutes to download all the genomes, and the downloaded file size is ~4GB.

We can use kraken-db-builder3 tool to build index from these genbank genome files.

# kraken-db-builder needs this to convert gbff to fasta format
$ conda install -c bioconda any2fasta

$ kraken-db-builder --genomes-dir genbank --threads 36 --db-name k2_argos

It took ~30 minutes to build the index.

Conclusion

We have built a Kraken2 index for the FDA-ARGOS database on 2024-Aug-24.

In the next post, we will look at the differences between regular and fast builds.


Need further help with this? I am available for hire.