Metagenomics Machine Learning

Introduction

This is the reference material for the "Machine Learning in Metagenomics" workshop. By the end of the workshop, you will be able to classify metagenomic samples using Kraken2 and predict the source organism of a sample with simple ML methods.

Environment Setup

We will use Python, Jupyter Notebook, and command-line tools in this workshop.

If you are using Windows, install WSL2 and use Ubuntu. If you are using a Mac, use the built-in Terminal.

Running Kraken2

As a first step, let's generate an abundance report using Kraken2. Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to short DNA sequences.

Install Anaconda

$ wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
$ bash Anaconda3-2024.06-1-Linux-x86_64.sh

Install Kraken2

$ conda install -c bioconda kraken2

Download pre-built viral database

$ wget -c https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz
$ mkdir -p k2_viral
$ tar -xzvf k2_viral_20240605.tar.gz -C k2_viral
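
Before classifying, it can help to confirm that the archive unpacked correctly. A built Kraken2 database directory is expected to contain hash.k2d, opts.k2d, and taxo.k2d. The sketch below checks for those files; the function name missing_db_files is made up for this example, and the k2_viral path is the directory created above.

```python
from pathlib import Path

# Core files that make up a built Kraken2 database.
REQUIRED_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_db_files(db_dir):
    """Return the required Kraken2 database files not found in db_dir."""
    db_path = Path(db_dir)
    return [name for name in REQUIRED_FILES if not (db_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_db_files("k2_viral")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Database looks complete.")
```

If any file is reported missing, re-running the wget and tar commands above is usually the first thing to try.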

Download sample fastq files

$ wget -c https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/077/ERR10359977/ERR10359977.fastq.gz

Kraken2 classification

$ kraken2 --db k2_viral ERR10359977.fastq.gz --report k2_report.txt --output k2_output.txt
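
The report file is what we will feed into later steps, so it is worth seeing how to read it from Python. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, taxon reads, rank code, taxonomy ID, indented name); reports written with extra minimizer columns would need a different column mapping. The function names are hypothetical helpers, not part of Kraken2.

```python
def read_kraken2_report(path):
    """Parse a standard 6-column Kraken2 report into a list of dicts."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            rows.append({
                "percent": float(fields[0]),
                "clade_reads": int(fields[1]),
                "taxon_reads": int(fields[2]),
                "rank": fields[3],
                "taxid": int(fields[4]),
                "name": fields[5].strip(),  # names are indented by tree depth
            })
    return rows


def species_counts(rows):
    """Map each species-level taxonomy ID to its clade read count."""
    return {r["taxid"]: r["clade_reads"] for r in rows if r["rank"] == "S"}
```

For example, species_counts(read_kraken2_report("k2_report.txt")) gives a taxonomy ID to read count mapping that can serve as one sample's abundance profile.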

Generate Krona plot

Install Krona & update taxonomy

$ conda install -c bioconda krona
$ ktUpdateTaxonomy.sh

Generate Krona plot

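The Kraken2 report stores the taxonomy ID in column 5 and the directly assigned read count in column 3, so we pass those column numbers to Krona.

$ ktImportTaxonomy -t 5 -m 3 k2_report.txt -o k2_report.html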

Building custom database

Download taxonomy

$ kraken2-build --download-taxonomy --db custom_db

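Download required genomes

This step uses ncbi-genome-download. If it is not installed yet, install it with pip first (the pip command appears in the next section).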

$ ncbi-genome-download --format fasta --section refseq --assembly-level complete fungi -v

Build Kraken2 database

$ find refseq -name "*.gz" -print0 | parallel -0 gunzip

$ find refseq -name "*.fna" -exec kraken2-build --add-to-library {} --db custom_db \;

$ kraken2-build --db custom_db --build --threads 36

Using kraken-db-builder to build database

$ pip install ncbi-genome-download kraken-db-builder

$ kraken-db-builder --db-type fungi

Predicting Source of Metagenomic Sample

Geometric Mean of Pairwise Ratios (GMPR)
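
GMPR is a normalization method for sparse count data. For each pair of samples it takes the median of the count ratios over taxa that are non-zero in both samples, and then summarizes those medians per sample with a geometric mean to get a size factor. The sketch below is a simplified illustration, not the reference implementation: it takes the geometric mean over all samples including the sample itself (ratio 1), and it applies no minimum number of shared taxa. Both choices are assumptions made for this example, and the function name is hypothetical.

```python
import math
from statistics import median


def gmpr_size_factors(counts):
    """Compute GMPR-style size factors.

    counts: list of samples, each a list of taxon counts in the same taxon order.
    Returns one size factor per sample, or None for a sample with no usable ratios.
    """
    factors = []
    for sample_i in counts:
        log_medians = []
        for sample_j in counts:
            # Ratios over taxa observed (non-zero) in both samples.
            ratios = [ci / cj for ci, cj in zip(sample_i, sample_j) if ci > 0 and cj > 0]
            if ratios:
                log_medians.append(math.log(median(ratios)))
        if log_medians:
            # Geometric mean of the per-pair median ratios.
            factors.append(math.exp(sum(log_medians) / len(log_medians)))
        else:
            factors.append(None)
    return factors


if __name__ == "__main__":
    toy_counts = [[10, 20, 0, 40], [5, 10, 5, 0]]  # made-up counts: 2 samples x 4 taxa
    print(gmpr_size_factors(toy_counts))
```

Dividing each sample's counts by its size factor gives normalized abundances that are comparable across samples with different sequencing depths.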

Bray-Curtis dissimilarity
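
Bray-Curtis dissimilarity compares two abundance profiles: it is the sum of absolute differences between the two vectors divided by the sum of all their entries, so it is 0 for identical profiles and 1 for profiles that share no taxa. Here is a minimal pure-Python sketch; the function names are made up for this example, and returning 0.0 for two empty samples is a convention chosen here.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities for a list of samples."""
    return [[bray_curtis(a, b) for b in samples] for a in samples]
```

The resulting square matrix is the kind of input that distance-based methods such as t-SNE or k-NN can work with.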

t-distributed Stochastic Neighbor Embedding (t-SNE)
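
t-SNE places samples in a low-dimensional space so that samples that are close in the original distance matrix tend to stay close in the embedding. The sketch below assumes NumPy, SciPy, and scikit-learn are installed, and embeds a few made-up samples from their pairwise Bray-Curtis distances. The function name, the toy data, and the parameter values are illustrative assumptions, not the settings used in the workshop notebooks.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE


def embed_samples(abundances, perplexity=3.0, seed=0):
    """Embed samples (rows) into 2D from pairwise Bray-Curtis distances."""
    data = np.asarray(abundances, dtype=float)
    distances = squareform(pdist(data, metric="braycurtis"))
    tsne = TSNE(
        n_components=2,
        metric="precomputed",   # we pass a distance matrix, not raw features
        init="random",          # PCA init cannot be used with precomputed distances
        perplexity=perplexity,  # must be smaller than the number of samples
        random_state=seed,
    )
    return tsne.fit_transform(distances)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.integers(1, 50, size=(10, 6))  # 10 made-up samples x 6 taxa
    print(embed_samples(toy).shape)
```

Keep in mind that t-SNE coordinates depend on the random seed and perplexity, so the embedding is best read for neighborhoods rather than for absolute positions.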

k-Nearest Neighbors (k-NN)
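
k-NN predicts the label of a new sample from the labels of its k closest labeled samples. The sketch below is a bare-bones majority-vote version on 2D points (for example, embedding coordinates) using Euclidean distance. The function name, the points, and the "dog" and "human" labels are invented for illustration; real tools may weight votes by distance or use other refinements.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for query by majority vote among its k nearest training points.

    Returns the winning label and the fraction of the k votes it received.
    """
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]  # made-up coordinates
    labels = ["dog", "dog", "dog", "human", "human", "human"]  # made-up labels
    print(knn_predict(points, labels, (0.2, 0.1), k=3))
```

The returned vote fraction gives a rough sense of how confident the prediction is: a value near 1 means nearly all neighbors agree.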

Download sample data

$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv

Install sourcepredict

$ python -m pip install git+https://github.com/AvilPage/sourcepredict

Machine Learning Notebooks

https://github.com/ChillarAnand/avilpage.com/tree/master/mg_workshop

Useful links:

https://benlangmead.github.io/aws-indexes/k2

https://jszym.com/blog/dna_protein_complexity/

https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html

https://github.com/maxibor/sourcepredict