This repository was created for the PSU CSE 566 final project.
Assessing some recently published metagenomic profilers for viral and fungal samples
Many metagenomic profiling tools (e.g. MetaBinG2, MiCoP and Metalign) have been published in the last few years. Each claims to be the best among existing metagenomic profilers for a specific kind of analysis. However, most of them were compared only on identifying bacterial communities, so the reported results may be biased against viruses and fungi. To better guide the choice among these tools for analyzing viral or fungal samples, an independent and comprehensive comparison is needed.
In this evaluation, two more metagenomic profilers (DIAMOND+MEGAN and MetaPhlAn3) are included in addition to the three mentioned above. DIAMOND+MEGAN is included because it showed reasonable accuracy in the comparisons reported by the Metalign authors, and MetaPhlAn3 because it is the new version of MetaPhlAn and has not yet been compared with other profilers.
To keep the comparison unbiased and to show the features of each tool, I first took the viral and fungal databases from MiCoP and rebuilt the databases for Metalign and MetaBinG2 from them. I could not build corresponding databases for MetaPhlAn3, because its database is based on marker genes, which are difficult to build in a short time, or for DIAMOND, which is designed mainly for the NCBI database. In addition, since in practice a database cannot be designed for a given metagenomic sample, I also compared the tools using their own default databases. Two kinds of data with known abundances were used: simulated data (generated with the CAMISIM software) and mock community data (real data for which the candidate microbes and their approximate proportions are known). The profilers are assessed in the following five parts:
- Accuracy of taxon identification in different ranks (Phylum, Class, Order, Family, Genus and Species)
- Accuracy of abundance estimation
- Robustness under the influence of unknown organisms
- Speed
- Memory Usage
Parts 1 and 2 were evaluated with OPAL, the profiling assessment software from the CAMI competition.
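To make the first two criteria concrete, the following is a minimal sketch of how taxon identification accuracy (precision, recall, F1) and abundance accuracy (L1 norm error) could be computed at a single rank. It is an illustration only and not OPAL's implementation; the function names, the dictionary-based profile format, and the example taxa and abundances are all assumptions made up for this sketch.

```python
from typing import Dict, Tuple


def presence_metrics(truth: Dict[str, float],
                     predicted: Dict[str, float]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of taxon detection at one rank.

    Profiles are hypothetical {taxon_name: abundance} dictionaries;
    a taxon counts as present when its abundance is above zero.
    """
    true_taxa = {t for t, a in truth.items() if a > 0}
    pred_taxa = {t for t, a in predicted.items() if a > 0}
    tp = len(true_taxa & pred_taxa)
    precision = tp / len(pred_taxa) if pred_taxa else 0.0
    recall = tp / len(true_taxa) if true_taxa else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def l1_error(truth: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Sum of absolute differences between normalized abundances (range 0 to 2)."""
    def normalize(profile: Dict[str, float]) -> Dict[str, float]:
        total = sum(profile.values())
        return {t: a / total for t, a in profile.items()} if total > 0 else {}

    t, p = normalize(truth), normalize(predicted)
    return sum(abs(t.get(k, 0.0) - p.get(k, 0.0)) for k in set(t) | set(p))


# Made-up genus-level example: one false negative and one false positive.
truth = {"Aspergillus": 50, "Candida": 30, "Saccharomyces": 20}
predicted = {"Aspergillus": 60, "Candida": 20, "Fusarium": 20}
precision, recall, f1 = presence_metrics(truth, predicted)
error = l1_error(truth, predicted)
```

In this made-up example two of the three predicted genera are correct and two of the three true genera are recovered, so precision and recall are both 2/3. The L1 error grows with both misestimated abundances and with missed or spurious taxa.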
This project can be executed by running the shell commands in the bash script main_run.sh. Because CAMISIM requires Python 2.7 while the other software runs with Python 3.7, the script cannot be run in one go. Instead, follow the instructions in it and run it step by step.
All result files are stored in data/CAMI_OPAL/results. The log files for checking speed and memory usage can be found in data/run_data.
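As an illustration of how speed and memory usage might be read from such logs, here is a minimal sketch that extracts wall-clock time and peak memory from text in the format printed by GNU `time -v`. That the logs in data/run_data use this format is an assumption; the function name and the sample text are hypothetical.

```python
from typing import Optional, Tuple


def parse_time_log(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (elapsed_seconds, max_rss_kbytes) from GNU `time -v` style text.

    Assumption: the log contains the "Elapsed (wall clock) time" and
    "Maximum resident set size" lines that GNU time prints with -v.
    A value is None when its line is missing.
    """
    elapsed = None
    max_rss_kb = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Elapsed (wall clock) time"):
            value = line.rsplit(": ", 1)[1]  # "h:mm:ss" or "m:ss.ss"
            seconds = 0.0
            for part in value.split(":"):
                seconds = seconds * 60 + float(part)
            elapsed = seconds
        elif line.startswith("Maximum resident set size"):
            max_rss_kb = int(line.rsplit(": ", 1)[1])
    return elapsed, max_rss_kb


# Hypothetical log excerpt, not taken from this repository.
sample = (
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 1:02:03\n"
    "\tMaximum resident set size (kbytes): 2048000\n"
)
elapsed_seconds, peak_kb = parse_time_log(sample)
```

If the logs were produced by a different tool, only the two line prefixes and the value parsing would need to change.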