Association studies have linked microbiome alterations with many human diseases. However, they have not always reported consistent results, thereby necessitating cross-study comparisons. Here, a meta-analysis of eight geographically and technically diverse fecal shotgun metagenomic studies of colorectal cancer (CRC, n = 768), which was controlled for several confounders, identified a core set of 29 species significantly enriched in CRC metagenomes (false discovery rate (FDR) < 1 × 10−5). CRC signatures derived from single studies maintained their accuracy in other studies. By training on multiple studies, we improved detection accuracy and disease specificity for CRC. Functional analysis of CRC metagenomes revealed enriched protein and mucin catabolism genes and depleted carbohydrate degradation genes. Moreover, we inferred elevated production of secondary bile acids from CRC metagenomes, suggesting a metabolic link between cancer-associated gut microbes and a fat- and meat-rich diet. Through extensive validations, this meta-analysis firmly establishes globally generalizable, predictive taxonomic and functional microbiome CRC signatures as a basis for future diagnostics.
Please note:
The data are at the moment only available from within the EMBL intranet, but will be moved to Zenodo or another repository soon.
-
This project requires R version 3.5.1 -- "Feather Spray"
-
Please make sure that you have all required packages installed. You can easily check whether that is the case by running
Rscript requirements.R
and confirm that it does not produce any errors.
-
This project relies heavily on the R package
SIAMCAT
.
For the case that you are using an older version of the package, it will undoubtedly throw some errors. Thus, please make sure that you have the version 1.1.0 installed. You can install the correct version ofSIAMCAT
via the gitlab repository. Download the tar-ball of the package and then install the package by typing in R:install.packages("SIAMCAT-v1.1.0.tar", repos = NULL, type="source")
- you may have to change the path to the path of the downloaded file
- in Windows, you will need to have the package rtools installed > - i guess... I have not tested this on a Windows system yet
-
You will need the GMMs tools to calculate the abundance of gut metabolic modules. In this project, I used the version fcc7ac1 of the GMMs tool.
To reproduce the main figures of the project, you can follow these instructions.
First, you will have to download the taxonomic and functional profiles and the metadata needed for this project. They are stored on the ftp server of the Zeller team and will be automatically downloaded and cleaned by running this script.
cd ./src
Rscript prepare_data.R
In order to run the association testing pipeline, please use the
marker_analysis.R
script. The script computes the generalised fold change
(gFC), area under the Receiver-Operating-Characteristics curve (AUROC), and
the significance of single markers using a blocked Wilcoxon test. Additionally,
the script will run ANCOM
for taxonomic profiles. The marker analysis can be
run for different sets of features, e.g. KEGG
functional profiles and
species-level taxonomic profiles.
Rscript marker_analysis.R KEGG
Rscript marker_analysis.R species
The marker heatmap and the forest plot can be generated by calling this script:
Rscript figure_marker_heatmap.R species
Note; this script may break if you use it for other features than species, e.g. KEGG functional profiles
in this case, you may need to manually tweak the script to get satisfying results
- As an additional analysis, we tested how well the meta-analysis associations
(seen here as a
true
set of associations) can be found when assessing only a single study. This is done by calculating precision and recall for the detection of the true set given the single study associations.
Rscript figure_precision_recall.R
- Since the concept of the generalized fold change may be new, there is a script that creates a figures which should aid the explanation of the concept:
Rscript figure_fc_explanation.R
To reproduce the confounder analysis and the accompanying figure, please type:
Rscript confounder_analysis.R species
The script will compute the variance explained by the disease status (i.e. CRC)
and by potential confounding variables (e.g. Age, Sex, BMI, etc.). The results
from this analysis motivated the inclusion of Study
and Colonoscopy
into
the differential abundance testing pipeline (by blocking for study and
colonoscopy status).
The script will also produce a plot contrasting the variance explained by the
potential confounder and by disease status.
- If you want to recreate the alpha/beta diversity analysis (which also shows
the strong influence of
Study
as a confounder), you can use this script:
Rscript figure_a_b_diversity.R
- We assessed another potential confounder in our data, namely the compositional nature of microbiome data. The method ANCOM accounts for compositionality effects and was applied to the data alongside the blocked Wilcoxon test. The following script reproduces the supplementary figure comparing Wilcoxon and ANCOM:
Rscript figure_ancom.R
We found that the most strongly associated species can be grouped into four different clusters with preferential enrichment in different patient subgroup. The code to reproduce this analysis can be executed by running:
Rscript cluster_species.R
Note: The ordering of the species in the heat map is not identical to the ordering in the clustered histogram
The script will also produce a set of plots showing the positivity for each species clusters in different patient subgroups.
The machine learning pipeline can be run using the train_models.R
script. The
script will train one LASSO model for each study and one model for each
leave-one-study-out setting. The models will be save in the respective
./models/
folder. Please note that this step can take quite some time,
depending on the input features. For example, training all models for eggNOG
input features took approximately 8 hours on my machine.
After training the models, you can also reproduce the performance figure, using
the figure_performance.R
script.
Rscript train_models.R species
Rscript figure_performance.R species
Note: training all models for species input features took circa 45 min on my machine.
If you want to run the machine learning pipeline with another machine learning method, you can
- either change the respective entry in the
parameters.yaml
file - or you supply a second command line argument to
train_models.R
.
For example, if you want to train a Random Forest model and check the performance, you can type:
Rscript train_models.R species randomForest
Rscript figure_performance.R species randomForest
There are a few additional analyses that can be performed:
- Bias in the predictions:
To test the predictions for any bias from potential confounding variables, use the script:
Rscript figure_prediction_bias.R species
In this script, you can give again the feature type and additionally the machine learning method as command line arguments.
- Classifier weights:
For theLASSO
classifier, we can also investigate the feature weights in more detail, using the script:
Rscript figure_weights.R
- Combined classifier:
We can also run a classifier on the combination of eggNOG and species features:
Rscript train_models_combination.R
- Performance on the gene catalogue:
LASSO models had also been trained on the complete gene catalogue, albeit not withSIAMCAT
, since there are too many features. Therefore, we need another script to plot the results:
Rscript figure_performance_genes.R
- Gut Metabolic Modules
The gut metabolic modules are based on the publication by Vireia-Silva et al. and the accompanying tool GMMs. The GMMs are computed based on KEGG abundances. Please follow the scriptcompute_GMMs.R
, in which you will also find instructions to run the GMMs tool.
The main plots for the GMMs can be replicated by calling:
Rscript figure_GMMs.R
Another analysis trying to link the GMMs with species level abundances can be re-run with:
Rscript link_GMM_species.R
- Potential CRC-related functions
In order to test several different CRC-related functions of the microbiome, we screened the IGC with several HMMs. The HMMs were created with HMMER and can be found under./files/genes/hmms/
.
To recreate the panels of the main and the extended data figure, you can type:
Rscript figure_gene_abundance.R
Rscript figure_bai_expression.R
Rscript figure_bai_extended.R
There are two different types of external validations included in this project.
- Other CRC studies
To check the associations and the classification accuracy in other CRC studies, you can run these scripts:
Rscript external_validation_associations.R
Rscript external_validation_classification.R species
Here, you can use all the models that have been trained before
(i.e. species
/KEGG
/eggNOG
).
Again, you can indicate another machine learning method as second command line
argument.
Also, the functional enrichment can be reproduced in the external CRC studies.
In order to do so, you can type:
Rscript external_validation_genes.R
- Non-CRC studies
We also looked at other studies with shotgun metagenomic analyses that included non-CRC patients to check how much cross-classification we get with other diseases. In order to run this analysis, you can use:
Rscript external_validation_cross_classification.R
Note: since this analysis only works on species-level taxonomic profiles as input features, there is no argument to specify the features. But again, there is an optional argument for the machine learning method to be used (if the method differs from the method specified in the
parameters.yaml
file)
If you have questions about the code or the analyses presented in this project, please feel free to contact me, Jakob Wirbel 😃