In general, each sequential step in this process should be run in order (e.g., `000_*`, then `00_*`, then `01_*`, up to `04_*`). Each directory usually contains three subdirectories: `code`, `data`, and `analysis`. Procedures are implemented in the `code` directory, take their input from the `data` directory, and also place intermediate files in `data`. The final analysis then uses elements of `data` as input and writes its results to `analysis`. Sometimes, output elements from one analysis are the input for a subsequent analysis, in which case the code uses either symbolic links or copies of files from one analysis into the other.
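For example, a later step that needs an earlier step's output might link it into its own `data` directory rather than copying it. A minimal sketch with illustrative paths (the pipeline scripts handle this themselves):

```python
import os

# Link an output file of one analysis into the data directory of the next
# step instead of copying it (paths are illustrative only).
src = os.path.abspath("02_divtime/analysis/tree7/posterior/alignment_and_trees.nexus.con.tre")
dst = "03_anc_melody_inference/data/alignment_and_trees.nexus.con.tre"

os.makedirs(os.path.dirname(dst), exist_ok=True)
if not os.path.exists(dst):
    os.symlink(src, dst)  # or shutil.copy2(src, dst) to copy instead
```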
The `000_dataset_cleaning/source_data` folder contains the source datasets (Christmas and Solesmes) used in experiments. They are present as CantusCorpus-style CSV files. The current version of the Christmas dataset is available as `christmas_dataset_ismir2024.csv`.
Compared to ISMIR 2023, additional cleaning has been applied: differentiae and other cues such as repetenda were removed, and the Cistercian melodies for the Judea et Jerusalem responsory and its verse Constantes estote were transposed a fifth down to be compatible with the rest of the sources.
In order to obtain inputs for the bioinformatics pipeline, we need to further preprocess the melodies. For phylogeny building, melodies should contain only notes and neume/syllable/word separators using the standard volpiano dashes. All barlines and other non-note characters should be removed. (The preprocessing steps in phylogeny building then take care of the separators: in some settings we might prefer to retain them.) This is done by the `000_dataset_cleaning/code/clean_christmas.py` script (use flags `-cbnder` for the full cleaning).
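A rough sketch of this kind of cleaning is shown below. It assumes that the note characters are the volpiano letters (`8`, `9`, `a`-`s`, with upper case for liquescents) and that everything else (clefs, barlines, other symbols) is dropped; `clean_christmas.py` with the `-cbnder` flags is the authoritative implementation.

```python
import re

# Keep only volpiano note characters and the dash separators
# ("-" neume, "--" syllable, "---" word); drop clefs, barlines and any
# other non-note symbols. The character classes are an assumption.
NON_NOTE = re.compile(r"[^89a-sA-S-]")

def clean_volpiano(melody: str) -> str:
    cleaned = NON_NOTE.sub("", melody)
    cleaned = re.sub(r"-{4,}", "---", cleaned)  # collapse dash runs left by removed symbols
    return cleaned.strip("-")                   # drop leading/trailing separators

print(clean_volpiano("1---f--g--h---3---f-g--3"))  # -> "f--g--h---f-g"
```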
In order to get valid inputs for alignment, we must then create FASTA files (with sigla as sequence names) individually for each chant. This step is implemented by the `000_dataset_cleaning/code/build_src_fasta.py` script.
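A minimal sketch of this conversion, assuming a cleaned CantusCorpus-style CSV with `cantus_id`, `siglum` and `volpiano` columns (`build_src_fasta.py` is the authoritative implementation):

```python
import csv
import os
from collections import defaultdict

# Group melodies by Cantus ID and write one FASTA file per chant,
# using the source siglum as the sequence name.
def build_fasta(csv_path: str, out_dir: str) -> None:
    chants = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row["volpiano"]:
                chants[row["cantus_id"]].append((row["siglum"], row["volpiano"]))
    os.makedirs(out_dir, exist_ok=True)
    for cantus_id, records in chants.items():
        with open(os.path.join(out_dir, f"{cantus_id}.fasta"), "w", encoding="utf-8") as out:
            for siglum, volpiano in records:
                out.write(f">{siglum}\n{volpiano}\n")
```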
The following sigla are retained (because these sources contain enough of all the Cantus IDs that we are analysing):

- A-VOR Cod. 259/I
- A-Wn 1799**
- CDN-Hsmu M2149.L4
- CH-E 611
- CZ-HKm II A 4
- CZ-PLm 504 C 004
- CZ-Pn XV A 10
- CZ-Pu I D 20
- CZ-Pu XVII E 1
- D-KA Aug. LX
- D-KNd 1161
- F-Pn lat. 12044
- F-Pn lat. 15181
- NL-Uu 406
Multiple sequence alignment was carried out using the notebook `00_tree_inference/code/00_alignment.ipynb`. We use the FASTA-formatted sequences from the previous step to align them with `mafft` using a custom score matrix (`00_textmatrix_complete`). Some reformatting is necessary before the files can be used for Bayesian inference; this is carried out by `phyx`, also in the same Python notebook.
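The notebook drives `mafft` externally; a hedged sketch of such a call is shown below. It assumes mafft's text-alignment mode (`--text`) and the `--textmatrix` option for the custom score matrix, and the file names are illustrative; see `00_alignment.ipynb` for the options actually used.

```python
import subprocess

# Align one chant's FASTA file with mafft in text mode, using the custom
# volpiano score matrix. The flags are an assumption about mafft's
# text-alignment interface.
def align_chant(fasta_in: str, matrix: str, fasta_out: str) -> None:
    with open(fasta_out, "w", encoding="utf-8") as out:
        subprocess.run(
            ["mafft", "--text", "--textmatrix", matrix, fasta_in],
            stdout=out,
            check=True,
        )

align_chant("data/chant.fasta", "00_textmatrix_complete", "data/chant_aligned.fasta")
```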
Tree inference is carried out by `mrbayes_volpiano` as triggered by the script `00_tree_inference/code/01_run_tree_inference.sh`, which uses the `mrbayes_volpiano` script `01_tree_inference.mb`. This step uses the concatenated alignment `00_tree_inference/data/concatenated.nexus` and returns the analysis output to `00_tree_inference/analysis`.
Bayesian model selection using stepping stones is carried out on the maximum clade credibility (MCC) tree, which is calculated in `01_model_selection/code/00_calculate_maxcredtree.sh` using `phyx` and `treeannotator`.
Then all the possible rooting positions on the MCC tree are generated by `01_model_selection/code/01_reroot_trees.R` using the packages `ape` and `phytools` in R (an unrooted binary tree with 14 taxa has 2n - 3 = 25 branches, hence 25 candidate root positions). The script in `01_model_selection/code/` takes the tree file generated in this step and concatenates it into a general NEXUS file containing both the concatenated alignment from `00_tree_inference/data` and an intermediate file with the tree block that includes all 25 possible rooted trees.
The script `01_model_selection/code/03_run_stepping_stones.sh` runs the analysis on the input files in `01_model_selection/data`, using the template script `01_model_selection/code/stepping_stones.mb`, for each of the 25 possible trees, and saves the output in a dedicated directory named after the tree ID.
Collection of the marginal lnL values is done by grepping the log files and keeping the tree ID directory name and the lnL value with `grep -a "Mean: " ../analysis/*/*log`.
These values are then manually stored in the file `01_model_selection/analysis/marginal_logliks.tsv`. This is the input for the script `01_model_selection/code/04_summarise_marginal_loglik.R`, which generates a barplot with the model posterior probabilities for each tree and saves it to `01_model_selection/analysis/model_posterior_probability.pdf`.
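Turning marginal log-likelihoods into model posterior probabilities amounts to exponentiating and normalising, assuming a uniform prior over the 25 candidate rootings. A small Python sketch of this arithmetic (the R script is the authoritative implementation, and the values below are illustrative only):

```python
import math

# Marginal log-likelihoods per rooted tree, as collected in
# marginal_logliks.tsv (illustrative values).
marginal_lnl = {"tree1": -8403.2, "tree7": -8399.1, "tree12": -8401.5}

# With a uniform prior over trees, P(tree_i | data) is proportional to
# exp(lnL_i); subtract the maximum before exponentiating for numerical
# stability.
m = max(marginal_lnl.values())
weights = {t: math.exp(lnl - m) for t, lnl in marginal_lnl.items()}
total = sum(weights.values())
posterior = {t: w / total for t, w in weights.items()}
print(posterior)
```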
The results suggest that `tree7` is the best tree, with a model posterior probability higher than 0.7. This is the input for the subsequent analyses.
Divergence time estimation is carried out by `mrbayes_volpiano` using the script `02_divtime/code/00_run_mcmc_sampling.sh`, which in turn uses the concatenated alignment and trees from `01_model_selection/data/alignment_and_trees.nexus`, and then runs an analysis under the prior and under the posterior using the template script `02_divtime/code/mcmc_sampling.mb`.
The age information used as calibration densities is found in Table S1 below.

Table S1. Calibration densities (CD) used in DTE. The time scale is given both in years before the present (YBP, as used by `mrbayes_volpiano`) and in anno Domini (AD). Single time values represent fixed values, whereas intervals represent Uniform(min, max) calibration densities.
| Node | CD (YBP) | CD (AD) |
|---|---|---|
| A VOR Cod 259 I | 654 | 1370 |
| A Wn 1799 | 724–824 | 1200–1300 |
| CDN Hsmu M2149 L4 | 474 | 1550 |
| CH E 611 | 624–724 | 1300–1400 |
| CZ HKm II A 4 | 554 | 1470 |
| CZ PLm 504 C 004 | 408 | 1616 |
| CZ Pn XV A 10 | 624–674 | 1350–1400 |
| CZ Pu I D 20 | 624–674 | 1350–1400 |
| CZ Pu XVII E 1 | 474–524 | 1500–1550 |
| D KA Aug LX | 624–924 | 1100–1400 |
| D KNd 1161 | 799–849 | 1175–1225 |
| F Pn lat 12044 | 874–924 | 1100–1150 |
| F Pn lat 15181 | 674–724 | 1300–1350 |
| NL Uu 406 | 624–924 | 1100–1400 |
| Root | 1124–1324 | 700–900 |
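The two time scales in Table S1 are consistent with a reference year of 2024, i.e. YBP = 2024 - AD; a one-line check:

```python
REFERENCE_YEAR = 2024  # all YBP values in Table S1 equal 2024 - AD

def ad_to_ybp(year_ad: int, reference: int = REFERENCE_YEAR) -> int:
    return reference - year_ad

assert ad_to_ybp(1370) == 654  # A VOR Cod 259 I
assert ad_to_ybp(1616) == 408  # CZ PLm 504 C 004
```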
The result for this analysis is stored in `02_divtime/analysis/tree7`.
The summarised tree file `02_divtime/analysis/tree7/posterior/alignment_and_trees.nexus.con.tre` is then read by `figtree` in order to produce Figure 1. This is done by applying an offset of -408 (matching the 408 YBP calibration of the youngest source), reversing the time axis, plotting the HPD intervals for the node ages, and colouring branches by their median IgrBranch rates. The tree is then plotted in units of years before present.
The script `03_anc_melody_inference/code/00_ancmelody_reconstruction.R` takes as input the calibrated tree in `02_divtime/analysis/tree7/posterior/alignment_and_trees.nexus.con.tre` and the alignment in `00_tree_inference/data/concatenated.nexus`. This step is very time-consuming: we used a cluster with 64 threads available to process the 544 total melody positions for the six melodies analysed. The job was submitted with `03_anc_melody_inference/code/anc_melody_inference.pbs`. The analysis returns the maximum a posteriori melody reconstruction for each melody and each node in TSV format, picking the states with the highest posterior probability. It also saves an R binary data object with the results, as that step is very slow to re-generate frequently.
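A sketch of the MAP read-out: given, for one internal node, the posterior probabilities over volpiano states at every alignment position (the shape of this object is an assumption; the R script is authoritative), the reconstruction is the per-position argmax.

```python
# Per-position posterior probabilities over volpiano states for one
# internal node (illustrative values; in the pipeline these come from
# the ancestral state reconstruction in R).
node_posteriors = [
    {"f": 0.81, "g": 0.15, "-": 0.04},
    {"g": 0.55, "h": 0.40, "-": 0.05},
    {"-": 0.97, "g": 0.03},
]

# Maximum a posteriori state at each position, plus the per-position
# probabilities of those states.
map_states = [max(pos, key=pos.get) for pos in node_posteriors]
map_probs = [pos[s] for pos, s in zip(node_posteriors, map_states)]

map_melody = "".join(map_states)
print(map_melody, map_probs)  # "fg-" [0.81, 0.55, 0.97]
```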
We used `mafft` again to align the edited reconstructions of five of the melodies available in the Solesmes edited compilation, while keeping the base alignment unchanged. This allowed us to track equivalent positions between the Solesmes melodies and our ancestral melody reconstructions. It also allowed us to calculate melody posterior probabilities for both reconstructions and compare them across the internal nodes. Multiple sequence alignment incorporating the Solesmes melodies in `04_solesmes_melodies/data` was carried out by the script `04_solesmes_melodies/code/00_align_to_msa.sh`.
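Keeping the base alignment fixed while adding new sequences corresponds to mafft's `--add`/`--keeplength` mode; a hedged sketch (the repository's `00_align_to_msa.sh` is authoritative, and the file names are illustrative):

```python
import subprocess

# Add the Solesmes melodies to the existing multiple sequence alignment
# without changing the columns of the base alignment. The exact flag
# combination is an assumption.
with open("04_solesmes_melodies/data/msa_with_solesmes.fasta", "w") as out:
    subprocess.run(
        ["mafft", "--text", "--keeplength",
         "--add", "04_solesmes_melodies/data/solesmes.fasta",
         "00_tree_inference/data/base_alignment.fasta"],
        stdout=out,
        check=True,
    )
```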
The results in binary format from the ancestral melody reconstruction are then loaded again and used to generate a table with the melody posterior probability and its log for a given internal node ID. This is done by `04_solesmes_melodies/code/01_calculate_pp_solesmes.R`, which is called iteratively over all the internal node IDs by `04_solesmes_melodies/code/01_calculate_pp_foreach_node.sh`. This generates TSV tables with the melody probabilities at each internal node.
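Assuming independence across alignment positions, the posterior probability of a whole reconstructed melody is the product of the chosen states' per-position probabilities (equivalently, the sum of their logs). A minimal sketch, reusing the per-position probabilities from the read-out sketch above:

```python
import math

# Posterior probabilities of the chosen (MAP) states at each position of
# one node's reconstruction (illustrative values).
map_probs = [0.81, 0.55, 0.97]

# Working in log space avoids underflow for long melodies.
log_pp = sum(math.log(p) for p in map_probs)
melody_pp = math.exp(log_pp)
print(melody_pp, log_pp)
```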
The Ancestral State Reconstruction results can be postprocessed back into a CantusCorpus-style CSV with the `postprocess_ancstate_into_csv.py` script. The result can then be loaded, e.g., into ChantLab for analysis.
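A minimal sketch of the kind of conversion involved, assuming one reconstructed volpiano string per internal node and only a small subset of CantusCorpus-style columns; `postprocess_ancstate_into_csv.py` is the authoritative implementation.

```python
import csv

# Reconstructed melodies keyed by internal node ID (illustrative values).
reconstructions = {
    "node_15": "f--g--h---f-g",
    "node_16": "f--g--g---f-g",
}

# Write a minimal CantusCorpus-style CSV; the real script maps the
# reconstructions onto the full set of columns expected downstream
# (e.g. by ChantLab).
with open("ancestral_melodies.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "siglum", "volpiano"])
    writer.writeheader()
    for node_id, volpiano in reconstructions.items():
        writer.writerow({"id": node_id, "siglum": node_id, "volpiano": volpiano})
```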