-
Notifications
You must be signed in to change notification settings - Fork 50
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
StatsModels v0.6.0 compatibility; REQUIRE -> Project.toml
also: - v0.9.1 and this commit not compatible with GLM v1.2.0+ would need to cap v0.9.1 to use GLM v1.1.1. - LinPredModel -> GLM.LinPredModel for compatibility with GLM v1.2.0+ - fitdiscrete: * better wrapper * empirical & stationary distributions * name change from fitDiscrete * rate variation: category median rates normalized * fixed SM traitlabels2indices error
- Loading branch information
1 parent
1431723
commit 985b925
Showing
16 changed files
with
644 additions
and
384 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
name = "PhyloNetworks" | ||
uuid = "33ad39ac-ed31-50eb-9b15-43d0656eaa72" | ||
license = "MIT" | ||
version = "0.10.0" | ||
|
||
[deps] | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9" | ||
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa" | ||
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" | ||
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a" | ||
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b" | ||
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f" | ||
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a" | ||
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" | ||
NLopt = "76087f3c-5699-56af-9a33-bf431cd00edd" | ||
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7" | ||
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" | ||
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b" | ||
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182" | ||
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" | ||
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" | ||
StatsFuns = "4c63d2b9-4356-54db-8cca-17b64c39e42c" | ||
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d" | ||
|
||
[compat] | ||
BioSequences = "≥ 1.0.0" | ||
BioSymbols = "≥ 3.0.0" | ||
CSV = "≥ 0.4.0" | ||
Combinatorics = "≥ 0.7.0" | ||
DataFrames = "≥ 0.13.0" | ||
DataStructures = "≥ 0.9.0" | ||
Distributions = "≥ 0.15.0" | ||
GLM = "≥ 1.0.0" | ||
NLopt = "≥ 0.5.1" | ||
SpecialFunctions = "≥ 0.7.0" | ||
StaticArrays = "≥ 0.8.3" | ||
StatsBase = "≥ 0.26.0" | ||
StatsFuns = "≥ 0.7.0" | ||
StatsModels = "≥ 0.6.0" | ||
julia = "≥ 0.7.0" | ||
|
||
[extras] | ||
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
|
||
[targets] | ||
test = ["Test"] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,137 +1,142 @@ | ||
```@setup traitevol_fixednet | ||
using PhyloNetworks | ||
using DataFrames | ||
mkpath("../assets/figures") | ||
``` | ||
|
||
# Discrete Trait Evolution | ||
|
||
With a phylogenetic network structure inferred, we can now estimate how quickly traits | ||
have evolved over time using a likelihood model. These traits should be discrete | ||
characteristics of a species such as feather color, diet type, or dna in aligned | ||
genetic sequences. | ||
have evolved over time using a likelihood model. These traits should be discrete | ||
characteristics of a species such as feather color, diet type, | ||
or DNA in aligned genetic sequences. | ||
|
||
## Discrete Trait Data | ||
|
||
As with continuous trait evolution, we assume a fixed network, correctly rooted, | ||
As with continuous trait evolution, we assume a fixed network, correctly rooted, | ||
with branch lengths proportional to calendar time. We start with a network, then | ||
add data about the tips of this network. We allow data of two types. | ||
1. A vector of species names with a data frame of traits: | ||
|
||
```@setup fitDiscrete | ||
using PhyloNetworks, DataFrames | ||
mkpath("../assets/figures") | ||
``` | ||
|
||
```@example fitDiscrete | ||
#read in network | ||
simple_net = readTopology("(A:3.0,(B:2.0,(C:1.0,D:1.0):1.0):1.0);"); | ||
#read in trait data | ||
simple_species = ["C","A","B","D"] | ||
simple_dat = DataFrame(trait=["hi","lo","lo","hi"]) | ||
``` | ||
|
||
If your species names and trait data are in the same data frame, read in your data | ||
frame then subset the data like this: | ||
```@example fitDiscrete | ||
dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]) | ||
simple_species = dat[:species] | ||
simple_dat = DataFrame(trait = dat[:trait]) | ||
``` | ||
|
||
2. To use dna data, read in the network structure then start with a fasta | ||
file. Reading the data from this file using the `readfastatodna` function. This | ||
creates a data frame of dna data and a vector of dna pattern weights. | ||
|
||
```@example fitDiscrete | ||
#read in network | ||
dna_net = readTopology("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);"); | ||
1. A vector of species names with a data frame of traits: | ||
|
||
#read in dna data | ||
fastafile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples", | ||
"Ae_bicornis_Tr406_Contig10132.aln") | ||
dna_dat, dna_weights = readfastatodna(fastafile, true); | ||
``` | ||
```@example traitevol_fixednet | ||
# read in network | ||
net = readTopology("(A:3,((B:0.4)#H1:1.6::0.92,((C:0.4,#H1:0::0.08):0.6,D:1):1):1);"); | ||
# read in trait data | ||
species = ["C","A","B","D"] | ||
dat = DataFrame(trait=["hi","lo","lo","hi"]) | ||
``` | ||
|
||
If your species names and trait data are in the same data frame, | ||
read in your data frame then subset the data like this: | ||
```@example traitevol_fixednet | ||
dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]) | ||
species = dat[:species] | ||
dat = DataFrame(trait = dat[:trait]) | ||
``` | ||
|
||
2. To use dna data, read in the network structure then start with a fasta | ||
file. Reading the data from this file using the `readfastatodna` function. | ||
This creates a data frame of dna data and a vector of dna pattern weights. | ||
|
||
```@example traitevol_fixednet | ||
# read in network | ||
dna_net = readTopology("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);"); | ||
# read in dna data | ||
fastafile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","Ae_bicornis_Tr406_Contig10132.aln") | ||
dna_dat, dna_weights = readfastatodna(fastafile, true); | ||
dna_dat | ||
dna_weights | ||
``` | ||
|
||
## Choosing a Substitution Model | ||
|
||
After reading in your data, choose a model to describe how evolutionary changes | ||
(or substitutions, in the case of DNA) happened over time. We offer a selection | ||
of Markov substitution models to describe the evolutionary process. | ||
After reading in your data, choose a model to describe how evolutionary changes | ||
(or substitutions, in the case of DNA) happened over time. | ||
Available Markov substitution models are described below. | ||
|
||
### Generic Trait Models | ||
### Generic Trait Models | ||
|
||
These models works well for any type of trait we may want to model. For general | ||
trait types, use one of these three models: | ||
`:ERSM` Equal Rates Substitution Model | ||
`:BTSM` Binary Trait Substitution Model | ||
`:TBTSM` Two Binary Trait Substituion Model | ||
|
||
### DNA-Specific Models | ||
|
||
The DNA-specific models are optimized for aligned sequence data. We offer JC69 | ||
and HKY85 models in both relative and absolute versions. The JC69 model was | ||
developed by Jukes and Cantor in 1969 and uses one rate for all type of substitutions. | ||
The HKY85 model was developed in 1985 by Hasegawa, Kishino, & Yano. It treats | ||
transitions differently from transversions. | ||
`:JC69` Jukes Cantor 69 Model | ||
`:HKY85` Hasegawa, Kishino and Yano 1985 | ||
|
||
## Running FitDiscrete | ||
|
||
To infer evolutionary rates, run the `fitDiscrete` function on the network and data. | ||
It will calculate the maximum likelihood score of a network given one or more | ||
discrete trait characters at the tips. Along each edge, evolutionary changes | ||
are modeled with a continous time Markov model, with parameters estimated by | ||
maximizing the likelihood. At each hybrid node, the trait is assumed to be | ||
inherited from the immediate parent (or parents, in the case of a hybrid edge). | ||
If there is a hybrid edge, the trait is modeled according to the parents' weighted | ||
average genetic contributions, as measured by inheritance gamma γ. The model | ||
currently ignores incomplete lineage sorting. | ||
trait types, use one of these three models: | ||
- `:BTSM` Binary Trait Substitution Model (2 states, rates unconstrained) | ||
- `:ERSM` Equal Rates Substitution Model | ||
(`k` states, all transitions possible with equal rates) | ||
- `:TBTSM` Two Binary Trait Substituion Model (though not fully implemented yet) | ||
|
||
### DNA-Specific Models | ||
|
||
The DNA-specific models are optimized for aligned sequence data. | ||
The 4 nucleotide states are from | ||
[BioSymbols](https://github.com/BioJulia/BioSymbols.jl) | ||
(listed [here](http://biojulia.net/BioSymbols.jl/stable/nucleicacids/)). | ||
Each model has a relative and an absolute version. | ||
- `:JC69` Jukes & Cantor 1969 model: one single rate for all transitions. | ||
The relative version has values -1 along the diagonal of the rate matrix | ||
(1 expected transition / unit of time). The absolute version has an extra | ||
parameter to scale the rate matrix. | ||
- `:HKY85` Hasegawa, Kishino & Yano 1985: treats transitions differently | ||
from transversions. | ||
|
||
## Fitting the model | ||
|
||
To infer evolutionary rates, run the `fitdiscrete` function on the network and data. | ||
It will calculate the maximum likelihood score of a fixed network | ||
given one or more discrete trait characters at the tips. | ||
Along each edge, evolutionary changes | ||
are modeled with a continous time Markov model, with parameters estimated by | ||
maximizing the likelihood. At each hybrid node, the trait is assumed to be | ||
inherited from the immediate parent (or parents, in the case of a hybrid node). | ||
At a hybrid node, the trait is assumed to be inherited from one or the other | ||
parent, with probabilities equal to the inheritance γ of each parent edge | ||
(which is given by the network). | ||
The model ignores incomplete lineage sorting (e.g. hemiplasy). | ||
|
||
### General Trait Data | ||
|
||
```@example fitDiscrete | ||
s1 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=false, | ||
optimizeRVAS=false) | ||
s2 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=false, | ||
optimizeRVAS=false) | ||
```@repl traitevol_fixednet | ||
s1 = fitdiscrete(net, :ERSM, species, dat; optimizeQ=false) | ||
s2 = fitdiscrete(net, :BTSM, species, dat; optimizeQ=false) | ||
``` | ||
In this `fitDiscrete` call, we do not optimize rates or allow for rate variation | ||
across sites. | ||
|
||
If `optimizeQ = true`, the `fitDiscrete` function estimates the evolutionary rate | ||
or rates. Because we didn't allow for rate variation across sites in these models, | ||
we do not optimize the way rates may vary across trait types. | ||
|
||
```@example fitDiscrete | ||
s3 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=true, | ||
optimizeRVAS=true) | ||
s4 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=true, | ||
optimizeRVAS=true) | ||
In this `fitdiscrete` call, we do not optimize rates or allow for rate variation | ||
across sites. The default rates (which act as starting value if rates | ||
were to be optimized) are chosen equal to the inverse of the total edge lengths | ||
in the network (or 1/ntax if all branch lengths are missing). | ||
|
||
If `optimizeQ = true` (which is the default), the `fitdiscrete` | ||
function estimates the parameters of the rate matrix. | ||
Because we didn't allow for rate variation across sites in these models, | ||
there is nothing to optimize in the way rates may vary across traits (sites). | ||
|
||
```@repl traitevol_fixednet | ||
s3 = fitdiscrete(net, :ERSM, species, dat) | ||
s4 = fitdiscrete(net, :BTSM, species, dat) | ||
``` | ||
|
||
### DNA Data | ||
|
||
For DNA data, use `:JC69` or `:HKY85` models. | ||
```@example fitDiscrete | ||
d1 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=false, | ||
optimizeRVAS=false) | ||
d2 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=false, | ||
optimizeRVAS=false) | ||
For DNA data, use one of `:JC69` or `:HKY85`. | ||
To allow for rate variation across sites, use the `:RV` option. | ||
|
||
```@example traitevol_fixednet | ||
d1 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=false, optimizeRVAS=false) | ||
d2 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=false, optimizeRVAS=false) | ||
``` | ||
In these `fitDiscrete` models, we do not optimize rates (`optimizeQ=false`), but | ||
we do allow for rate variation across sites. | ||
In these `fitdiscrete` models, we do not optimize rates (`optimizeQ=false`), but | ||
we do allow for rate variation across sites, with a default α of 1. | ||
|
||
### Rate Variation Across Sites | ||
|
||
In its default version, `fitDiscrete` does not allow for rate variation across sites. | ||
To allow for rate variation across sites in your estimate of evolutionary rates | ||
(or rate variation across trait types, in the case of general trait types), | ||
include `:RV`. If you include `:RV` and `optimizeRVAS = true`, the model will | ||
not only allow for rate variation, but it will also optimize how rates vary across | ||
sites. | ||
In its default version, `fitdiscrete` does not allow for rate variation across sites. | ||
To allow for rate variation across sites in your estimate of evolutionary rates | ||
(or rate variation across traits in the case of general traits), | ||
include `:RV`. If you include `:RV` and `optimizeRVAS = true`, | ||
the model will allow for rate variation and | ||
also optimize the parameter α of the distribution of rates across sites. | ||
|
||
We optimize the evolutionary rates and the way rates vary across sites for the | ||
DNA data here: | ||
```@example fitDiscrete | ||
d3 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=true, | ||
optimizeRVAS=false) | ||
d4 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=true, | ||
optimizeRVAS=false) | ||
``` | ||
DNA data here: | ||
```@repl traitevol_fixednet | ||
d3 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeRVAS=false) | ||
d4 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV) | ||
``` |
Oops, something went wrong.