bug fix to read table of qCFs (#188)
1. fix to delete uninformative 4-taxon sets, see this google group issue: https://groups.google.com/g/phylonetworks-users/c/w71ImPXIk58/m/ywrHVcIdBwAJ
2. fix #143 by adding new option 'mergerows' for custom use-cases
3. added tests for multiple alleles, and tested examples in manual on multiple alleles
cecileane authored Oct 3, 2022
1 parent 5e2e477 commit b4fdce9
Showing 6 changed files with 140 additions and 128 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -31,7 +31,7 @@ BioSequences = "2.0, 3"
BioSymbols = "4.0, 5"
CSV = "0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10"
Combinatorics = "0.7, 1.0"
-DataFrames = "0.21, 0.22, 1.0"
+DataFrames = "1.3"
DataStructures = "0.9, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18"
Distributions = "0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25"
FASTX = "1.1, 2"
66 changes: 39 additions & 27 deletions docs/src/man/multiplealleles.md
@@ -1,3 +1,7 @@
```@setup multialleles
using PhyloNetworks
```

# Multiple alleles per species

## between-species 4-taxon sets
@@ -7,20 +11,27 @@ to a taxon (a tip) in the network. If instead each allele/individual can be mapped
to a species, and if only the species-level network needs to be estimated,
then the following functions can be used:

```julia
tm = DataFrame(CSV.File(mappingFile)) # taxon map as a data frame
taxonmap = Dict(tm[i,:allele] => tm[i,:species] for i in 1:110) # taxon map as a dictionary
```@repl multialleles
using CSV, DataFrames
mappingfile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","mappingIndividuals.csv");
tm = CSV.read(mappingfile, DataFrame) # taxon map as a data frame
taxonmap = Dict(row[:individual] => row[:species] for row in eachrow(tm)) # taxon map as a dictionary
```
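
For readers without the example file at hand, here is a minimal sketch of how such a taxon map can be built with Base Julia only. The table contents are hypothetical: they are modeled on the tip names in `genetrees_alleletips.tre` (`S1A`, `S1B`, `S1C`, `S2`, ...), not copied from `mappingIndividuals.csv`. The parsing is deliberately naive (no quoting or escaping), and the names `csvtext`, `indcol` and `spcol` are illustrative only.

```julia
# Hypothetical mapping table, written inline instead of read from a file.
# Column names follow the manual: `individual` (or `allele`) and `species`.
csvtext = """
individual,species
S1A,S1
S1B,S1
S1C,S1
S2,S2
S3,S3
S4,S4
S5,S5
"""

# Split into lines, then locate the two columns of interest by name.
lines  = split(strip(csvtext), '\n')
header = String.(split(lines[1], ','))
indcol = findfirst(in(("allele", "individual")), header)  # either name is accepted
spcol  = findfirst(==("species"), header)

# Build the dictionary: individual name => species name.
taxonmap = Dict{String,String}()
for line in lines[2:end]
    fields = split(line, ',')
    taxonmap[String(fields[indcol])] = String(fields[spcol])
end
```

With real data, `CSV.read(mappingfile, DataFrame)` followed by the `Dict` comprehension over `eachrow(tm)` remains the more robust route, since CSV.jl handles quoting and missing values.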

The mapping file can be a text (or `csv`) file with two columns (at least):
one column named `allele` and one named `species`,
mapping each allele name to a species name. Next, read in the gene trees
The [mapping file](https://github.com/crsl4/PhyloNetworks/blob/master/examples/mappingIndividuals.csv)
can be a text (or `csv`) file with two columns (at least):
one for the individuals, named `allele` or `individual`,
and one column containing the species names, named `species`.
Each row should map an allele name to a species name.
Next, read in the [gene trees](https://github.com/crsl4/PhyloNetworks/blob/master/examples/genetrees_alleletips.tre)
and calculate the quartet CFs at the species level:


```julia
genetrees = readMultiTopology(genetreeFile)
df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap)...)
```@repl multialleles
genetreefile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","genetrees_alleletips.tre");
genetrees = readMultiTopology(genetreefile);
sort(tipLabels(genetrees[1])) # multiple tips in species S1
df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap, showprogressbar=false)...)
```

Now `df_sp` is a data frame containing the quartet concordance factors
@@ -34,9 +45,10 @@ to calculated of the CF for `A,B,C,D` (such that the total weight from
this particular gene tree is 1).
It is safe to save this data frame, then use it for `snaq!` like this:

```julia
CSV.write("tableCF_species.csv", df) # to save the data frame to a file
d_sp = readTableCF("tableCF_species.csv") # to get a "DataCF" object for use in snaq!.
```@repl multialleles
CSV.write("tableCF_species.csv", df_sp); # to save the data frame to a file
d_sp = readTableCF("tableCF_species.csv"); # to get a "DataCF" object for use in snaq!
summarizeDataCF(d_sp)
```

## within-species 4-taxon sets
@@ -48,16 +60,17 @@ using this extra information. To get quartet CFs from sets of 4 individuals
in which 2 individuals are from the same species, the following functions
should be used:

```julia
df_ind = writeTableCF(countquartetsintrees(genetrees)...) # no mapping here: so quartet CFs across individuals
CSV.write("tableCF_individuals.csv", df) # to save to a file
df_sp = mapAllelesCFtable(mappingFile, "tableCF_individuals.csv");
d_sp = readTableCF!(df_sp);
```@repl multialleles
df_ind = writeTableCF(countquartetsintrees(genetrees, showprogressbar=false)...); # no mapping: CFs across individuals
first(df_ind, 4) # to see the first 4 rows
CSV.write("tableCF_individuals.csv", df_ind); # to save to a file
df_sp = mapAllelesCFtable(mappingfile, "tableCF_individuals.csv");
d_sp = readTableCF!(df_sp, mergerows=true);
```
where the mapping file can be a text (or `csv`) file with two columns
named `allele` and `species`, mapping each allele name to a species name.
named `allele` (or `individual`) and `species`, mapping each allele name to a species name.
The data in `df_ind` is the table of concordance factors at the level of individuals.
In other words, it list CFs using one row for each set of 4 alleles/individuals.
In other words, it lists CFs using one row for each set of 4 alleles/individuals.

`mapAllelesCFtable` creates a new data frame `df_sp` of quartet concordance factors at the
species level: with the allele names replaced by the appropriate species names.
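
The renaming step can be pictured on a single 4-taxon set. The sketch below is a simplification in Base Julia, not the package's implementation: `tospecies` is a hypothetical helper, and the taxon map is a small made-up dictionary. Names absent from the map are left unchanged, which matches the diff of `compareTaxaNames`, where unmatched names only trigger a warning.

```julia
# Hypothetical taxon map: individual name => species name.
taxonmap = Dict("S1A" => "S1", "S1B" => "S1", "S1C" => "S1")

# Replace each individual name by its species name.
# Names that are not keys of the map are returned as they are.
tospecies(quartet, tmap) = [get(tmap, name, name) for name in quartet]

quartet_ind = ["S1A", "S1B", "S2", "S3"]        # one row of df_ind, taxon columns only
quartet_sp  = tospecies(quartet_ind, taxonmap)  # same row at the species level
```

After this step the row reads `S1, S1, S2, S3`: the same species appears twice, which is why the table still needs the cleaning done inside `readTableCF!`.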
@@ -81,9 +94,9 @@ But before, it is safe to save the concordance factor of quartets of species,
which can be calculated by averaging the CFs of quartets of individuals
from the associated species:

```julia
```@repl multialleles
df_sp = writeTableCF(d_sp) # data frame, quartet CFs averaged across individuals of same species
CSV.write("CFtable_species.csv", df_sp) # save to file
CSV.write("CFtable_species.csv", df_sp); # save to file
```

Some quartets have the same species repeated twice,
@@ -115,15 +128,14 @@ This can be done as in the first section ("between-species 4-taxon sets")
to give equal weight to all genes,
or as shown below to give more weight to genes that have more alleles:

```julia
df_sp = writeTableCF(d_sp) # some quartets have the same species twice
```@repl multialleles
first(df_sp, 3) # some quartets have the same species twice
function hasrep(row) # see if a row (4-taxon set) has a species name ending with "__2": repeated species
occursin(r"__2$", row[:tx1]) || occursin(r"__2$", row[:tx2]) ||
occursin(r"__2$", row[:tx3]) || occursin(r"__2$", row[:tx4])
occursin(r"__2$", row[:t1]) || occursin(r"__2$", row[:t2]) || # replace :t1 :t2 etc. by appropriate column names in your data,
occursin(r"__2$", row[:t3]) || occursin(r"__2$", row[:t4]) # e.g. by :taxon1 :taxon2 etc.
end
df_sp_reduced = filter(!hasrep, df_sp) # removes rows with repeated species
df_sp_reduced # should have fewer rows than df_sp
CSV.write("CFtable_species_norep.csv", df_sp_reduced) # to save to file
CSV.write("CFtable_species_norep.csv", df_sp_reduced); # to save to file
d_sp_reduced = readTableCF(df_sp_reduced) # DataCF object, for input to snaq!
```

16 changes: 16 additions & 0 deletions examples/genetrees_alleletips.tre
@@ -0,0 +1,16 @@
(((S4:1.4835,S5:1.4835):0.7879,((S1C:0.014,S1B:0.014):1.251,(S3:0.7704,S1A:0.7704):0.4946):1.0064):0.1762,S2:2.4476);
(((S3:1.2587,(S1C:0.6223,(S1A:0.1231,S1B:0.1231):0.4992):0.6364):1.1552,(S5:1.0332,S4:1.0332):1.3807):0.1454,S2:2.5592);
((S3:2.6379,(S4:0.8263,S5:0.8263):1.8116):0.212,((S1A:0.8435,(S1B:0.0441,S1C:0.0441):0.7995):0.8692,S2:1.7127):1.1372);
(S2:3.1288,(((S1A:0.2974,S1C:0.2974):0.4802,S3:0.7776):1.9636,(S1B:2.129,(S5:0.8288,S4:0.8288):1.3002):0.6122):0.3876);
(((S5:2.098,S4:2.098):0.6214,((S3:0.7126,(S1B:0.5033,S1A:0.5033):0.2094):0.8673,S1C:1.58):1.1394):0.8885,S2:3.6079);
((((S1B:1.2942,S1A:1.2942):0.8796,(S4:1.3542,S5:1.3542):0.8195):0.2455,(S3:0.9679,S1C:0.9679):1.4514):2.9787,S2:5.398);
((((S1B:0.0372,S1A:0.0372):1.2205,S1C:1.2577):0.2329,S3:1.4906):0.9159,(S2:1.932,(S5:0.9554,S4:0.9554):0.9766):0.4745);
(S3:4.6311,(((S2:1.4604,S5:1.4604):0.4353,S4:1.8957):0.2433,(S1C:0.1665,(S1A:0.1024,S1B:0.1024):0.0642):1.9725):2.4921);
((S2:1.4673,(S5:0.7265,S4:0.7265):0.7409):1.0067,(S3:2.1067,((S1B:0.2711,S1C:0.2711):1.0749,S1A:1.3461):0.7606):0.3673);
((((S1B:0.0573,S1C:0.0573):0.3884,S1A:0.4457):0.4876,S3:0.9333):1.9679,((S4:1.5713,S5:1.5713):0.9966,S2:2.568):0.3333);
((S1A:3.036,((S5:0.7604,S4:0.7604):1.6442,(S3:1.1176,(S1C:0.0465,S1B:0.0465):1.0711):1.287):0.6314):1.7552,S2:4.7912);
((S2:1.4523,((S1B:0.4758,(S1A:0.0138,S1C:0.0138):0.462):0.3946,S3:0.8704):0.5818):1.3858,(S5:1.6356,S4:1.6356):1.2024);
((S2:2.3144,(S4:0.7552,S5:0.7552):1.5592):0.8309,(S3:1.1027,((S1A:0.1344,S1C:0.1344):0.5612,S1B:0.6956):0.4071):2.0425);
(((S5:1.5464,S4:1.5464):1.9521,((S1B:0.6516,(S1A:0.0281,S1C:0.0281):0.6235):0.172,S3:0.8236):2.6749):1.8799,S2:5.3785);
(((S4:1.5191,S2:1.5191):0.7628,((S1A:1.1262,(S1B:0.3126,S1C:0.3126):0.8135):0.0776,S3:1.2038):1.0782):2.4257,S5:4.7076);
(((S3:1.8527,(S1C:1.2485,(S1B:0.4803,S1A:0.4803):0.7682):0.6042):0.4991,(S5:1.7785,S4:1.7785):0.5734):0.1705,S2:2.5224);
127 changes: 45 additions & 82 deletions src/multipleAlleles.jl
@@ -33,8 +33,8 @@ in the example below, this file is best read later with the option
```julia
mapAllelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
filename = "quartetCF_speciesNames.csv")
df_sp = DataFrame(CSV.File("quartetCF_speciesNames.csv"); copycols=false); # DataFrame object
dataCF_specieslevel = readTableCF!(df_sp); # DataCF object
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
dataCF_specieslevel = readTableCF!(df_sp, mergerows=true); # DataCF object
```
"""
function mapAllelesCFtable(alleleDF::AbstractString, cfDF::AbstractString;
@@ -65,12 +65,12 @@ as its first argument.
function mapAllelesCFtable!(cfDF::DataFrame, alleleDF::DataFrame, co::Vector{Int},write::Bool,filename::AbstractString)
size(cfDF,2) >= 7 || error("CF DataFrame should have 7+ columns: 4taxa, 3CF, and possibly ngenes")
if length(co)==0 co=[1,2,3,4]; end
compareTaxaNames(alleleDF,cfDF,co)
allelecol, speciescol = compareTaxaNames(alleleDF,cfDF,co)
for j in 1:4
for ia in 1:size(alleleDF,1) # for all alleles
cfDF[!,co[j]] = map(x->replace(string(x),
Regex("^$(string(alleleDF[ia,:allele]))\$") =>
alleleDF[ia,:species]),
Regex("^$(string(alleleDF[ia,allelecol]))\$") =>
alleleDF[ia,speciescol]),
cfDF[!,co[j]])
end
end
@@ -85,98 +85,50 @@ end
# inside readTableCF!
# by deleting rows that are not informative like sp1 sp1 sp1 sp2
# keepOne=true: we only keep one allele per species
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{Int};keepOne=false::Bool)
withngenes = (length(cols)==8)
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{<:Integer}; keepOne=false::Bool)
delrows = Int[] # indices of rows to delete
repSpecies = String[]
repSpecies = Set{String}()
if(isa(newdf[1,cols[1]],Integer)) #taxon names as integers: we need this to be able to add __2
newdf[!,cols[1]] = map(string, newdf[!,cols[1]])
newdf[!,cols[2]] = map(string, newdf[!,cols[2]])
newdf[!,cols[3]] = map(string, newdf[!,cols[3]])
newdf[!,cols[4]] = map(string, newdf[!,cols[4]])
for j in 1:4
newdf[!,cols[j]] .= map(string, newdf[!,cols[j]])
end
end
row = Vector{String}(undef, 4)
for i in 1:size(newdf,1) #check all rows
@debug "row number: $i"
# fixit: check for no missing value, or error below
for i in 1:nrow(newdf)
map!(j -> newdf[i,cols[j]], row, 1:4)
@debug "row $(row)"
uniq = unique(row)
@debug "unique $(uniq)"

keep = false # default: used if 1 unique name, or 2 in some cases
if(length(uniq) == 4)
keep = true
else
if(!keepOne)
if(length(uniq) == 3) #sp1 sp1 sp2 sp3
continue
end
# by now, at least 1 species is repeated
if !keepOne # then we may choose to keep this row
# 3 options: sp1 sp1 sp2 sp3; or sp1 sp1 sp2 sp2 (keep)
# or sp1 sp1 sp1 sp2; or sp1 sp1 sp1 sp1 (do not keep)
keep = false
for u in uniq
ind = row .== u # indices of taxon names matching u
if sum(ind) == 2
keep = true
for u in uniq
@debug "u $(u), typeof $(typeof(u))"
ind = row .== u #taxon names matching u
@debug "taxon names matching u $(ind)"
if(sum(ind) == 2)
push!(repSpecies,string(u))
found = false
for k in 1:4
if(ind[k])
if(found)
@debug "found the second one in k $(k), will change newdf[i,cols[k]] $(newdf[i,cols[k]]), typeof $(typeof(newdf[i,cols[k]]))"
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
break
else
found = true
end
end
end
break
end
end
elseif(length(uniq) == 2)
# keep was initialized to false
for u in uniq
@debug "length uniq is 2, u $(u)"
ind = row .== u
if(sum(ind) == 1 || sum(ind) == 3)
@debug "ind $(ind) is 1 or 3, should not keep"
break
elseif(sum(ind) == 2)
@debug "ind $(ind) is 2, should keep"
keep = true
found = false
push!(repSpecies,string(u))
for k in 1:4
if(ind[k])
if(found)
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
break
else
found = true
end
end
end
end
end
push!(repSpecies, string(u))
# change the second instance of a repeated taxon name with suffix
k = findlast(ind)
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
end
@debug "after if, keep is $(keep)"
end
end
keep || push!(delrows, i)
@debug "" keep
end
@debug "" delrows
@debug "" repSpecies
nrows = size(newdf,1)
nkeep = nrows - length(delrows)
if nkeep < nrows
print("""found $(length(delrows)) 4-taxon sets uninformative about between-species relationships, out of $(nrows).
These 4-taxon sets will be deleted from the data frame. $nkeep informative 4-taxon sets will be used.
""")
nkeep > 0 || @warn "All 4-taxon subsets are uninformative, so the dataframe will be left empty"
deleterows!(newdf, delrows)
deleteat!(newdf, delrows) # deleteat! requires DataFrames 1.3
end
# @show size(newdf)
return unique(repSpecies)
return collect(repSpecies)
end
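
The keep/delete rule that the rewritten `cleanAlleleDF!` applies to each row can be sketched on a plain vector of 4 names. This is an illustration under assumptions, not the package code: `classifyquartet` is a hypothetical helper, the `keepOne` option is ignored, and the suffix `"__2"` is assumed to be the value of `repeatAlleleSuffix`, based on the regex `r"__2$"` used in the manual.

```julia
# Returns (keep, newrow) for one 4-taxon set given as species names.
# keep is true when the row is informative about between-species relationships.
function classifyquartet(row::Vector{String}; suffix::String = "__2")
    newrow = copy(row)
    uniq = unique(row)
    length(uniq) == 4 && return (true, newrow)   # 4 distinct species: keep unchanged
    keep = false
    for u in uniq
        ind = row .== u                          # positions where species u appears
        if sum(ind) == 2                         # u appears exactly twice
            keep = true
            newrow[findlast(ind)] = string(u, suffix)  # rename the second copy
        end
    end
    return (keep, newrow)
end

classifyquartet(["S1", "S1", "S2", "S3"])   # kept, second S1 renamed
classifyquartet(["S1", "S1", "S1", "S2"])   # not kept: S1 appears 3 times
```

Rows such as `sp1 sp1 sp1 sp2` or `sp1 sp1 sp1 sp1` have no species appearing exactly twice, so `keep` stays `false` and the row would be deleted.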


@@ -240,10 +192,9 @@ end
# function to compare the taxon names in the allele-species matching table
# and the CF table
function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
checkMapDF(alleleDF)
#println("found $(length(alleleDF[1])) allele-species matches")
allelecol, speciescol = checkMapDF(alleleDF)
CFtaxa = string.(mapreduce(x -> unique(skipmissing(x)), union, eachcol(cfDF[!,co[1:4]])))
alleleTaxa = map(string, alleleDF[!,:allele]) # as string, too
alleleTaxa = map(string, alleleDF[!,allelecol]) # as string, too
sizeCF = length(CFtaxa)
sizeAllele = length(alleleTaxa)
if sizeAllele > sizeCF
@@ -260,14 +211,26 @@ function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
for n in unchanged warnmsg *= " $n"; end
@warn warnmsg
end
return nothing
return allelecol, speciescol
end

# function to check that the allele df has one column labelled alleles and one column labelled species
"""
checkMapDF(mapping_allele2species::DataFrame)
Check that the data frame has one column named "allele" or "individual",
and one column named "species". Output: indices of these columns.
"""
function checkMapDF(alleleDF::DataFrame)
size(alleleDF,2) >= 2 || error("Allele-Species matching Dataframe should have at least 2 columns")
:allele in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named allele")
:species in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named species")
colnames = DataFrames.propertynames(alleleDF)
allelecol = findfirst(x -> x == :allele, colnames)
if isnothing(allelecol)
allelecol = findfirst(x -> x == :individual, colnames)
end
isnothing(allelecol) && error("In allele mapping file there is no column named 'allele' or 'individual'")
speciescol = findfirst(x -> x == :species, colnames)
isnothing(speciescol) && error("In allele mapping file there is no column named species")
return allelecol, speciescol
end



2 comments on commit b4fdce9

@cecileane
Member Author

@JuliaRegistrator register

Release notes:

Edge type: parametrized by its Node type to use a vector of concrete type
Improved manual & tests about multiple alleles

bug fixes:

removal of degree-2 nodes, used in deleteleaf!
removal of hybrid edges with low inheritance
read table of qCFs: bug due to compatibility; now requires DataFrames v1.3

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/69444

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.15.3 -m "<description of version>" b4fdce9968dedc2561da1805408ce1dd5e1c98c2
git push origin v0.15.3
