bug fix to read table of qCFs (#188)

1. fix to delete uninformative 4-taxon sets, see this google group issue: https://groups.google.com/g/phylonetworks-users/c/w71ImPXIk58/m/ywrHVcIdBwAJ
2. fix #143 by adding new option 'mergerows' for custom use-cases
3. added tests for multiple alleles, and tested examples in manual on multiple alleles
JuliaPhylo · cecileane · Oct 3, 2022 · b4fdce9
1 parent 5e2e477
commit b4fdce9
Showing 6 changed files with 140 additions and 128 deletions.
diff --git a/Project.toml b/Project.toml
@@ -31,7 +31,7 @@ BioSequences = "2.0, 3"
 BioSymbols = "4.0, 5"
 CSV = "0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10"
 Combinatorics = "0.7, 1.0"
-DataFrames = "0.21, 0.22, 1.0"
+DataFrames = "1.3"
 DataStructures = "0.9, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18"
 Distributions = "0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25"
 FASTX = "1.1, 2"

diff --git a/docs/src/man/multiplealleles.md b/docs/src/man/multiplealleles.md
@@ -1,3 +1,7 @@
+```@setup multialleles
+using PhyloNetworks
+```
+
 # Multiple alleles per species
 
 ## between-species 4-taxon sets
@@ -7,20 +11,27 @@ to a taxon (a tip) in the network. If instead each allele/individual can be mapp
 to a species, and if only the species-level network needs to be estimated,
 then the following functions can be used:
 
-```julia
-tm = DataFrame(CSV.File(mappingFile)) # taxon map as a data frame
-taxonmap = Dict(tm[i,:allele] => tm[i,:species] for i in 1:110) # taxon map as a dictionary
+```@repl multialleles
+using CSV, DataFrames
+mappingfile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","mappingIndividuals.csv");
+tm = CSV.read(mappingfile, DataFrame) # taxon map as a data frame
+taxonmap = Dict(row[:individual] => row[:species] for row in eachrow(tm)) # taxon map as a dictionary
 ```
 
-The mapping file can be a text (or `csv`) file with two columns (at least):
-one column named `allele` and one named `species`,
-mapping each allele name to a species name. Next, read in the gene trees
+The [mapping file](https://github.com/crsl4/PhyloNetworks/blob/master/examples/mappingIndividuals.csv)
+can be a text (or `csv`) file with two columns (at least):
+one for the individuals, named `allele` or `individual`,
+and one column containing the species names, named `species`.
+Each row should map an allele name to a species name.
+Next, read in the [gene trees](https://github.com/crsl4/PhyloNetworks/blob/master/examples/genetrees_alleletips.tre)
 and calculate the quartet CFs at the species level:
 
 
-```julia
-genetrees = readMultiTopology(genetreeFile)
-df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap)...)
+```@repl multialleles
+genetreefile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","genetrees_alleletips.tre");
+genetrees = readMultiTopology(genetreefile);
+sort(tipLabels(genetrees[1])) # multiple tips in species S1
+df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap, showprogressbar=false)...)
 ```
 
 Now `df_sp` is a data frame containing the quartet concordance factors
@@ -34,9 +45,10 @@ to calculated of the CF for `A,B,C,D` (such that the total weight from
 this particular gene trees is 1).
 It is safe to save this data frame, then use it for `snaq!` like this:
 
-```julia
-CSV.write("tableCF_species.csv", df)      # to save the data frame to a file
-d_sp = readTableCF("tableCF_species.csv") # to get a "DataCF" object for use in snaq!.
+```@repl multialleles
+CSV.write("tableCF_species.csv", df_sp);   # to save the data frame to a file
+d_sp = readTableCF("tableCF_species.csv"); # to get a "DataCF" object for use in snaq!
+summarizeDataCF(d_sp)
 ```
 
 ## within-species 4-taxon sets
@@ -48,16 +60,17 @@ using this extra information. To get quartet CFs from sets of 4 individuals
 in which 2 individuals are from the same species, the following functions
 should be used:
 
-```julia
-df_ind = writeTableCF(countquartetsintrees(genetrees)...) # no mapping here: so quartet CFs across individuals
-CSV.write("tableCF_individuals.csv", df)               # to save to a file
-df_sp = mapAllelesCFtable(mappingFile, "tableCF_individuals.csv");
-d_sp = readTableCF!(df_sp);
+```@repl multialleles
+df_ind = writeTableCF(countquartetsintrees(genetrees, showprogressbar=false)...); # no mapping: CFs across individuals
+first(df_ind, 4) # to see the first 4 rows
+CSV.write("tableCF_individuals.csv", df_ind);  # to save to a file
+df_sp = mapAllelesCFtable(mappingfile, "tableCF_individuals.csv");
+d_sp = readTableCF!(df_sp, mergerows=true);
 ```
 where the mapping file can be a text (or `csv`) file with two columns
-named `allele` and `species`, mapping each allele name to a species name.
+named `allele` (or `individual`) and `species`, mapping each allele name to a species name.
 The data in `df_ind` is the table of concordance factors at the level of individuals.
-In other words, it list CFs using one row for each set of 4 alleles/individuals.
+In other words, it lists CFs using one row for each set of 4 alleles/individuals.
 
 `mapAllelesCFtable` creates a new data frame `df_sp` of quartet concordance factors at the
 species level: with the allele names replaced by the appropriate species names.
@@ -81,9 +94,9 @@ But before, it is safe to save the concordance factor of quartets of species,
 which can be calculated by averaging the CFs of quartets of individuals
 from the associated species:
 
-```julia
+```@repl multialleles
 df_sp = writeTableCF(d_sp) # data frame, quartet CFs averaged across individuals of same species
-CSV.write("CFtable_species.csv", df_sp) # save to file
+CSV.write("CFtable_species.csv", df_sp); # save to file
 ```
 
 Some quartets have the same species repeated twice,
@@ -115,15 +128,14 @@ This can be done as in the first section ("between-species 4-taxon sets")
 to give equal weight to all genes,
 or as shown below to give more weight to genes that have more alleles:
 
-```julia
-df_sp = writeTableCF(d_sp) # some quartets have the same species twice
+```@repl multialleles
+first(df_sp, 3) # some quartets have the same species twice
 function hasrep(row) # see if a row (4-taxon set) has a species name ending with "__2": repeated species
-  occursin(r"__2$", row[:tx1]) || occursin(r"__2$", row[:tx2]) ||
-    occursin(r"__2$", row[:tx3]) || occursin(r"__2$", row[:tx4])
+    occursin(r"__2$", row[:t1]) || occursin(r"__2$", row[:t2]) || # replace :t1 :t2 etc. by appropriate column names in your data,
+    occursin(r"__2$", row[:t3]) || occursin(r"__2$", row[:t4])    # e.g. by :taxon1 :taxon2 etc.
 end
 df_sp_reduced = filter(!hasrep, df_sp) # removes rows with repeated species
-df_sp_reduced # should have fewer rows than df_sp
-CSV.write("CFtable_species_norep.csv", df_sp_reduced) # to save to file
+CSV.write("CFtable_species_norep.csv", df_sp_reduced); # to save to file
 d_sp_reduced = readTableCF(df_sp_reduced) # DataCF object, for input to snaq!
 ```
 

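The `hasrep` filter in the manual change above drops 4-taxon sets in which a species appears twice (second copy marked with `__2`). A minimal sketch of the same idea in plain Julia, without DataFrames, is given here. The rows, the `CF12_34` values and the `rows_reduced` name are made up for illustration; only the `t1` to `t4` column names and the `__2` suffix come from the manual text.

```julia
# Hypothetical stand-in for rows of a species-level CF table.
# Column names t1..t4 follow the manual's example; CF values are made up.
rows = [
    (t1="S1", t2="S1__2", t3="S2", t4="S3", CF12_34=0.8),
    (t1="S1", t2="S2",    t3="S3", t4="S4", CF12_34=0.5),
    (t1="S2", t2="S3",    t3="S4", t4="S5", CF12_34=0.4),
]

# true if any of the 4 taxon names ends with the "__2" suffix (repeated species)
hasrep(row) = any(name -> occursin(r"__2$", name), (row.t1, row.t2, row.t3, row.t4))

# keep only the 4-taxon sets made of 4 distinct species
rows_reduced = filter(!hasrep, rows)
println(length(rows_reduced), " of ", length(rows), " rows kept")
```

With a real data frame, the same predicate can be passed to `filter` as in the manual, after replacing `t1` to `t4` by the column names used in the data.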
diff --git a/examples/genetrees_alleletips.tre b/examples/genetrees_alleletips.tre
@@ -0,0 +1,16 @@
+(((S4:1.4835,S5:1.4835):0.7879,((S1C:0.014,S1B:0.014):1.251,(S3:0.7704,S1A:0.7704):0.4946):1.0064):0.1762,S2:2.4476);
+(((S3:1.2587,(S1C:0.6223,(S1A:0.1231,S1B:0.1231):0.4992):0.6364):1.1552,(S5:1.0332,S4:1.0332):1.3807):0.1454,S2:2.5592);
+((S3:2.6379,(S4:0.8263,S5:0.8263):1.8116):0.212,((S1A:0.8435,(S1B:0.0441,S1C:0.0441):0.7995):0.8692,S2:1.7127):1.1372);
+(S2:3.1288,(((S1A:0.2974,S1C:0.2974):0.4802,S3:0.7776):1.9636,(S1B:2.129,(S5:0.8288,S4:0.8288):1.3002):0.6122):0.3876);
+(((S5:2.098,S4:2.098):0.6214,((S3:0.7126,(S1B:0.5033,S1A:0.5033):0.2094):0.8673,S1C:1.58):1.1394):0.8885,S2:3.6079);
+((((S1B:1.2942,S1A:1.2942):0.8796,(S4:1.3542,S5:1.3542):0.8195):0.2455,(S3:0.9679,S1C:0.9679):1.4514):2.9787,S2:5.398);
+((((S1B:0.0372,S1A:0.0372):1.2205,S1C:1.2577):0.2329,S3:1.4906):0.9159,(S2:1.932,(S5:0.9554,S4:0.9554):0.9766):0.4745);
+(S3:4.6311,(((S2:1.4604,S5:1.4604):0.4353,S4:1.8957):0.2433,(S1C:0.1665,(S1A:0.1024,S1B:0.1024):0.0642):1.9725):2.4921);
+((S2:1.4673,(S5:0.7265,S4:0.7265):0.7409):1.0067,(S3:2.1067,((S1B:0.2711,S1C:0.2711):1.0749,S1A:1.3461):0.7606):0.3673);
+((((S1B:0.0573,S1C:0.0573):0.3884,S1A:0.4457):0.4876,S3:0.9333):1.9679,((S4:1.5713,S5:1.5713):0.9966,S2:2.568):0.3333);
+((S1A:3.036,((S5:0.7604,S4:0.7604):1.6442,(S3:1.1176,(S1C:0.0465,S1B:0.0465):1.0711):1.287):0.6314):1.7552,S2:4.7912);
+((S2:1.4523,((S1B:0.4758,(S1A:0.0138,S1C:0.0138):0.462):0.3946,S3:0.8704):0.5818):1.3858,(S5:1.6356,S4:1.6356):1.2024);
+((S2:2.3144,(S4:0.7552,S5:0.7552):1.5592):0.8309,(S3:1.1027,((S1A:0.1344,S1C:0.1344):0.5612,S1B:0.6956):0.4071):2.0425);
+(((S5:1.5464,S4:1.5464):1.9521,((S1B:0.6516,(S1A:0.0281,S1C:0.0281):0.6235):0.172,S3:0.8236):2.6749):1.8799,S2:5.3785);
+(((S4:1.5191,S2:1.5191):0.7628,((S1A:1.1262,(S1B:0.3126,S1C:0.3126):0.8135):0.0776,S3:1.2038):1.0782):2.4257,S5:4.7076);
+(((S3:1.8527,(S1C:1.2485,(S1B:0.4803,S1A:0.4803):0.7682):0.6042):0.4991,(S5:1.7785,S4:1.7785):0.5734):0.1705,S2:2.5224);
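The example trees above carry three tips for species S1 (`S1A`, `S1B`, `S1C`) and one tip for each other species. As a rough sketch of what a taxon map for these tips might look like, the code below builds a dictionary from tip label to species name. The rule used here (strip one trailing capital letter) is an assumption for illustration only; in practice the map should be read from the mapping file, whose contents are not shown in this commit.

```julia
# Tip labels that appear in the example gene trees.
tips = ["S1A", "S1B", "S1C", "S2", "S3", "S4", "S5"]

# Hypothetical naming rule, for illustration: drop one trailing capital letter.
# The real mapping comes from the mapping file, not from a naming rule.
taxonmap = Dict(tip => replace(tip, r"[A-Z]$" => "") for tip in tips)

# Count how many individuals map to each species.
counts = Dict{String,Int}()
for sp in values(taxonmap)
    counts[sp] = get(counts, sp, 0) + 1
end
println(counts)
```

A dictionary of this shape is what the manual passes as the second argument to `countquartetsintrees`.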
diff --git a/src/multipleAlleles.jl b/src/multipleAlleles.jl
@@ -33,8 +33,8 @@ in the example below, this file is best read later with the option
 ```julia
 mapAllelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
                   filename = "quartetCF_speciesNames.csv")
-df_sp = DataFrame(CSV.File("quartetCF_speciesNames.csv"); copycols=false); # DataFrame object
-dataCF_specieslevel = readTableCF!(df_sp); # DataCF object
+df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
+dataCF_specieslevel = readTableCF!(df_sp, mergerows=true); # DataCF object
 ```
 """
 function mapAllelesCFtable(alleleDF::AbstractString, cfDF::AbstractString;
@@ -65,12 +65,12 @@ as its first argument.
 function mapAllelesCFtable!(cfDF::DataFrame, alleleDF::DataFrame, co::Vector{Int},write::Bool,filename::AbstractString)
     size(cfDF,2) >= 7 || error("CF DataFrame should have 7+ columns: 4taxa, 3CF, and possibly ngenes")
     if length(co)==0 co=[1,2,3,4]; end
-    compareTaxaNames(alleleDF,cfDF,co)
+    allelecol, speciescol = compareTaxaNames(alleleDF,cfDF,co)
     for j in 1:4
         for ia in 1:size(alleleDF,1) # for all alleles
             cfDF[!,co[j]] = map(x->replace(string(x),
-                                         Regex("^$(string(alleleDF[ia,:allele]))\$") =>
-                                         alleleDF[ia,:species]),
+                                         Regex("^$(string(alleleDF[ia,allelecol]))\$") =>
+                                         alleleDF[ia,speciescol]),
                                 cfDF[!,co[j]])
         end
     end
@@ -85,98 +85,50 @@ end
 # inside readTableCF!
 # by deleting rows that are not informative like sp1 sp1 sp1 sp2
 # keepOne=true: we only keep one allele per species
-function cleanAlleleDF!(newdf::DataFrame, cols::Vector{Int};keepOne=false::Bool)
-    withngenes = (length(cols)==8)
+function cleanAlleleDF!(newdf::DataFrame, cols::Vector{<:Integer}; keepOne=false::Bool)
     delrows = Int[] # indices of rows to delete
-    repSpecies = String[]
+    repSpecies = Set{String}()
     if(isa(newdf[1,cols[1]],Integer)) #taxon names as integers: we need this to be able to add __2
-        newdf[!,cols[1]] = map(string, newdf[!,cols[1]])
-        newdf[!,cols[2]] = map(string, newdf[!,cols[2]])
-        newdf[!,cols[3]] = map(string, newdf[!,cols[3]])
-        newdf[!,cols[4]] = map(string, newdf[!,cols[4]])
+        for j in 1:4
+            newdf[!,cols[j]] .= map(string, newdf[!,cols[j]])
+        end
     end
     row = Vector{String}(undef, 4)
-    for i in 1:size(newdf,1) #check all rows
-        @debug "row number: $i"
-        # fixit: check for no missing value, or error below
+    for i in 1:nrow(newdf)
         map!(j -> newdf[i,cols[j]], row, 1:4)
-        @debug "row $(row)"
         uniq = unique(row)
-        @debug "unique $(uniq)"
 
-        keep = false # default: used if 1 unique name, or 2 in some cases
         if(length(uniq) == 4)
-            keep = true
-        else
-            if(!keepOne)
-                if(length(uniq) == 3) #sp1 sp1 sp2 sp3
+            continue
+        end
+        # by now, at least 1 species is repeated
+        if !keepOne # then we may choose to keep this row
+            # 3 options: sp1 sp1 sp2 sp3; or sp1 sp1 sp2 sp2 (keep)
+            #         or sp1 sp1 sp1 sp2; or sp1 sp1 sp1 sp1 (do not keep)
+            keep = false
+            for u in uniq
+                ind = row .== u # indices of taxon names matching u
+                if sum(ind) == 2
                     keep = true
-                    for u in uniq
-                        @debug "u $(u), typeof $(typeof(u))"
-                        ind = row .== u #taxon names matching u
-                        @debug "taxon names matching u $(ind)"
-                        if(sum(ind) == 2)
-                            push!(repSpecies,string(u))
-                            found = false
-                            for k in 1:4
-                                if(ind[k])
-                                    if(found)
-                                        @debug "found the second one in k $(k), will change newdf[i,cols[k]] $(newdf[i,cols[k]]), typeof $(typeof(newdf[i,cols[k]]))"
-                                        newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
-                                        break
-                                    else
-                                        found = true
-                                    end
-                                end
-                            end
-                            break
-                        end
-                    end
-                elseif(length(uniq) == 2)
-                    # keep was initialized to false
-                    for u in uniq
-                        @debug "length uniq is 2, u $(u)"
-                        ind = row .== u
-                        if(sum(ind) == 1 || sum(ind) == 3)
-                            @debug "ind $(ind) is 1 or 3, should not keep"
-                            break
-                        elseif(sum(ind) == 2)
-                            @debug "ind $(ind) is 2, should keep"
-                            keep = true
-                            found = false
-                            push!(repSpecies,string(u))
-                            for k in 1:4
-                                if(ind[k])
-                                    if(found)
-                                        newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
-                                        break
-                                    else
-                                        found = true
-                                    end
-                                end
-                            end
-                        end
-                    end
+                    push!(repSpecies, string(u))
+                    # change the second instance of a repeated taxon name with suffix
+                    k = findlast(ind)
+                    newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
                 end
-                @debug "after if, keep is $(keep)"
             end
         end
         keep || push!(delrows, i)
-        @debug "" keep
     end
-    @debug "" delrows
-    @debug "" repSpecies
     nrows = size(newdf,1)
     nkeep = nrows - length(delrows)
     if nkeep < nrows
         print("""found $(length(delrows)) 4-taxon sets uninformative about between-species relationships, out of $(nrows).
               These 4-taxon sets will be deleted from the data frame. $nkeep informative 4-taxon sets will be used.
               """)
         nkeep > 0 || @warn "All 4-taxon subsets are uninformative, so the dataframe will be left empty"
-        deleterows!(newdf, delrows)
+        deleteat!(newdf, delrows) # deleteat! requires DataFrames 1.3
     end
-    # @show size(newdf)
-    return unique(repSpecies)
+    return collect(repSpecies)
 end
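The row-level rule in the rewritten `cleanAlleleDF!` (for `keepOne=false`) can be sketched on a single 4-taxon set, outside of any data frame. The function name `classifyrow` and the constant `SUFFIX` are hypothetical; `SUFFIX = "__2"` is an assumption based on the `__2` names used in the manual, since the value of `repeatAlleleSuffix` is not shown in this diff.

```julia
const SUFFIX = "__2"  # assumed value of repeatAlleleSuffix (not shown in the diff)

# Return (keep, newrow) for one 4-taxon set given as 4 species names.
# Mirrors the keepOne=false branch only.
function classifyrow(row::Vector{String})
    newrow = copy(row)
    uniq = unique(row)
    length(uniq) == 4 && return (true, newrow)  # 4 distinct species: keep as is
    keep = false
    for u in uniq
        ind = row .== u                # positions of taxon names equal to u
        if sum(ind) == 2               # species present exactly twice: informative
            keep = true
            newrow[findlast(ind)] = u * SUFFIX  # rename the second instance
        end
    end
    return (keep, newrow)
end

println(classifyrow(["sp1", "sp1", "sp2", "sp3"]))
println(classifyrow(["sp1", "sp1", "sp1", "sp2"]))
```

Under this rule, sets such as `sp1 sp1 sp2 sp3` and `sp1 sp1 sp2 sp2` are kept, while `sp1 sp1 sp1 sp2` and `sp1 sp1 sp1 sp1` are flagged for deletion, matching the comments in the diff.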
 
 
@@ -240,10 +192,9 @@ end
 # function to compare the taxon names in the allele-species matching table
 # and the CF table
 function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
-    checkMapDF(alleleDF)
-    #println("found $(length(alleleDF[1])) allele-species matches")
+    allelecol, speciescol = checkMapDF(alleleDF)
     CFtaxa = string.(mapreduce(x -> unique(skipmissing(x)), union, eachcol(cfDF[!,co[1:4]])))
-    alleleTaxa = map(string, alleleDF[!,:allele]) # as string, too
+    alleleTaxa = map(string, alleleDF[!,allelecol]) # as string, too
     sizeCF = length(CFtaxa)
     sizeAllele = length(alleleTaxa)
     if sizeAllele > sizeCF
@@ -260,14 +211,26 @@ function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
         for n in unchanged warnmsg *= " $n"; end
         @warn warnmsg
     end
-    return nothing
+    return allelecol, speciescol
 end
 
-# function to check that the allele df has one column labelled alleles and one column labelled species
+"""
+    checkMapDF(mapping_allele2species::DataFrame)
+
+Check that the data frame has one column named "allele" or "individual",
+and one column named "species". Output: indices of these columns.
+"""
 function checkMapDF(alleleDF::DataFrame)
     size(alleleDF,2) >= 2 || error("Allele-Species matching Dataframe should have at least 2 columns")
-    :allele in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named allele")
-    :species in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named species")
+    colnames = DataFrames.propertynames(alleleDF)
+    allelecol = findfirst(x -> x == :allele, colnames)
+    if isnothing(allelecol)
+        allelecol = findfirst(x -> x == :individual, colnames)
+    end
+    isnothing(allelecol) && error("In allele mapping file there is no column named 'allele' or 'individual'")
+    speciescol = findfirst(x -> x == :species, colnames)
+    isnothing(speciescol) && error("In allele mapping file there is no column named species")
+    return allelecol, speciescol
 end
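The column lookup added to `checkMapDF` can be tried on a plain vector of column names, without a data frame. The sketch below follows the same order of preference (`allele` first, then `individual`); the function name `findmapcolumns` and the shortened error messages are hypothetical.

```julia
# Return the positions of the allele (or individual) column and the species column.
# `colnames` plays the role of DataFrames.propertynames(alleleDF).
function findmapcolumns(colnames::Vector{Symbol})
    allelecol = findfirst(==(:allele), colnames)
    if isnothing(allelecol)            # fall back on a column named "individual"
        allelecol = findfirst(==(:individual), colnames)
    end
    isnothing(allelecol) && error("no column named 'allele' or 'individual'")
    speciescol = findfirst(==(:species), colnames)
    isnothing(speciescol) && error("no column named 'species'")
    return allelecol, speciescol
end

println(findmapcolumns([:individual, :species]))
println(findmapcolumns([:species, :allele, :note]))
```

Returning the column positions, rather than assuming a column named `allele`, is what lets `mapAllelesCFtable!` index the mapping table with `allelecol` and `speciescol` in the diff above.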