As I understand it, the "match" value in the mash dist output is the number of matching hashes out of all hashes compared, and the total number of hashes is defined by the sketch size. Is that right?
I call mash through Python's subprocess module to compute pairwise Mash distances between lists of query and reference sequences. The k-mer size is constant for all pairs, and the sketch size is set to the length of the longer sequence in each pair (I am comparing short sequences: genes). I create a separate sketch file for each of the two sequences and then compare them with "mash dist".
Sometimes the match value, including its denominator, differs between pairs that have the same sketch size.
E.g. sketch size=3155, k-mer=11
pair1 match=44/44
pair2 match=0/88
Why could this happen?
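For reference, here is a minimal sketch of how the two match values above map to a Jaccard estimate and a Mash distance. It uses the distance formula from the Mash publication, D = -1/k * ln(2j / (1 + j)). The helper names `parse_match` and `mash_distance` are made up for illustration, and returning 1.0 when no hashes are shared is an assumption about how that case is reported, not something taken from the Mash source.

```python
import math


def parse_match(match):
    """Split a match string such as '44/88' into (shared, compared)."""
    shared, compared = match.split("/")
    return int(shared), int(compared)


def mash_distance(shared, compared, k):
    """Mash distance from shared-hash counts, using the published formula."""
    j = shared / compared  # Jaccard estimate
    if j == 0:
        return 1.0  # assumption: no shared hashes is reported as distance 1
    if j == 1:
        return 0.0  # identical sketches
    return -math.log(2 * j / (1 + j)) / k


for match in ("44/44", "0/88"):
    shared, compared = parse_match(match)
    print(match, mash_distance(shared, compared, k=11))
```

With these inputs, pair1 comes out at distance 0 and pair2 at distance 1, so the two pairs sit at opposite extremes.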
import subprocess
from subprocess import PIPE, Popen

import numpy as np
import pandas as pd


def mash_dist_tune_pairwise(reference_seq_dict, query_seq_dict, fna_dir, k_mer, mashapp, threads):
    shape = (len(reference_seq_dict), len(query_seq_dict))
    # matrix of sketch sizes, one entry per comparison
    sketch_size_matrix = np.full(shape, np.nan)
    # matrix of k-mer sizes, one entry per comparison (constant here)
    kmer_matrix = np.full(shape, float(k_mer))
    # sketch size = length of the longer sequence in each pair
    for i1, key1 in enumerate(reference_seq_dict):
        for i2, key2 in enumerate(query_seq_dict):
            sketch_size_matrix[i1, i2] = max(
                len(reference_seq_dict[key1]['Sequence']),
                len(query_seq_dict[key2]['Sequence']),
            )
    # results, keyed by reference and then by query
    dist_dict = {}
    for i1, key1 in enumerate(reference_seq_dict):
        dist_dict[key1] = {}
        ref_filename = f"./{fna_dir}/{key1}.fna"  # sequence file path
        for i2, key2 in enumerate(query_seq_dict):
            query_filename = f"./{fna_dir}/{key2}.fna"
            # the matrices hold floats; cast to int so the command line
            # gets e.g. "-s 3155" rather than "-s 3155.0"
            sketch_size = int(sketch_size_matrix[i1, i2])
            pair_k = int(kmer_matrix[i1, i2])
            # sketch both sequences; mash writes each sketch to <input>.msh
            for fna in (ref_filename, query_filename):
                cmd_sketch = f"{mashapp} sketch -p {threads} -k {pair_k} -s {sketch_size} {fna}"
                # check=True raises if mash exits with an error
                subprocess.run(cmd_sketch, shell=True, capture_output=True, text=True, check=True)
            # mash dist on the two sketch files
            cmd_dist = f"{mashapp} dist -p {threads} {ref_filename}.msh {query_filename}.msh"
            with Popen(cmd_dist, shell=True, stdout=PIPE) as process:
                query_res = pd.read_csv(process.stdout, sep="\t", header=None)
            query_res.index = [key1]
            query_res = query_res.drop([0, 1], axis=1)  # drop the two ID columns
            query_res.columns = ["mshdist", "p", "match"]
            dist_dict[key1][key2] = query_res.to_dict(orient="dict")
    return (dist_dict, sketch_size_matrix, kmer_matrix)
UPD: I think (but am not sure) that the reason is that the sequences do not yield enough hashes to fill a sketch of about 3000 hashes. My assumption is based on this line from the publication:
For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum.
This is also supported by the fact that if I increase the sketch size (e.g. from 3000 to 10000), the match value stays the same (e.g. 44).
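To illustrate the hypothesis, here is a toy bottom-s sketch in Python. It is only a simplified model, not Mash's implementation: `toy_hash` is a stand-in hash, canonical (reverse-complement) k-mers are ignored, and `toy_match` assumes the denominator is the number of hashes compared in the union of the two sketches, capped at s. All function names and the two example sequences are made up for illustration.

```python
import hashlib
import heapq


def toy_hash(kmer):
    """64-bit stand-in hash (not the hash function Mash uses)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_sketch(seq, k, s):
    """Return up to s smallest distinct k-mer hashes of seq, ascending."""
    hashes = {toy_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return heapq.nsmallest(s, hashes)


def toy_match(sketch_a, sketch_b, s):
    """Return (shared, compared) over the s smallest hashes of the union."""
    union = heapq.nsmallest(s, set(sketch_a) | set(sketch_b))
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in union), len(union)


seq_a = "ACGTTGCAAGCTTAGC"  # 16 bases, 13 distinct 4-mers
seq_b = "GGATCCATTGACCTGA"  # 16 bases, 13 distinct 4-mers, none shared with seq_a

for s in (5, 1000, 10000):
    a = bottom_sketch(seq_a, 4, s)
    b = bottom_sketch(seq_b, 4, s)
    print(s, len(a), toy_match(a, a, s), toy_match(a, b, s))
```

In this model the sketch stops growing once s exceeds the number of distinct k-mers, so raising s from 1000 to 10000 changes nothing. Identical sequences give 13/13 and disjoint ones give 0/26, the same pattern as 44/44 and 0/88.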
What strikes me is that these pairs of sequences have an identity score of 97, which means they are highly similar, so it is odd that mash does not recognise this. I had to decrease the k-mer size to 4 to get the expected low Mash distance...
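As a rough sanity check on that expectation, the Mash distance formula from the publication can be inverted to give the Jaccard index one would expect for a given distance: j = 1 / (2 * exp(k * D) - 1). The sketch below assumes that an identity score of 97 corresponds to a Mash distance of about 0.03, which is a simplification, and `expected_jaccard` is a hypothetical helper name.

```python
import math


def expected_jaccard(distance, k):
    """Invert D = -1/k * ln(2j / (1 + j)) to get j for a given distance."""
    return 1.0 / (2.0 * math.exp(k * distance) - 1.0)


# assumption: identity score 97 is treated as Mash distance 0.03
for k in (4, 11):
    print(k, round(expected_jaccard(0.03, k), 2))
```

Under these assumptions, more than half of the compared hashes would be expected to match at k=11, which is why a result of 0/88 looks inconsistent with 97% identity.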