The match is inconsistent with sketch size #183

vvsbiocode · 2024-06-26T20:39:56Z

As I understand, the "match" output of mash distance is the number of matching hashes among all hashes. Number of all hashes is defined by sketch size. Right?
I call mash through python subprocess to calculate mash dist pairwisely from lists of query and reference. The k-mer is constant for all pairs and sketch size = length of the longest sequence in pair (I compare short sequnces - genes). I make a sketch file separatly for two sequnces and then compare them with "mash dist".
And sometimes in output I see that match value is different among pairs with same sketch size.
E.g. sketch size=3155, k-mer=11
pair1 match=44/44
pair2 match=0/88
Why could it happen?

def mash_dist_tune_pairwise(reference_seq_dict,query_seq_dict,fna_dir,k_mer,mashapp,threads):
#create matrix of scatch size for each comparison
scketch_size_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict))) #Create an empty array.
scketch_size_matrix.fill(np. NaN)
#create matrix of k-mer sizes for each comparison
kmer_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict)))
kmer_matrix.fill(k_mer)
#fill both matrix
for i1,key1 in enumerate(reference_seq_dict.keys()):
for i2,key2 in enumerate(query_seq_dict.keys()):
g=np.max([len(reference_seq_dict[key1]['Sequence']),len(query_seq_dict[key2]['Sequence'])])
scketch_size_matrix[i1,i2]=round(g)
#creat empty dict
dist_dict={}
#make sketch for each pair
for i1,key1 in enumerate(reference_seq_dict.keys()):
dist_dict[key1]={}
ref_filename="".join(["./",fna_dir,"/",key1,".fna"]) #sequence file path
for i2,key2 in enumerate(query_seq_dict.keys()):
query_filename="".join(["./",fna_dir,"/",key2,".fna"])
sketch_size=scketch_size_matrix[i1,i2]
k_mer=kmer_matrix[i1,i2]
#command
cmd_sketch1=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {ref_filename}"
cmd_sketch2=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {query_filename}"
#run
ref_sketch_res=subprocess.run([cmd_sketch1], shell=True, capture_output=True, text=True)
query_sketch_res=subprocess.run([cmd_sketch2], shell=True, capture_output=True, text=True)
#mash dist
with Popen(f"{mashapp} dist -p {threads} {ref_filename}.msh {query_filename}.msh", shell=True, stdout=PIPE) as process:
query_res = pd.read_csv(process.stdout,sep="\t",header=None)
query_res.index=[key1]
query_res=query_res.drop([0,1],axis=1)
query_res.columns=["mshdist","p","match"]
query_res_dict=query_res.to_dict(orient="dict")
dist_dict[key1][key2]=query_res_dict
return (dist_dict,scketch_size_matrix,kmer_matrix)

UPD: I think (but not shure) that the reason is that there are not enough small hashes to get up to 3000 hashes. My assumption is based on this line from publication:

For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum.

It is also evident from the fact that if I increas the sketch size (e.g. from 3000 to 10000), then the match is still the same (e.g. 44).
What strikes me is that pairs of sequences have identity score 97, which means they are highly similar. So this is weird that mash does not recognise this. I had to decrease the k-mer size to 4 to recieve an expected low mash distance...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The match is inconsistent with sketch size #183

The match is inconsistent with sketch size #183

vvsbiocode commented Jun 26, 2024 •

edited

Loading

The match is inconsistent with sketch size #183

The match is inconsistent with sketch size #183

Comments

vvsbiocode commented Jun 26, 2024 • edited Loading

vvsbiocode commented Jun 26, 2024 •

edited

Loading