Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The match is inconsistent with sketch size #183

Open
vvsbiocode opened this issue Jun 26, 2024 · 0 comments
Open

The match is inconsistent with sketch size #183

vvsbiocode opened this issue Jun 26, 2024 · 0 comments

Comments

@vvsbiocode
Copy link

vvsbiocode commented Jun 26, 2024

As I understand, the "match" output of mash distance is the number of matching hashes among all hashes. Number of all hashes is defined by sketch size. Right?
I call mash through python subprocess to calculate mash dist pairwisely from lists of query and reference. The k-mer is constant for all pairs and sketch size = length of the longest sequence in pair (I compare short sequnces - genes). I make a sketch file separatly for two sequnces and then compare them with "mash dist".
And sometimes in output I see that match value is different among pairs with same sketch size.
E.g. sketch size=3155, k-mer=11
pair1 match=44/44
pair2 match=0/88
Why could it happen?

def mash_dist_tune_pairwise(reference_seq_dict,query_seq_dict,fna_dir,k_mer,mashapp,threads):
#create matrix of scatch size for each comparison
scketch_size_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict))) #Create an empty array.
scketch_size_matrix.fill(np. NaN)
#create matrix of k-mer sizes for each comparison
kmer_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict)))
kmer_matrix.fill(k_mer)
#fill both matrix
for i1,key1 in enumerate(reference_seq_dict.keys()):
for i2,key2 in enumerate(query_seq_dict.keys()):
g=np.max([len(reference_seq_dict[key1]['Sequence']),len(query_seq_dict[key2]['Sequence'])])
scketch_size_matrix[i1,i2]=round(g)
#creat empty dict
dist_dict={}
#make sketch for each pair
for i1,key1 in enumerate(reference_seq_dict.keys()):
dist_dict[key1]={}
ref_filename="".join(["./",fna_dir,"/",key1,".fna"]) #sequence file path
for i2,key2 in enumerate(query_seq_dict.keys()):
query_filename="".join(["./",fna_dir,"/",key2,".fna"])
sketch_size=scketch_size_matrix[i1,i2]
k_mer=kmer_matrix[i1,i2]
#command
cmd_sketch1=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {ref_filename}"
cmd_sketch2=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {query_filename}"
#run
ref_sketch_res=subprocess.run([cmd_sketch1], shell=True, capture_output=True, text=True)
query_sketch_res=subprocess.run([cmd_sketch2], shell=True, capture_output=True, text=True)
#mash dist
with Popen(f"{mashapp} dist -p {threads} {ref_filename}.msh {query_filename}.msh", shell=True, stdout=PIPE) as process:
query_res = pd.read_csv(process.stdout,sep="\t",header=None)
query_res.index=[key1]
query_res=query_res.drop([0,1],axis=1)
query_res.columns=["mshdist","p","match"]
query_res_dict=query_res.to_dict(orient="dict")
dist_dict[key1][key2]=query_res_dict
return (dist_dict,scketch_size_matrix,kmer_matrix)

UPD: I think (but not shure) that the reason is that there are not enough small hashes to get up to 3000 hashes. My assumption is based on this line from publication:

For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum.

It is also evident from the fact that if I increas the sketch size (e.g. from 3000 to 10000), then the match is still the same (e.g. 44).
What strikes me is that pairs of sequences have identity score 97, which means they are highly similar. So this is weird that mash does not recognise this. I had to decrease the k-mer size to 4 to recieve an expected low mash distance...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant