Conversation
memas/corpus/corpus_searching.py (Outdated)

  from memas.interface.storage_driver import DocumentEntity
  from memas.interface.exceptions import SentenceLengthOverflowException

- def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
+ def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:
nit: multi
memas/dataplane.py (Outdated)

  # Combine the results and only take the top ones
  search_results.sort(key=lambda x: x[0], reverse=True)
  # Execute a multicorpus search
  # Need to refactor to remove ctx later and have a cleaner solution, but thats time i dont have right now : (
nit: add TODO
memas/corpus/corpus_searching.py (Outdated)

  from memas.interface.storage_driver import DocumentEntity
  from memas.interface.exceptions import SentenceLengthOverflowException

- def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
+ def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:
nit: incomplete type, what type are the keys of the corpus_sets dict?
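For illustration, a fuller annotation might look like the sketch below. This assumes the dict is meant to map each corpus type to the corpora that should be searched with that algorithm (the later `for corpusType, corpora_list in corpus_sets.items()` loop suggests this); the exact signature is an assumption, not the PR's code.

```python
from __future__ import annotations  # keep the memas type names as plain annotations in this sketch


# Sketch only: CorpusType, Corpus and Citation stand for the existing memas
# interface types; the parameter types are an assumed completion.
def multi_corpus_search(
    corpus_sets: dict[CorpusType, list[Corpus]],
    clue: str,
    ctx,
    result_limit: int,
) -> list[tuple[float, str, Citation]]:
    ...
```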
memas/corpus/corpus_searching.py (Outdated)

  results = defaultdict(list)

  # Direct each multicorpus search to the right algorithm
  for corpusType, corpora_list in corpus_sets.items() :
nit: rename the variable to comply with Python naming standards
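For example, the loop variable could be renamed to snake_case (illustrative fragment only; `corpus_sets` is stubbed out so the example runs standalone):

```python
corpus_sets = {}  # stub so this fragment runs on its own

# snake_case per PEP 8: corpus_type instead of corpusType
for corpus_type, corpora_list in corpus_sets.items():
    ...
```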
""" | ||
All corpora here should be of the same CorpusType implementation (basic_corpus) | ||
""" | ||
def basic_corpora_search(corpora: list[Corpus], clue: str, ctx) -> list[tuple[float, str, Citation]]: |
this entire function looks pretty sus... It looks like we're jumping out of the corpus implementation and redoing the basic corpus search again? We can keep this for now, but it'd be best to implement this in a way that properly modularizes the logic (feel free to refactor any interfaces that get in the way).
You're right that it is basically a redo of basic corpus search, which isn't ideal. Of course modular code is the way to go, and I was thinking the same thing when I worked on that function. It isn't straightforward without a larger refactor, which I didn't want to do before talking to you. I can work on a possible refactor for that and then we can discuss it.
Ah gotcha, then sure let's keep this for now
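For later reference, one possible shape for that refactor is sketched below: push the per-corpus search behind the Corpus interface so the multi-corpus layer only aggregates. The `search` method, the placeholder `Citation`, and the overall structure are assumptions for illustration, not the current memas interfaces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Citation:        # placeholder standing in for memas' Citation type
    document_name: str


class Corpus(ABC):     # placeholder standing in for the existing Corpus interface
    @abstractmethod
    def search(self, clue: str) -> list[tuple[float, str, Citation]]:
        """Return (score, document, citation) tuples for this corpus."""


def basic_corpora_search(corpora: list[Corpus], clue: str) -> list[tuple[float, str, Citation]]:
    # Aggregate per-corpus results instead of re-implementing the search logic here.
    results: list[tuple[float, str, Citation]] = []
    for corpus in corpora:
        results.extend(corpus.search(clue))
    results.sort(key=lambda x: x[0], reverse=True)
    return results
```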
memas/dataplane.py (Outdated)

  # TODO : It will improve Query speed significantly to fetch citations after determining which documents to send to user

  # Take only top few scores and remove scoring element before sending
- return [{"document": doc, "citation": asdict(citation)} for doc, citation in search_results[0:5]]
+ return [{"score" : score, "document": doc, "citation": asdict(citation)} for score, doc, citation in search_results[0:5]]
Do we want to expose the scores of the search results? What benefits does it add?
I'm also a little iffy on whether exposing scoring to users is worth doing, so I'll leave it up to you.
I was thinking we might want to expose the scores to give developers more insight into what exactly they are retrieving and how results MAY compare with one another. When the corpora scoring isn't comparable it doesn't have obvious value, but I was thinking of how Elasticsearch queries return scores as well (which can't necessarily be compared with other ES queries).
Personally I think we should leave it out for now, unless there's strong demand. The main reason is that it complicates the interpretation of these results, and it's something we can add back quite easily.
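If the scores do get dropped from the payload, the recall response could be built roughly like this (a sketch based on the quoted return statement; `format_recall_response` and `top_k` are hypothetical names, and the 5-result cap mirrors the `[0:5]` slice in the diff):

```python
from dataclasses import asdict


def format_recall_response(search_results, top_k: int = 5):
    # Sort by score (descending) for ranking, then strip the score from the payload.
    search_results.sort(key=lambda x: x[0], reverse=True)
    return [
        {"document": doc, "citation": asdict(citation)}
        for _score, doc, citation in search_results[:top_k]
    ]
```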
          break
      combined_results.append(sorted_results_matrix[i][j])
      if len(combined_results) >= result_limit:
          break
This segment for combining the results confuses me. What's the purpose of these two loops? Is it to extract an equal number of results from each corpus type? In that case, wouldn't just extracting the top result_limit/n from each, where n is the number of corpus types, be enough?
The two loops are for extracting result_limit results, preferably equally for each different (non-comparable) way of corpus scoring. Your suggestion was my initial plan for how to do it, but it gets more complicated when you can't guarantee there are result_limit/n results to fetch from each corpus. The loops are one way of dealing with that while also ordering the results.
Gotcha, sounds good
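For readers following the thread, here is a hedged sketch of that kind of interleaving: walk the per-type sorted result lists round-robin, skip types that run out early, and stop once `result_limit` results are collected. The function name and loop structure are illustrative; the PR's actual code may differ.

```python
def interleave_results(sorted_results_matrix, result_limit):
    """Combine per-corpus-type result lists (each sorted by its own,
    mutually non-comparable score) as evenly as possible, up to result_limit.
    """
    combined_results = []
    max_len = max((len(row) for row in sorted_results_matrix), default=0)
    for j in range(max_len):                 # j-th best result of each corpus type
        for row in sorted_results_matrix:
            if j < len(row):                 # some types may have fewer results than result_limit/n
                combined_results.append(row[j])
                if len(combined_results) >= result_limit:
                    return combined_results
    return combined_results
```

Called as `interleave_results([[("a", 0.9)], [("b", 0.8), ("c", 0.7)]], 2)`, this returns one result from each type before taking a second from either, which matches the "preferably equally" behaviour described above.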
Implementation of a basic multi-corpus search. Also modified the recall dataplane endpoint to include the scores of the retrieved results.