Skip to content
This repository has been archived by the owner on Aug 27, 2024. It is now read-only.

Multi corpus v2 #43

Merged
merged 4 commits into from
Nov 6, 2023
Merged

Multi corpus v2 #43

merged 4 commits into from
Nov 6, 2023

Conversation

AlexArtrip
Copy link
Collaborator

Implementation of a basic multi-corpus search.

Also, modified the recall dp endpoint to include the scores of the retrieved results.

from memas.interface.storage_driver import DocumentEntity
from memas.interface.exceptions import SentenceLengthOverflowException


def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: multi

# Combine the results and only take the top ones
search_results.sort(key=lambda x: x[0], reverse=True)
# Execute a multicorpus search
# Need to refactor to remove ctx later and have a cleaner solution, but thats time i dont have right now : (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add TODO

from memas.interface.storage_driver import DocumentEntity
from memas.interface.exceptions import SentenceLengthOverflowException


def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: incomplete type, what type are the keys of the corpus_sets dict?

results = defaultdict(list)

# Direct each multicorpus search to the right algorithm
for corpusType, corpora_list in corpus_sets.items() :
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename variable name to comply with python standardss

"""
All corpora here should be of the same CorpusType implementation (basic_corpus)
"""
def basic_corpora_search(corpora: list[Corpus], clue: str, ctx) -> list[tuple[float, str, Citation]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire function looks pretty sus... It looks like we're jumping out of the corpus implementation and redoing the basic corpus search again? We can keep this for now, but it'd be best to implement this in a way that properly modularizes the logic (feel free to refactor any interfaces that get in the way).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it is basically a redo of basic corpus search, which isn't ideal. Of course modular code is the way to go, and I was thinking the same thing when I worked on that function. It isn't straightforward without a larger refactor, which I didn't want to do before talking to you. I can work on a possible refactor for that and then we can discuss it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, then sure let's keep this for now


# TODO : It will improve Query speed significantly to fetch citations after determining which documents to send to user

# Take only top few scores and remove scoring element before sending
return [{"document": doc, "citation": asdict(citation)} for doc, citation in search_results[0:5]]
return [{"score" : score, "document": doc, "citation": asdict(citation)} for score, doc, citation in search_results[0:5]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose the scores of the search results? What benefits does it add?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also a little iffy on whether exposing scoring to users is worth doing, so I leave it up to you.

I was thinking maybe we do want to expose the results to give developers more insight into what exactly they are retrieving and how they MAY compare with one another. When the corpora scoring isn't comparable it doesn't have obvious value, but I was thinking of how ElasticSearch queries return with scores as well (that can't necessarily be compared with other ES queries).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally think let's leave it out for now, unless there's a strong demand. Main reason being it complicates the interpretation of these results, and is something we can add quite easily

break
combined_results.append(sorted_results_matrix[i][j])
if len(combined_results) >= result_limit:
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This segment for combining the results confuse me. What's the purpose of these two loops, is it to extract an equal number of results from each corpus type equally? In which case wouldn't just extracting the top result_limit/n, where n is the number of corpus types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two loops are for extracting result_limit results, preferably equally for each different (non-comparable) way of corpus scoring. Your suggestion was my initial plan for how to do it, but it gets more complicated when you can't guarantee there are result_limit/n results to fetch from each corpus. The loops are one way of dealing with that while also ordering the results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, sounds good

@maxyu1115 maxyu1115 merged commit e73a05e into main Nov 6, 2023
1 check passed
@maxyu1115 maxyu1115 deleted the multi_corpusV2 branch November 6, 2023 07:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants