Multi corpus v2 #43

AlexArtrip · 2023-11-05T03:28:39Z

Implementation of a basic multi-corpus search.

Also, modified the recall dp endpoint to include the scores of the retrieved results.

maxyu1115 · 2023-11-05T06:15:31Z

memas/corpus/corpus_searching.py

 from memas.interface.storage_driver import DocumentEntity
 from memas.interface.exceptions import SentenceLengthOverflowException


-def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
+def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:


maxyu1115 · 2023-11-05T06:15:55Z

memas/dataplane.py

-    # Combine the results and only take the top ones
-    search_results.sort(key=lambda x: x[0], reverse=True)
+    # Execute a multicorpus search
+    # Need to refactor to remove ctx later and have a cleaner solution, but thats time i dont have right now : (


nit: add TODO

maxyu1115 · 2023-11-05T06:18:24Z

memas/corpus/corpus_searching.py

 from memas.interface.storage_driver import DocumentEntity
 from memas.interface.exceptions import SentenceLengthOverflowException


-def corpora_search(corpus_ids: list[UUID], clue: str) -> list[tuple[float, str, Citation]]:
+def mult_corpus_search(corpus_sets : dict[Corpus], clue, ctx, result_limit) -> list[tuple[float, str, Citation]]:


nit: incomplete type annotation. What type are the keys of the corpus_sets dict?
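For illustration, a fully annotated signature might look like the sketch below. It assumes the dict is keyed by a corpus type enum and maps to a list of corpora, which is what the loop over corpus_sets.items() in this PR suggests. CorpusType (including its member names), Corpus, and Citation are stand-in definitions here, not the real memas classes, and ctx is typed as Any because the diff does not show its type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class CorpusType(Enum):
    # Stand-in enum; member names are hypothetical.
    TYPE_A = "type_a"
    TYPE_B = "type_b"


@dataclass
class Corpus:
    # Stand-in for the project's Corpus class.
    corpus_id: str
    corpus_type: CorpusType


@dataclass
class Citation:
    # Stand-in for the project's Citation class.
    source_uri: str


def mult_corpus_search(
    corpus_sets: dict[CorpusType, list[Corpus]],
    clue: str,
    ctx: Any,
    result_limit: int,
) -> list[tuple[float, str, Citation]]:
    """Signature sketch only; the body is intentionally omitted."""
    raise NotImplementedError
```

With both the key and value types spelled out, a type checker can flag callers that pass a flat list of corpora or key the dict by something other than the corpus type.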

maxyu1115 · 2023-11-05T06:18:49Z

memas/corpus/corpus_searching.py

+    results = defaultdict(list)
+
+    # Direct each multicorpus search to the right algorithm
+    for corpusType, corpora_list in corpus_sets.items() :


nit: rename the variable to comply with Python naming standards

maxyu1115 · 2023-11-05T06:23:17Z

memas/corpus/corpus_searching.py

+"""
+All corpora here should be of the same CorpusType implementation (basic_corpus)
+"""
+def basic_corpora_search(corpora: list[Corpus], clue: str, ctx) -> list[tuple[float, str, Citation]]:


this entire function looks pretty sus... It looks like we're jumping out of the corpus implementation and redoing the basic corpus search again? We can keep this for now, but it'd be best to implement this in a way that properly modularizes the logic (feel free to refactor any interfaces that get in the way).

You're right that it is basically a redo of basic corpus search, which isn't ideal. Of course modular code is the way to go, and I was thinking the same thing when I worked on that function. It isn't straightforward without a larger refactor, which I didn't want to do before talking to you. I can work on a possible refactor for that and then we can discuss it.

Ah gotcha, then sure let's keep this for now

maxyu1115 · 2023-11-05T06:24:29Z

memas/dataplane.py


    # TODO : It will improve Query speed significantly to fetch citations after determining which documents to send to user

    # Take only top few scores and remove scoring element before sending
-    return [{"document": doc, "citation": asdict(citation)} for doc, citation in search_results[0:5]]
+    return [{"score" : score, "document": doc, "citation": asdict(citation)} for score, doc, citation in search_results[0:5]]


Do we want to expose the scores of the search results? What benefits does it add?

I'm also a little iffy on whether exposing scoring to users is worth doing, so I'll leave it up to you.

I was thinking maybe we do want to expose the scores to give developers more insight into what exactly they are retrieving and how the results MAY compare with one another. When the corpora scoring isn't comparable it doesn't have obvious value, but I was thinking of how Elasticsearch queries return scores as well (which can't necessarily be compared with those of other ES queries).

Personally, I think we should leave it out for now, unless there's strong demand. The main reason is that it complicates the interpretation of these results, and it is something we can add quite easily later.
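One way to keep both options open is to build the response in a small helper that drops the score by default. This is only a sketch of that idea, not code from the PR: format_results and its include_scores flag are hypothetical, and the Citation dataclass is a stand-in with assumed fields.

```python
from dataclasses import asdict, dataclass


@dataclass
class Citation:
    # Stand-in for the project's Citation class; fields are assumed.
    source_uri: str
    source_name: str


def format_results(search_results, limit=5, include_scores=False):
    """Convert (score, document, citation) tuples into response dicts.

    include_scores is a hypothetical flag; by default the score is dropped.
    """
    formatted = []
    for score, doc, citation in search_results[:limit]:
        entry = {"document": doc, "citation": asdict(citation)}
        if include_scores:
            entry["score"] = score
        formatted.append(entry)
    return formatted
```

Keeping the score inside the tuples until the final formatting step means it can be exposed later by changing a single call site.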

maxyu1115 · 2023-11-05T06:40:43Z

memas/corpus/corpus_searching.py

+                break
+            combined_results.append(sorted_results_matrix[i][j])
+        if len(combined_results) >= result_limit:
+            break


This segment for combining the results confuses me. What's the purpose of these two loops? Is it to extract an equal number of results from each corpus type? In that case, wouldn't it be enough to just extract the top result_limit/n from each type, where n is the number of corpus types?

The two loops are for extracting result_limit results, preferably equally for each different (non-comparable) way of corpus scoring. Your suggestion was my initial plan for how to do it, but it gets more complicated when you can't guarantee there are result_limit/n results to fetch from each corpus. The loops are one way of dealing with that while also ordering the results.

Gotcha, sounds good
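The interleaving discussed in this thread can be sketched roughly as follows. This is a minimal illustration of a round-robin merge, not the code in the PR: the function name interleave_results and the shape of results_by_type (a dict from corpus type to a list of (score, document, citation) tuples) are assumptions.

```python
def interleave_results(results_by_type, result_limit):
    """Round-robin merge of per-type result lists.

    Scores are only compared within a corpus type, never across types.
    """
    # Sort each type's results best-first.
    sorted_lists = [
        sorted(results, key=lambda r: r[0], reverse=True)
        for results in results_by_type.values()
    ]
    longest = max((len(lst) for lst in sorted_lists), default=0)

    combined = []
    # Take rank 0 from every type, then rank 1, and so on.
    for rank in range(longest):
        for lst in sorted_lists:
            if len(combined) >= result_limit:
                return combined
            # Skip types that have already run out of results.
            if rank < len(lst):
                combined.append(lst[rank])
    return combined
```

With three results of one type, one of another, and a limit of 3, a fixed result_limit // n split would return only two results, whereas the interleave fills the limit from whichever type still has results.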

AlexArtrip added 2 commits November 3, 2023 02:47

basic multicorp implementation

6bf4c4c

basic multicorpus implemented

ef9a764

maxyu1115 reviewed Nov 5, 2023

AlexArtrip added 2 commits November 6, 2023 00:59

fixed nits & removed sending scores to users

707a17b

formatting

152b97d

maxyu1115 approved these changes Nov 6, 2023

maxyu1115 merged commit e73a05e into main Nov 6, 2023
1 check passed

maxyu1115 deleted the multi_corpusV2 branch November 6, 2023 07:13
