Replies: 10 comments 2 replies
-
Hi, thanks for trying the project!
As long as you have some way to convert your documents to vectors, you should be able to use the plugin.
It could be a simple "one-hot" encoding, in which case you'd store your vectors using the sparse boolean datatype and search with Jaccard or Hamming similarity. However, that's probably not much better than using Elasticsearch's keyword search and "more like this" queries.
If you have a model like doc2vec or something similar for converting documents to vectors, then you would store your vectors using the dense float datatype and probably use angular similarity.
I'm still working a lot on performance for approximate queries. You'll probably get the best results if you can narrow searches down to 10-50k documents and then run exact similarity search. There's an example of this in the docs:
http://elastiknn.klibisz.com/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
Lmk if you have any questions or issues.
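To make the dense-float route concrete, here is a rough sketch of what the mapping and an exact query body might look like, written in Python for readability. The index and field names (`papers`, `abstract_vec`) are made up, and the exact mapping/query syntax is from memory, so treat the linked API docs as authoritative:

```python
import json

DIMS = 64  # dimensionality of your document vectors (model-dependent)

# Mapping sketch: store each abstract's vector as a dense float vector.
mapping = {
    "properties": {
        "abstract_vec": {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {"dims": DIMS},
        }
    }
}

def exact_query(query_vec):
    """Exact angular nearest-neighbors query against abstract_vec."""
    return {
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": "abstract_vec",
                "vec": {"values": query_vec},
                "model": "exact",
                "similarity": "angular",
            }
        }
    }

# These bodies would be sent with the usual Elasticsearch REST calls,
# e.g. PUT /papers with the mapping and POST /papers/_search with the query.
print(json.dumps(exact_query([0.0] * DIMS))[:60])
```

The mapping goes in when you create the index; the query body is just a regular Elasticsearch search request, so it composes with the rest of the query DSL.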
-
Hi, thanks for the reply.
I'm a newbie in this field, but I have 230k documents, in fact paragraphs of arXiv abstracts, as I mentioned above. Can you help me implement the model? I can export the data, but I'm lost on all the other steps ^^. Thanks again.
Btw, do you have a Twitter account so I can maybe DM you and not pollute this issue with my other questions?
Cheers
-
Word/document vectors are a broad topic. The last time I studied them was about three years ago.
The basic concept is that you start with a completely random vector for every word and do some sort of optimization or matrix decomposition to move the vectors around so that they reflect similarity. For example, in the negative-sampling method you look at two vectors at a time: if the two words occurred within some common window of text, your optimizer should change their values to move them closer together; otherwise it should move them apart. Repeat this over a large corpus and you'll eventually see properties like "man + king - woman ~= queen". For document vectors you need some way to combine the word vectors into a single vector, e.g. averaging them, but in my experience you need to do something more complicated than averaging to get good results.
Here's an example I wrote up a while ago:
https://www.kaggle.com/alexklibisz/simple-word-vectors-with-co-occurrence-pmi-and-svd
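The co-occurrence/PMI/SVD recipe from that notebook can be sketched in a few lines of numpy. The toy corpus, window size, and dimensionality below are made up for illustration:

```python
import numpy as np

corpus = [
    "neural networks learn representations".split(),
    "neural networks learn features".split(),
    "papers cite other papers".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-2 word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information: log(P(w,c) / (P(w) * P(c))),
# clipped at zero.
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD yields dense word vectors; rows of `vecs` are embeddings.
u, s, _ = np.linalg.svd(ppmi)
dim = 2
vecs = u[:, :dim] * s[:dim]

# A naive document vector: the average of the document's word vectors.
doc_vec = vecs[[idx[w] for w in corpus[0]]].mean(axis=0)
```

On a real corpus you'd use a proper sparse co-occurrence matrix and truncated SVD (e.g. scikit-learn's `TruncatedSVD`), but the structure of the computation is the same.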
The gensim python library implements some models, has good docs and is
pretty user friendly. I've also heard good things about fast.ai's work on
this topic.
You can probably study and experiment with some of these techniques before
you worry about a storage/search solution like elastiknn. The model and the
storage/search are separable problems.
I've set up a Gitter channel, but it's not very active yet. I likely won't be online much this week either, other than for work.
-
Hi @x0rzkov, I'm gonna go ahead and close this. However, please feel free to comment and I'll re-open when you're at a point where you can generate vectors for your abstracts, and we can talk more about how to use ElastiKnn for the storage/search functionality.
-
I missed this earlier. I don't have a Twitter at the moment. Too distracting for me right now! :)
-
No worries, thanks for your help.
I found https://github.com/DeviantPadam/ResearchPaperScholarlyArticlesRecSystem which is close to what I intend to do, using the following dataset: https://www.kaggle.com/neelshah18/arxivdataset
I'll try to figure out how to convert this dataset (the summary field, of course) into vectors. Do you think that I can use other attributes like tags and authors for refining the search with elastiknn?
-
> Do you think that I can use other attributes like tags and authors for refining the search with elastiknn?

You definitely can. There is an example of this in the API docs.
Your project sounds exciting. Looking forward to seeing how it goes!
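The usual pattern for this is to put the metadata filters and the vector query in the same `bool` query, so the nearest-neighbors scoring only runs over the filtered subset. The field names below (`tags`, `abstract_vec`) are hypothetical and the exact Elastiknn clause syntax is from memory, so check the "filtered subset" example in the API docs:

```python
def filtered_query(tag, query_vec):
    """Filter by a tag, then score the survivors by vector similarity."""
    return {
        "query": {
            "bool": {
                # Cheap metadata filters narrow the candidate set first.
                "filter": [{"term": {"tags": tag}}],
                # The (assumed) Elastiknn clause scores what remains.
                "must": {
                    "elastiknn_nearest_neighbors": {
                        "field": "abstract_vec",
                        "vec": {"values": query_vec},
                        "model": "exact",
                        "similarity": "angular",
                    }
                },
            }
        }
    }
```

Any standard Elasticsearch clause (terms on authors, ranges on dates, etc.) can go in the `filter` list, which is exactly the "narrow down to 10-50k docs, then exact search" strategy mentioned earlier in the thread.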
-
Hi @x0rzkov and @alexklibisz, thanks for having this wonderful discussion here. (Although I am 2+ years late, I would like to participate, please.)
I need to use elastiknn for a similar use case, i.e. finding documents (paragraphs) similar to/matching a query, and then refining and finding the answer within the top 3-5 paragraphs.
Example use case: I have my documents converted into dense floating-point vectors. Please help by pointing me to an Elastiknn example/pseudo-code to achieve this in the best possible way.
-
Hi @DesiKeki
I don't know of a public example of this particular use-case. There's an example here based on image vectors: https://elastiknn.com/tutorials/multimodal-search-amazon-products-dataset/.
If every doc is a single vector, then you can index and query the docs directly.
If you have multiple vectors per text document, then you could maybe create a new ES doc for every vector and just store a pointer back to the original text document.
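The "one ES doc per vector with a pointer back" idea might look something like this. All field names here (`doc_id`, `paragraph_ix`, `abstract_vec`) are hypothetical, just to show the shape of the data:

```python
def explode(doc_id, paragraph_vectors):
    """Yield one indexable ES document per paragraph vector."""
    for i, vec in enumerate(paragraph_vectors):
        yield {
            "_id": f"{doc_id}-{i}",   # unique id per paragraph doc
            "doc_id": doc_id,         # pointer back to the source document
            "paragraph_ix": i,        # position within the source document
            "abstract_vec": vec,      # the paragraph's vector
        }

docs = list(explode("paper-123", [[0.1, 0.2], [0.3, 0.4]]))
```

At query time you'd search the paragraph docs by vector similarity, take the top 3-5 hits, and follow `doc_id` back to the full text for the refinement step.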
-
Yeah, you won't see any difference with 20 docs. Roughly speaking, the approximate methods start to make a difference with tens of thousands of docs.
-
Hi @alexklibisz ,
Hope you are all well!
I am developing a faceted search engine for papers and related code (https://paper2code.com).
I'd like to implement a text similarity algorithm on abstracts in order to suggest other papers to read.
Can I use elastiknn for such a task? I would like to give it a try.
In a nutshell, I want to make paper2code a showcase for some technologies in neural/similarity search.
Thanks for any inputs or insights on these questions.
Cheers,
X