Replies: 10 comments 2 replies
-
Hi, thanks for trying the project!
As long as you have some way to convert your documents to vectors, you should be able to use the plugin.
It could be a simple "one-hot" encoding, in which case you'd store your vectors using the sparse boolean datatype and search with Jaccard or Hamming similarity. However, that's probably not much better than using Elasticsearch's keyword search and "more like this" queries.
If you have a model like doc2vec or something similar for converting documents to vectors, then you would store your vectors using the dense float datatype and probably use angular similarity.
I'm still working a lot on performance for approximate queries. You'll probably get the best results if you can narrow searches down to 10-50k documents and then run exact similarity search. There's an example of this in the docs:
http://elastiknn.klibisz.com/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
Lmk if you have any questions or issues.
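To make the dense-float route concrete, here is a rough sketch of what the mapping and an exact query body might look like, written in Python for readability. The index and field names (`papers`, `abstract_vec`) are made up, and the exact mapping/query syntax is from memory, so treat the linked API docs as authoritative:

```python
import json

DIMS = 64  # dimensionality of your document vectors (model-dependent)

# Mapping sketch: store each abstract's vector as a dense float vector.
mapping = {
    "properties": {
        "abstract_vec": {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {"dims": DIMS},
        }
    }
}

def exact_query(query_vec):
    """Exact angular nearest-neighbors query against abstract_vec."""
    return {
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": "abstract_vec",
                "vec": {"values": query_vec},
                "model": "exact",
                "similarity": "angular",
            }
        }
    }

# These bodies would be sent with the usual Elasticsearch REST calls,
# e.g. PUT /papers with the mapping and POST /papers/_search with the query.
print(json.dumps(exact_query([0.0] * DIMS))[:60])
```

The mapping goes in when you create the index; the query body is just a regular Elasticsearch search request, so it composes with the rest of the query DSL.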
-
Hi, thanks for the reply.
I'm a newbie in this field, but I have 230k documents, in fact paragraphs of arXiv abstracts, as I mentioned above. Can you help me implement the model? I can export the data, but I'm lost on all the other steps ^^. Thanks again.
Btw, do you have a Twitter account so I can maybe DM you and not pollute this issue with my other questions?
Cheers
-
Word/document vectors are a broad topic. The last time I studied them was about three years ago.
The basic concept is that you start with a completely random vector for every word and do some sort of optimization or matrix decomposition to move the vectors around so that they reflect similarity. For example, in the negative-sampling method you look at two vectors at a time: if the two words occurred within some common window of text, your optimizer should change their values to move them closer together; otherwise it should move them apart. Repeat this over a large corpus and you'll eventually see properties like "man + king - woman ~= queen". For document vectors you need some way to combine the word vectors into a single vector, e.g. averaging them, but in my experience you need to do something more complicated than averaging to get good results.
Here's an example I wrote up a while ago:
https://www.kaggle.com/alexklibisz/simple-word-vectors-with-co-occurrence-pmi-and-svd
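The co-occurrence/PMI/SVD recipe from that notebook can be sketched in a few lines of numpy. The toy corpus, window size, and dimensionality below are made up for illustration:

```python
import numpy as np

corpus = [
    "neural networks learn representations".split(),
    "neural networks learn features".split(),
    "papers cite other papers".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-2 word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information: log(P(w,c) / (P(w) * P(c))),
# clipped at zero.
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD yields dense word vectors; rows of `vecs` are embeddings.
u, s, _ = np.linalg.svd(ppmi)
dim = 2
vecs = u[:, :dim] * s[:dim]

# A naive document vector: the average of the document's word vectors.
doc_vec = vecs[[idx[w] for w in corpus[0]]].mean(axis=0)
```

On a real corpus you'd use a proper sparse co-occurrence matrix and truncated SVD (e.g. scikit-learn's `TruncatedSVD`), but the structure of the computation is the same.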
The gensim python library implements some models, has good docs and is
pretty user friendly. I've also heard good things about fast.ai's work on
this topic.
You can probably study and experiment with some of these techniques before
you worry about a storage/search solution like elastiknn. The model and the
storage/search are separable problems.
I've set up a Gitter channel, but it's not very active yet. I likely won't be online much this week either, other than for work.
-
Hi @x0rzkov, I'm gonna go ahead and close this. However, please feel free to comment and I'll re-open when you're at a point where you can generate vectors for your abstracts, and we can talk more about how to use ElastiKnn for the storage/search functionality.
-
I missed this earlier. I don't have a Twitter at the moment. Too distracting for me right now! :)
-
No worries, thanks for your help.
I found https://github.com/DeviantPadam/ResearchPaperScholarlyArticlesRecSystem which is close to what I intend to do, using the following dataset: https://www.kaggle.com/neelshah18/arxivdataset
I'll try to figure out how to convert this dataset (the summary field, of course) into vectors. Do you think that I can use other attributes like tags and authors for refining the search with elastiknn?
-
> Do you think that I can use other attributes like tags and authors for refining the search with elastiknn?

You definitely can. There is an example of this in the API docs.
Your project sounds exciting. Looking forward to seeing how it goes!
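The usual pattern for this is to put the metadata filters and the vector query in the same `bool` query, so the nearest-neighbors scoring only runs over the filtered subset. The field names below (`tags`, `abstract_vec`) are hypothetical and the exact Elastiknn clause syntax is from memory, so check the "filtered subset" example in the API docs:

```python
def filtered_query(tag, query_vec):
    """Filter by a tag, then score the survivors by vector similarity."""
    return {
        "query": {
            "bool": {
                # Cheap metadata filters narrow the candidate set first.
                "filter": [{"term": {"tags": tag}}],
                # The (assumed) Elastiknn clause scores what remains.
                "must": {
                    "elastiknn_nearest_neighbors": {
                        "field": "abstract_vec",
                        "vec": {"values": query_vec},
                        "model": "exact",
                        "similarity": "angular",
                    }
                },
            }
        }
    }
```

Any standard Elasticsearch clause (terms on authors, ranges on dates, etc.) can go in the `filter` list, which is exactly the "narrow down to 10-50k docs, then exact search" strategy mentioned earlier in the thread.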
-
Hi @x0rzkov and @alexklibisz, thanks for having this wonderful discussion here. (Although I am 2+ years late, I would like to participate, please.)
I need to use elastiknn for a similar use case, i.e. finding documents (paragraphs) similar to/matching a query, and then refining and finding the answer within the top 3-5 paragraphs.
Example use case: I have my documents converted into dense floating-point vectors. Please help by pointing me to an Elastiknn example/pseudo-code to achieve this in the best possible way.
-
Hi @DesiKeki
I don't know of a public example of this particular use-case. There's an example here based on image vectors: https://elastiknn.com/tutorials/multimodal-search-amazon-products-dataset/.
If every doc is a single vector, then you can index and query the docs directly.
If you have multiple vectors per text document, then you could maybe create a new ES doc for every vector and just store a pointer back to the original text document.
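The "one ES doc per vector with a pointer back" idea might look something like this. All field names here (`doc_id`, `paragraph_ix`, `abstract_vec`) are hypothetical, just to show the shape of the data:

```python
def explode(doc_id, paragraph_vectors):
    """Yield one indexable ES document per paragraph vector."""
    for i, vec in enumerate(paragraph_vectors):
        yield {
            "_id": f"{doc_id}-{i}",   # unique id per paragraph doc
            "doc_id": doc_id,         # pointer back to the source document
            "paragraph_ix": i,        # position within the source document
            "abstract_vec": vec,      # the paragraph's vector
        }

docs = list(explode("paper-123", [[0.1, 0.2], [0.3, 0.4]]))
```

At query time you'd search the paragraph docs by vector similarity, take the top 3-5 hits, and follow `doc_id` back to the full text for the refinement step.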
-
Yeah, you won't see any difference with 20 docs. Roughly speaking, the approximate methods start to make a difference with tens of thousands of docs.
-
Hi @alexklibisz ,
Hope you are all well!
I am developing a faceted search engine for papers and related code (https://paper2code.com).
I'd like to implement a text similarity algorithm on abstracts in order to suggest other papers to read.
Can I use elastiknn for such a task? I would like to give it a try.
In a nutshell, I want to make paper2code a showcase for some technologies in neural/similarity search.
Thanks for any inputs or insights on these questions.
Cheers,
X