Input already tokenized array #72

Nireas1 · 2022-03-01T20:13:06Z

Hello,
Is it possible to use already tokenized text as input?
As I understand it, readDoc requires a string, and the tokens are then obtained like this:
const doc = nlp.readDoc(text); // where text is of type string
const tokens = doc.tokens().out();

My text is already tokenized as an array of strings. I can misuse the 'as' helper, e.g. as.bow(alreadyTokenizedText), to get a bag of words, but there are many limitations with its other functions and with the 'its' helper.
I might be able to use wink-nlp-utils, but as far as I can see it has no BM25 or similarity detection methods, which I need.
So is there a way to pass an array of strings as input instead of text, or is there any other solution?

Thank you for your time.

Answered by rachnachakraborty

rachnachakraborty · 2022-03-02T18:04:31Z

Hi @Nireas1,

We are not entirely clear about the exact problem you are facing with winkNLP.

BM25 vectorization and similarity features are available in winkNLP, and they take tokenized text as input. Please refer to the documentation.

Could you share your use case along with the problem statement?

Thanks,
Rachna

1 reply

Nireas1 Mar 2, 2022
Author

Sorry if I wasn't clear.
What I meant is that tokenized input comes with many limitations, because things like the helpers cannot be used on it.
To use them, you have to go through the readDoc method, which only accepts a string.

For example, if I wanted to remove the stopwords from an already tokenized text like below:
const t = ["this", "is", "an", "example"];
I wouldn't be able to do it, since I can't use its.stopWordFlag.

Is my only choice to use wink-nlp in conjunction with other packages like wink-nlp-utils to work around that limitation?

Again, thank you for your time.

rachnachakraborty · 2022-03-03T06:10:57Z

Hi,

Thanks for elaborating on the requirement.

Your requirement is unusual for winkNLP, because it is a processing pipeline that produces all of its annotations through readDoc('input text'). These annotations include token properties such as stop word, abbreviation, token type, negation and many more. An array of tokens alone is therefore not sufficient for using the helpers.

You will have to follow a hybrid approach in this case, as you have already tried.
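
A minimal sketch of such a hybrid approach for the stop-word case raised earlier in the thread, written in plain JavaScript so that it runs without any package. The names STOP_WORDS and removeStopWords are hypothetical and not part of winkNLP, and the word list is a tiny made-up sample; in practice a fuller list, for example one shipped with another package, would be substituted.

```javascript
// Hypothetical, deliberately tiny stop-word list (illustration only).
const STOP_WORDS = new Set(['a', 'an', 'the', 'is', 'this', 'of', 'to']);

// Filter an already tokenized array without going through readDoc().
// Tokens are compared in lower case; the surviving strings are returned untouched.
function removeStopWords(tokens, stopWords = STOP_WORDS) {
  return tokens.filter((token) => !stopWords.has(token.toLowerCase()));
}

const t = ['this', 'is', 'an', 'example'];
const filtered = removeStopWords(t);
console.log(filtered); // [ 'example' ]
```

Because the filter never re-tokenizes, the remaining tokens keep their original form and can be handed to other tools as they are.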

Here are some how-tos on winkNLP.

Cheers,
Rachna

3 replies

Nireas1 Mar 3, 2022
Author

Thank you very much.
I just wanted to make sure I wasn't missing something.
I will use it in conjunction with other wink packages, as you suggested.

Thanks again.

rachnachakraborty Mar 4, 2022

I was curious to know: is the input available only in tokenized form, with no access to the raw text?

If you have a tokenized array, you can join the tokens on spaces to create a raw text string, pass it through the winkNLP pipeline and leverage all the features.

Please check whether this option works for you.
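
The join-and-reparse idea can be checked before committing to it. The sketch below is only an illustration under stated assumptions: findSplitTokens and toRawText are hypothetical helpers, and nlp is assumed to be an instance created as described in the winkNLP documentation. It passes each token through readDoc and reports those that the tokenizer breaks into more than one token.

```javascript
// Hypothetical helper: list the input tokens that the tokenizer would split
// into more than one token. `nlp` is assumed to be created elsewhere, e.g.
//   const winkNLP = require('wink-nlp');
//   const model = require('wink-eng-lite-web-model');
//   const nlp = winkNLP(model);
function findSplitTokens(nlp, tokens) {
  return tokens.filter((token) => nlp.readDoc(token).tokens().out().length > 1);
}

// Join the array on spaces to obtain raw text for the pipeline.
function toRawText(tokens) {
  return tokens.join(' ');
}

// Usage, given a real `nlp` instance:
//   const tokens = ['this', 'is', 'an', 'example'];
//   if (findSplitTokens(nlp, tokens).length === 0) {
//     const doc = nlp.readDoc(toRawText(tokens));
//   }
```

If findSplitTokens returns an empty array, joining on spaces should give back the same token boundaries; otherwise the listed tokens are the ones that would change.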

Thanks,
Rachna

Nireas1 Mar 4, 2022
Author

I tried that, but the array is produced by WordNet and contains words like "particular(a)", "designate(ip)" and "unique(p)", which nlp separates into several tokens with punctuation (e.g. "designate", "(", "p", ")"). I would like to keep them as is, because they seem to produce better results with similarity methods such as cosine similarity.

Thanks,
Nireas
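
For reference, cosine similarity over bags of words can be computed directly from pre-tokenized arrays, which keeps WordNet-style tokens such as "unique(p)" intact. This is a plain JavaScript sketch with hypothetical helper names (bagOfWords, cosineSimilarity); it is not winkNLP's own implementation.

```javascript
// Count term frequencies for an already tokenized array; tokens are kept as-is.
function bagOfWords(tokens) {
  const bow = new Map();
  for (const token of tokens) {
    bow.set(token, (bow.get(token) || 0) + 1);
  }
  return bow;
}

// Cosine similarity between two bags of words (Map of token -> count).
function cosineSimilarity(bowA, bowB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [token, count] of bowA) {
    dot += count * (bowB.get(token) || 0);
    normA += count * count;
  }
  for (const count of bowB.values()) {
    normB += count * count;
  }
  if (normA === 0 || normB === 0) return 0; // an empty bag has no direction
  return dot / Math.sqrt(normA * normB);
}

const a = bagOfWords(['particular(a)', 'unique(p)', 'designate(ip)']);
const b = bagOfWords(['unique(p)', 'designate(ip)', 'single(a)']);
console.log(cosineSimilarity(a, b).toFixed(3)); // 0.667
```

Two of the three tokens are shared here, so the score is 2/3; a token such as "unique(p)" only matches another "unique(p)", never a bare "unique".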
