Open
Description
The following is a high-level view of the next main enhancement steps. This document will be kept up to date and improved frequently. The work will mainly be carried out by @mk2510 and @henrifroese as part of their SummerOfCode project.
- Version 1.10
  - Every representation function to receive as input a `TokenSeries` (TokenSeries as input to every representation function #44)
  - Decouple TF-IDF L2-normalization and TF-IDF (tfidf(s): remove normalization, improve docstring #76)
  - Rename `term_frequency` to `count()` + add function `term_frequency` (count(s) and term_frequency(s) #61)
  - Introduce `HeroSeries`
    - Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
    - Can we avoid the use of `VectorSeries`/`TokenSeries`?
    - All `representation` functions to deal with `HeroSeries` + (DocumentTermDF) (Support "Pandas Series Representation" #43)
  - Update README + getting-started.md
  - Push a new version to PyPI
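To make the split between counting, term frequency, TF-IDF and normalization concrete, here is a minimal pure-Python sketch that operates on tokenized documents (lists of tokens). It is not Texthero's implementation. The function names mirror the ones in the list above, but the signatures, the dict-based output, the IDF variant and the definition of term frequency (count divided by document length) are all assumptions made for illustration.

```python
import math
from collections import Counter


def count(tokens_per_doc):
    """Raw term counts for each tokenized document."""
    return [Counter(tokens) for tokens in tokens_per_doc]


def term_frequency(tokens_per_doc):
    """Counts divided by document length (assumed definition)."""
    result = []
    for counts in count(tokens_per_doc):
        total = sum(counts.values())
        result.append({t: c / total for t, c in counts.items()} if total else {})
    return result


def tfidf(tokens_per_doc):
    """Unnormalized TF-IDF using count * (ln(N / df) + 1), one common variant."""
    n_docs = len(tokens_per_doc)
    counts = count(tokens_per_doc)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in counts for term in doc)
    return [
        {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in doc.items()}
        for doc in counts
    ]


def norm(vectors, kind="l2"):
    """Normalize each sparse vector as a separate, explicit step."""
    if kind not in ("l1", "l2"):
        raise ValueError("kind must be 'l1' or 'l2'")
    out = []
    for vec in vectors:
        if kind == "l1":
            size = sum(abs(v) for v in vec.values())
        else:
            size = math.sqrt(sum(v * v for v in vec.values()))
        # All-zero or empty vectors are returned unchanged.
        out.append({t: v / size for t, v in vec.items()} if size else dict(vec))
    return out
```

With this shape, a caller who wants the previous behaviour would chain the two steps, for example `norm(tfidf(docs), "l2")`, while a caller who needs raw weights simply skips the `norm` call.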
- Performance: speed-up the library
  - Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
  - Make spaCy function faster + Dask vs Spacy (Make spaCy-nlp functions faster #65)
  - Depending on the previous task, evaluate if we want to have spaCy as default tokenizer (tokenize with Spacy #131)
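As a rough illustration of the parallelization question above, the following sketch spreads tokenization over worker processes with the standard library's `concurrent.futures`. It is only a sketch under stated assumptions: `tokenize` is a stand-in whitespace tokenizer, and `tokenize_all`, `workers`, `chunksize` and `use_processes` are hypothetical names. It says nothing about how spaCy or Dask would be integrated.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tokenize_all(texts, workers=2, chunksize=1000, use_processes=True):
    """Tokenize documents in parallel, preserving the input order.

    Processes give real parallelism for CPU-bound work; threads are
    offered as a fallback (chunksize has no effect for threads).
    """
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(tokenize, texts, chunksize=chunksize))
```

When using processes, call `tokenize_all` from under an `if __name__ == "__main__":` guard so that worker start-up does not re-run the calling script. For small inputs the cost of starting workers and pickling the data can outweigh the gain, so any speed-up should be measured rather than assumed.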
- Software development:
  - Integrate checking for correct Series types (Kind of Pandas Series #60, Check if Series consists of strings only, instead of casting to unicode #55, ...)
  - Check hero functions work with np.nan (All function to deal with np.nan #86)
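The two checks above could look roughly like the following sketch. It works on any iterable of values (iterating a pandas Series yields its values), so it avoids depending on pandas itself. The helper names `is_missing`, `check_text_values` and `apply_skipping_missing` are hypothetical and not part of Texthero; the sketch relies only on the fact that `np.nan` is an ordinary Python float.

```python
import math


def is_missing(value):
    """True for None and for float NaN (np.nan is a float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_text_values(values, allow_missing=True):
    """Raise TypeError unless every value is a str (or missing, if allowed)."""
    for position, value in enumerate(values):
        if isinstance(value, str):
            continue
        if allow_missing and is_missing(value):
            continue
        raise TypeError(
            f"value at position {position} is {type(value).__name__}, expected str"
        )


def apply_skipping_missing(func, values):
    """Apply func to each string and pass missing values through untouched."""
    return [value if is_missing(value) else func(value) for value in values]
```

Raising early instead of casting to unicode makes a wrong input type visible to the caller, and passing missing values through keeps the output aligned with the input positions.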
- Support Embeddings through Flair
  - Add hero.embed(s, flairEmbedding)
- Add Topic Modeling
  - Add topic modeling support under representation (Implement/support/explain topic modelling #42). This also includes "topic modeling visualization" to get insights out of it
  - Add a blog article on how topic modeling with Texthero works
- Extra
  - test coverage
  - expand multilingual: more languages; recognize languages and select correct one
  - (low priority) Text summarization (Preprocessing: explain how to create a custom pipeline #38) and characteristic terms (Characteristic Terms & Keywords #2)