The Segmentation tool you provide is excellent. One feature request:
Unless I am mistaken, the tool always (1) returns the split words in lower case and (2) provides no information about where spaces were inserted. A preserve_case or capitalize parameter would help with (1). The following code capitalizes the segmented string according to the capitalization used in the hashtag.
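from ekphrasis.classes.segmenter import Segmenter

segmenter = Segmenter(corpus="twitter")

def word_segmentation(text, fix_case=True):
    # Segment the text; spaces are inserted between the detected words.
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string
    fixed = ""
    n_add = 0  # number of spaces inserted by the segmenter so far
    for i in range(len(words_string)):
        # words_string[i] lines up with text[i - n_add], so compare against
        # that index (i + n_add would drift past the end of text).
        if words_string[i] == " " and text[i - n_add] != " ":
            # This space was added by the segmenter; it is not in the input.
            n_add += 1
            fixed += " "
            continue
        # Copy the capitalization of the corresponding input character.
        is_capital = text[i - n_add].isupper()
        if is_capital:
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed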
Of course, if the user writes hashtags in camelCase or PascalCase, the capitalization may not be meaningful, but in other cases it can be. For instance:
I #eatsomuch food --> I eat so much food.
I care so much. #IranProtests --> I care so much. Iran Protests
Arguably, a stand-alone hashtag usually refers to a proper noun, in which case the capitalization used is meaningful.
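For point (2), where the spaces were inserted, here is a rough sketch of what such output could look like. It does not call ekphrasis at all: `restore_case` and its `offsets` return value are hypothetical names of mine, not part of the library. It assumes the segmented string differs from the input only by letter case and inserted spaces.

```python
def restore_case(original, segmented):
    """Re-apply the casing of `original` to `segmented`.

    Returns (fixed, offsets), where `offsets` lists the indices into
    `original` before which a space was inserted.
    Assumption: `segmented` equals `original` up to case and added spaces.
    """
    fixed = []
    offsets = []
    j = 0  # current position in `original`
    for ch in segmented:
        at_end = j >= len(original)
        if ch == " " and (at_end or original[j] != " "):
            # Space introduced by segmentation: record where it went.
            offsets.append(j)
            fixed.append(" ")
            continue
        if at_end or ch.lower() != original[j].lower():
            raise ValueError("segmented text does not align with original")
        # Take the character from the original so its case is preserved.
        fixed.append(original[j])
        j += 1
    return "".join(fixed), offsets


if __name__ == "__main__":
    print(restore_case("IranProtests", "iran protests"))
    print(restore_case("eatsomuch", "eat so much"))
```

Copying characters from the input, rather than upper-casing the segmenter output, keeps the alignment check and the case restoration in one step. The offsets could be returned alongside the string by a preserve_case-style option.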