Segmentation: Preserve case? #19

@davidbernat

The Segmentation tool you provide is excellent. One feature request:

Unless I am mistaken, the tool (1) always returns the split words in lower case, and (2) does not report where spaces were inserted. A preserve_case or capitalize parameter would be helpful for (1). The following code re-capitalizes the segmented string according to the capitalization used in the hashtag.


from ekphrasis.classes.segmenter import Segmenter
segmenter = Segmenter(corpus="twitter")


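def word_segmentation(text, fix_case=True):
    # The segmenter returns the words lower-cased and separated by spaces.
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string

    fixed = ""

    # Number of spaces the segmenter has inserted so far, so that
    # words_string[i] lines up with text[i - n_add].
    n_add = 0
    for i in range(len(words_string)):
        # A space with no counterpart in the original text was inserted.
        if words_string[i] == " " and text[i-n_add] != " ":
            n_add += 1
            fixed += " "
            continue

        # Copy the capitalization of the matching original character.
        is_capital = text[i-n_add].isupper()
        if is_capital:
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed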

Of course, if the user writes hashtags in camelCase or PascalCase only to mark word boundaries, the capitalization may not be meaningful, but in other cases it can be. For instance:

I #eatsomuch food --> I eat so much food.
I care so much. #IranProtests --> I care so much. Iran Protests

Arguably, a stand-alone hashtag often functions as a proper noun, in which case the capitalization the author chose is meaningful.
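
For illustration, here is a minimal sketch of how one helper might cover both points: restoring the case and reporting where spaces were inserted. It assumes the segmenter only lower-cases characters and inserts spaces, and that the leading # has already been stripped. The name restore_case and its return shape are hypothetical and not part of ekphrasis. The segmented strings in the usage lines are written by hand, not produced by the library.

```python
def restore_case(original, segmented):
    """Copy capitalization from `original` onto `segmented`.

    Returns a tuple (cased, inserted_at), where `inserted_at` lists the
    indices in `original` before which a space was inserted.
    Hypothetical helper; assumes `segmented` is `original` lower-cased
    with extra spaces added and nothing else changed.
    """
    cased = []
    inserted_at = []
    j = 0  # current position in `original`
    for ch in segmented:
        # A space that has no counterpart in `original` was inserted.
        if ch == " " and (j >= len(original) or original[j] != " "):
            inserted_at.append(j)
            cased.append(" ")
            continue
        if j >= len(original):
            raise ValueError("segmented text is longer than the original")
        # Upper-case the character if the original character was upper case.
        cased.append(ch.upper() if original[j].isupper() else ch)
        j += 1
    return "".join(cased), inserted_at


# Hand-written segmentations standing in for the segmenter's output.
print(restore_case("IranProtests", "iran protests"))
print(restore_case("eatsomuch", "eat so much"))
```

Returning the insertion indices alongside the string lets the caller map each word back to its offset in the original hashtag without re-running the alignment.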
