Problem with diacritics and transliterating to lists #174

Open
drisme opened this issue Apr 4, 2024 · 1 comment

drisme commented Apr 4, 2024

I've run into some issues in several languages, where diacritics lead to strange behavior.

Example in French:

import epitran

lang_code = 'fra-Latn'
epi = epitran.Epitran(lang_code)
print(epi.trans_list(u"mobilisèrent"))
print(epi.trans_delimiter(u"mobilisèrent"))
print(epi.trans_delimiter(u"mobilisèrent", delimiter='~'))

which yields the outputs

['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə', '̀', 'ʀ', 'ɑ̃']
m ɔ b i l i z ə ̀ ʀ ɑ̃
m~ɔ~b~i~l~i~z~ə~̀~ʀ~ɑ̃

When a space is used as the delimiter, the diacritic appears to attach itself to the next letter. With any other delimiter, such as a tilde, an extra delimiter is output and the diacritic then modifies that delimiter (a tilde in this case, but the same happens with any chosen delimiter).

This happens in other languages as well; so far I've tried Portuguese and Italian, with the same result.

Is this expected behavior, or is there some kind of trick I am unaware of? To my understanding, a diacritic is not an additional phoneme but a modifier. I also understand that Unicode places combining diacritics after the base character, so is this perhaps an encoding issue?
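To illustrate the point about combining diacritics, here is a minimal sketch that uses only Python's standard `unicodedata` module and does not call Epitran. It shows that "è" can be stored either as one precomposed code point or as "e" followed by a separate combining grave accent, and that joining a stray combining mark with a delimiter places the mark right after the delimiter. Whether Epitran decomposes its input this way internally is an assumption based on the output above, not something confirmed here.

```python
import unicodedata

word = "mobilis\u00e8rent"  # "mobilisèrent" with a precomposed è (U+00E8)

# NFC keeps è as a single code point; NFD splits it into
# "e" followed by U+0300 COMBINING GRAVE ACCENT.
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)
print(len(nfc), len(nfd))  # 12 13

grave = "\u0300"
print(unicodedata.name(grave))      # COMBINING GRAVE ACCENT
print(unicodedata.category(grave))  # Mn (nonspacing mark)

# A combining mark that ends up as its own list element follows the
# delimiter once the list is joined, so it renders on the delimiter.
segments = ["\u0259", grave, "\u0280"]  # ə, combining grave, ʀ
joined = "~".join(segments)
print([hex(ord(ch)) for ch in joined])
```

The last line prints the code points rather than the string itself, because the rendered string hides which character the accent belongs to.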


drisme commented Apr 5, 2024

Ok, for anyone facing this same issue, I have written a solution for postprocessing the delimited strings:

import unicodedata

def split_ipa(transliterated_text, delimiter='|'):
    # Split the string based on the specified delimiter
    parts = transliterated_text.split(delimiter)

    # Initialize an empty list to hold the corrected segments
    corrected_parts = []

    # Loop through the parts to reattach any diacritics to their base character
    for part in parts:
        # The `part` check guards against empty segments (e.g. from doubled
        # delimiters), which would otherwise raise an IndexError on part[0]
        if corrected_parts and part and unicodedata.category(part[0]) == 'Mn':
            # If the part starts with a diacritic, attach it to the previous part
            corrected_parts[-1] += part
        else:
            # Otherwise, add the part to the list as a new segment
            corrected_parts.append(part)

    return corrected_parts

Now if you run the following code the delimited string is correctly split:

lang_code = 'fra-Latn' 
epi = epitran.Epitran(lang_code)
enc = epi.trans_delimiter(u"mobilisèrent", delimiter='|')
print("Original split:", enc.split('|'))
print("Corrected split:", split_ipa(enc))

This outputs:

Original split: ['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə', '̀', 'ʀ', 'ɑ̃']
Corrected split: ['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə̀', 'ʀ', 'ɑ̃']
