Add support for Indic languages (Hindi) in IPA G2P #15371
Changes from all commits
633adb7
385e555
30379a8
f43712b
1909639
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,6 +20,7 @@ | |
|
|
||
| from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import validate_locale | ||
| from nemo.collections.common.tokenizers.text_to_speech.tokenizer_utils import ( | ||
| + INDIC_CHARS_ALL, | ||
| LATIN_CHARS_ALL, | ||
| any_locale_word_tokenize, | ||
| english_word_tokenize, | ||
|
|
@@ -29,13 +30,16 @@ | |
| from nemo.collections.tts.g2p.utils import GRAPHEME_CASE_MIXED, GRAPHEME_CASE_UPPER, set_grapheme_case | ||
| from nemo.utils import logging | ||
|
|
||
| + # Compiled regex pattern for Indic scripts (used in dictionary parsing) | ||
| + _INDIC_PATTERN = re.compile(f'^[{INDIC_CHARS_ALL}]') | ||
|
|
||
|
|
||
| class IpaG2p(BaseG2p): | ||
| # fmt: off | ||
| STRESS_SYMBOLS = ["ˈ", "ˌ"] | ||
| # Regex for roman characters, accented characters, and locale-agnostic numbers/digits | ||
| - CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}\d]") | ||
| - PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}\d]") | ||
| + CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]") | ||
| + PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]") | ||
|
Comment on lines 39 to +42
|
||
| # fmt: on | ||
|
|
||
| def __init__( | ||
|
|
@@ -190,6 +194,7 @@ def _parse_phoneme_dict( | |
| or 'À' <= line[0] <= 'Ö' | ||
| or 'Ø' <= line[0] <= 'ö' | ||
| or 'ø' <= line[0] <= 'ÿ' | ||
| + or _INDIC_PATTERN.match(line[0]) | ||
| or line[0] == "'" | ||
| ): | ||
| parts = line.strip().split(maxsplit=1) | ||
|
|
||
`INDIC_CHARS_ALL` uses entire Unicode blocks (e.g., `\u0900-\u097F`), which include punctuation such as the Devanagari danda `।` (U+0964). Because `_WORDS_RE_ANY_LOCALE` treats everything in `INDIC_CHARS_ALL` as part of a “word”, strings like `दुनिया।` will be tokenized as a single word and won't match phoneme-dict entries for `दुनिया`. Consider narrowing these ranges to letters/marks (and optionally digits) and explicitly excluding script punctuation like `।`/`॥` so they tokenize as punctuation separators.
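
A minimal sketch of the behavior described in this comment, restricted to Devanagari. The names `DEVANAGARI_BLOCK`, `DEVANAGARI_WORD_CHARS`, and `tokenize` are hypothetical stand-ins for illustration and are not NeMo's actual constants or tokenizer. The narrowed range is one possible choice under the assumption that only U+0964, U+0965, and U+0970 need to be treated as punctuation.

```python
import re

# Hypothetical stand-ins for illustration; NOT the constants defined in NeMo.
# Whole Devanagari block: includes danda (U+0964) and double danda (U+0965).
DEVANAGARI_BLOCK = "\u0900-\u097F"
# Same block with U+0964, U+0965 and the abbreviation sign U+0970 left out.
# Letters, dependent marks and the digits U+0966-U+096F are kept.
DEVANAGARI_WORD_CHARS = "\u0900-\u0963\u0966-\u096F\u0971-\u097F"


def tokenize(text, word_chars):
    """Return runs of word characters and single non-space punctuation marks."""
    pattern = re.compile(f"[{word_chars}]+|[^{word_chars}\\s]")
    return pattern.findall(text)


word = "\u0926\u0941\u0928\u093F\u092F\u093E"  # दुनिया
text = word + "\u0964"                          # दुनिया।

broad = tokenize(text, DEVANAGARI_BLOCK)        # danda stays glued to the word
narrow = tokenize(text, DEVANAGARI_WORD_CHARS)  # danda becomes its own token

print(broad)
print(narrow)
```

With the whole block in the word class, the danda is absorbed into the word token, so a dictionary lookup for `दुनिया` would miss. With the narrowed class, the word and the danda come out as separate tokens.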