Add support for Indic languages (Hindi) in IPA G2P #15371
quapham wants to merge 5 commits into NVIDIA-NeMo:main
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quapham <quapham@users.noreply.github.com>
Is there any reason why this dictionary contains both English and Hindi entries? I would expect a Hindi dictionary to contain only Hindi entries.
We want to support mixed Hindi and English. Since Hindi does not use Latin characters, we also need to include English characters in the dictionary.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
# Indic characters based on https://www.unicode.org/charts/
DEVANAGARI_CHARS = (
    r'\u0900-\u097F'  # Hindi, Marathi, Nepali, Sanskrit https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)
)
BENGALI_CHARS = r'\u0980-\u09FF'  # Bengali, Assamese
TAMIL_CHARS = r'\u0B80-\u0BFF'  # Tamil
TELUGU_CHARS = r'\u0C00-\u0C7F'  # Telugu
KANNADA_CHARS = r'\u0C80-\u0CFF'  # Kannada
GUJARATI_CHARS = r'\u0A80-\u0AFF'  # Gujarati
INDIC_CHARS_ALL = f"{DEVANAGARI_CHARS}{BENGALI_CHARS}{TAMIL_CHARS}{TELUGU_CHARS}{KANNADA_CHARS}{GUJARATI_CHARS}"
INDIC_CHARS_ALL uses entire Unicode blocks (e.g., \u0900-\u097F), which includes punctuation such as Devanagari danda । (U+0964). Because _WORDS_RE_ANY_LOCALE treats everything in INDIC_CHARS_ALL as part of a “word”, strings like दुनिया। will be tokenized as a single word and won't match phoneme-dict entries for दुनिया. Consider narrowing these ranges to letters/marks (and optionally digits) and explicitly excluding script punctuation like ।/॥ so they tokenize as punctuation separators.
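The tokenization concern can be reproduced outside NeMo. The following is a minimal sketch, assuming a simplified word splitter (`split_words` is a hypothetical stand-in, not `_WORDS_RE_ANY_LOCALE`) and a hypothetical narrowed range, `DEVANAGARI_LETTERS_AND_MARKS`, that skips danda (U+0964), double danda (U+0965), the digits (U+0966-U+096F), and the abbreviation sign (U+0970):

```python
import re

# Whole Devanagari block, as in the PR (includes danda U+0964 and double danda U+0965).
DEVANAGARI_BLOCK = r"\u0900-\u097F"

# Hypothetical narrowed set (an assumption, not part of the PR): signs, letters,
# and dependent marks only. Skips U+0964, U+0965, U+0966-U+096F, and U+0970.
DEVANAGARI_LETTERS_AND_MARKS = r"\u0900-\u0963\u0971-\u097F"


def split_words(text, char_class):
    """Return maximal runs of characters from char_class.

    Simplified stand-in for a word regex; not the NeMo implementation.
    """
    return re.findall(f"[{char_class}]+", text)


text = "नमस्ते दुनिया\u0964"  # sentence ending in danda (U+0964)

# With the full block, the danda stays glued to the last word.
print(split_words(text, DEVANAGARI_BLOCK))

# With the narrowed set, the danda is no longer part of any word.
print(split_words(text, DEVANAGARI_LETTERS_AND_MARKS))
```

With the narrowed range the last word comes out without the trailing danda, so a dictionary lookup for it can succeed.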
  STRESS_SYMBOLS = ["ˈ", "ˌ"]
  # Regex for roman characters, accented characters, and locale-agnostic numbers/digits
- CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}\d]")
- PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}\d]")
+ CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")
+ PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")
CHAR_REGEX/PUNCT_REGEX are expanded to include INDIC_CHARS_ALL, but since INDIC_CHARS_ALL currently includes Devanagari punctuation (e.g., ।), parse_one_word() will no longer recognize those symbols as punctuation-only tokens (CHAR_REGEX.search() will match). This can cause Hindi punctuation to be treated as part of a word and lead to OOV fallbacks instead of dictionary lookups. Recommend switching to an Indic letter/mark set (excluding danda/double danda) for CHAR_REGEX and treating those punctuation marks via the punctuation path.
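A small sketch of the punctuation-only check, under stated assumptions: `LATIN_CHARS` is a simplified stand-in for `LATIN_CHARS_ALL`, `is_punct_only` is a hypothetical helper that mirrors the `CHAR_REGEX.search()` test described above, and the narrowed Devanagari range is illustrative rather than the PR's final choice.

```python
import re

LATIN_CHARS = r"A-Za-z"  # simplified stand-in for LATIN_CHARS_ALL (assumption)
DEVANAGARI_BLOCK = r"\u0900-\u097F"  # whole block, as in the PR
DEVANAGARI_LETTERS_AND_MARKS = r"\u0900-\u0963\u0971-\u097F"  # hypothetical narrowed set

CHAR_REGEX_BLOCK = re.compile(fr"[{LATIN_CHARS}{DEVANAGARI_BLOCK}\d]")
CHAR_REGEX_NARROW = re.compile(fr"[{LATIN_CHARS}{DEVANAGARI_LETTERS_AND_MARKS}\d]")


def is_punct_only(token, char_regex):
    """A token is punctuation-only when no word character is found in it."""
    return char_regex.search(token) is None


# Danda, double danda, a Hindi word, and the Devanagari digit one.
for token in ("\u0964", "\u0965", "नमस्ते", "\u0967"):
    print(
        repr(token),
        is_punct_only(token, CHAR_REGEX_BLOCK),
        is_punct_only(token, CHAR_REGEX_NARROW),
    )
```

Note that in Python's `re` module, `\d` in a `str` pattern matches any Unicode decimal digit, so Devanagari digits are still treated as word characters even when their code points are left out of the narrowed range.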
    or 'a' <= line[0] <= 'z'
    or 'À' <= line[0] <= 'Ö'
    or 'Ø' <= line[0] <= 'ö'
    or 'ø' <= line[0] <= 'ÿ'
    or _INDIC_PATTERN.match(line[0])
    or line[0] == "'"
):
    parts = line.strip().split(maxsplit=1)
_INDIC_PATTERN is built from INDIC_CHARS_ALL and used to decide whether a dictionary line begins with a “word”. Because the underlying ranges include non-letters (digits/punctuation inside the script blocks), this may incorrectly accept lines that start with script punctuation (or digits) as lexicon entries. Tightening the Indic ranges (letters/marks only, excluding danda) will make dictionary parsing safer and consistent with tokenization.
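As an illustration of the stricter check, here is a minimal sketch of a dictionary-line filter. `parse_entry` and `_INDIC_WORD_START` are hypothetical names, the Latin test is reduced to ASCII letters for brevity, and the Devanagari range is the same illustrative letters-and-marks set rather than anything confirmed by the PR.

```python
import re

# Hypothetical first-character pattern: Devanagari signs, letters, and marks only.
# Danda (U+0964), double danda (U+0965), and digits (U+0966-U+096F) are excluded.
_INDIC_WORD_START = re.compile(r"[\u0900-\u0963\u0971-\u097F]")


def parse_entry(line):
    """Return (word, pronunciation) for a lexicon line, or None if it is not an entry."""
    if not line:
        return None
    first = line[0]
    if (
        "A" <= first <= "Z"
        or "a" <= first <= "z"
        or _INDIC_WORD_START.match(first)
        or first == "'"
    ):
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            return parts[0], parts[1]
    return None


print(parse_entry("नमस्ते nəmˈʌsteː"))   # accepted: starts with a Devanagari letter
print(parse_entry("\u0964 x"))           # rejected: starts with danda
print(parse_entry("\u0967\u0968 x"))     # rejected: starts with a Devanagari digit
print(parse_entry(";;; comment line"))   # rejected: comment
```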
def test_ipa_tokenizer_hi_in(self):
    input_text = "नमस्ते दुनिया"
    expected_output = "nəmˈʌsteː dˈʊnɪjˌaː"
    g2p = IpaG2p(phoneme_dict=self.PHONEME_DICT_HI, locale="hi-IN")
    tokenizer = IPATokenizer(g2p=g2p, locale="hi-IN")
    chars, tokens = self._parse_text(tokenizer, input_text)
    assert chars == expected_output
The new Hindi test covers basic word lookup, but it doesn’t cover the main edge case introduced by the regex updates: Indic punctuation separation (e.g., नमस्ते। should still map नमस्ते via the dict and keep । as punctuation). Adding a unit test for this will prevent regressions if the Indic character ranges are refined to exclude danda/double danda.
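The shape of such a regression test can be sketched without NeMo. Everything below is an assumption for illustration: `toy_g2p` is a hypothetical stand-in for the real G2P path, the word range is the illustrative letters-and-marks set, and the per-word pronunciations are taken from the expected output of the existing Hindi test. A real test would call `IpaG2p` and `IPATokenizer` as in `test_ipa_tokenizer_hi_in`.

```python
import re

# Hypothetical narrowed word-character set (danda and double danda excluded).
WORD_CHARS = r"\u0900-\u0963\u0971-\u097F"

# Match either a run of word characters or a single non-word character.
TOKEN_RE = re.compile(fr"[{WORD_CHARS}]+|[^{WORD_CHARS}]")

# Pronunciations inferred from the expected output of the existing Hindi test.
PHONEME_DICT_HI = {
    "नमस्ते": "nəmˈʌsteː",
    "दुनिया": "dˈʊnɪjˌaː",
}


def toy_g2p(text, phoneme_dict):
    """Look up each word token; pass spaces, punctuation, and unknown words through."""
    return "".join(phoneme_dict.get(tok, tok) for tok in TOKEN_RE.findall(text))


def test_hindi_danda_is_separated():
    # The danda must not block the dictionary lookup and must survive as punctuation.
    assert toy_g2p("नमस्ते\u0964", PHONEME_DICT_HI) == "nəmˈʌsteː\u0964"
    assert toy_g2p("नमस्ते दुनिया\u0964", PHONEME_DICT_HI) == "nəmˈʌsteː dˈʊnɪjˌaː\u0964"


test_hindi_danda_is_separated()
```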
Some of Copilot's comments seem reasonable. Could you please have a look and ignore any that are irrelevant? Thanks!
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
This PR extends the IPA G2P system to support Indic languages (Hindi).
Collection: TTS, common
Changelog
Usage
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information