Add support for Indic languages (Hindi) in IPA G2P by quapham · Pull Request #15371 · NVIDIA-NeMo/NeMo

quapham · 2026-02-09T03:38:56Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

This PR extends the IPA G2P system to support Indic languages (Hindi).

Collection: TTS, common

Changelog

Add Unicode character ranges for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Gujarati)
Extend word-level regex to support Indic scripts in IPA G2P
Update character and punctuation regex to include Indic characters
Add Hindi IPA dictionary support

Usage

You can potentially add a usage example below

hindi_phoneme:
      _target_: nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer
      locale: hi-IN
      punct: true
      apostrophe: true
      pad_with_space: true
      g2p:
        _target_: nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p
        locale: 'hi-IN'
        phoneme_dict: "scripts/tts_dataset_files/hi_IN/hi_en_prondict-v0.1.dict"
        phoneme_probability: 0.8
        use_chars: true
        use_stresses: false
        grapheme_case: upper
        grapheme_prefix: '#'

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: quanpham <youngkwan199@gmail.com>

Signed-off-by: quapham <quapham@users.noreply.github.com>

XuesongYang · 2026-02-09T19:43:22Z

scripts/tts_dataset_files/hi_IN/hi_en_prondict-v0.1.dict

Is there any reason why this dictionary contains both English and Hindi? I expected the Hindi dictionary to contain only Hindi entries.

quapham · 2026-02-12

We want to support mixed Hindi and English text. Since Hindi is not written with Latin characters, the dictionary also needs English entries to cover the Latin-script words.
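
A minimal sketch of why one mixed dictionary can serve both scripts, assuming keys are stored upper-cased as `grapheme_case: upper` in the usage config suggests. `TOY_MIXED_DICT` and `lookup` are hypothetical names, the English entry is an illustrative placeholder, and the Hindi pronunciation is taken from the expected output of the PR's test. This is not NeMo's lookup code.

```python
# Toy mixed dictionary: keys are upper-cased graphemes, values are IPA strings.
# The English entry is a placeholder, not copied from the PR's dictionary file.
TOY_MIXED_DICT = {
    "HELLO": "həlˈoʊ",
    "नमस्ते": "nəmˈʌsteː",
}


def lookup(word):
    """Upper-case the word and look it up; return None when it is out of vocabulary."""
    # str.upper() only changes cased letters. Devanagari has no case,
    # so Hindi keys pass through unchanged.
    return TOY_MIXED_DICT.get(word.upper())


pronunciations = [lookup(w) for w in "hello नमस्ते".split()]
```

With a Hindi-only dictionary, the Latin-script word in this input would fall through to whatever out-of-vocabulary handling the G2P applies.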

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

Copilot · 2026-02-11T17:56:38Z

nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py

+# Indic characters based on https://www.unicode.org/charts/
+DEVANAGARI_CHARS = (
+    r'\u0900-\u097F'  # Hindi, Marathi, Nepali, Sanskrit https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)
+)
+BENGALI_CHARS = r'\u0980-\u09FF'  # Bengali, Assamese
+TAMIL_CHARS = r'\u0B80-\u0BFF'  # Tamil
+TELUGU_CHARS = r'\u0C00-\u0C7F'  # Telugu
+KANNADA_CHARS = r'\u0C80-\u0CFF'  # Kannada
+GUJARATI_CHARS = r'\u0A80-\u0AFF'  # Gujarati
+INDIC_CHARS_ALL = f"{DEVANAGARI_CHARS}{BENGALI_CHARS}{TAMIL_CHARS}{TELUGU_CHARS}{KANNADA_CHARS}{GUJARATI_CHARS}"


INDIC_CHARS_ALL uses entire Unicode blocks (e.g., \u0900-\u097F), which includes punctuation such as Devanagari danda । (U+0964). Because _WORDS_RE_ANY_LOCALE treats everything in INDIC_CHARS_ALL as part of a “word”, strings like दुनिया। will be tokenized as a single word and won't match phoneme-dict entries for दुनिया. Consider narrowing these ranges to letters/marks (and optionally digits) and explicitly excluding script punctuation like ।/॥ so they tokenize as punctuation separators.
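
The effect described here can be reproduced with a small sketch. `DEVANAGARI_LETTERS` and `split_words` are hypothetical names, and the narrowed range is an assumption based on the Unicode chart for the Devanagari block (it skips U+0964, U+0965 and U+0970). It is not the range the PR ships, and the pattern is a simplified stand-in for `_WORDS_RE_ANY_LOCALE`.

```python
import re

# Full block, as in the PR: includes danda (U+0964) and double danda (U+0965).
DEVANAGARI_BLOCK = r'\u0900-\u097F'
# Hypothetical narrowed range: letters, marks and digits only.
# Skips U+0964 (danda), U+0965 (double danda) and U+0970 (abbreviation sign).
DEVANAGARI_LETTERS = r'\u0900-\u0963\u0966-\u096F\u0971-\u097F'

DANDA = "\u0964"


def split_words(text, chars):
    """Return runs of word characters plus single non-space punctuation marks."""
    pattern = re.compile(fr"[{chars}]+|[^{chars}\s]")
    return pattern.findall(text)


# With the full block the danda is glued onto the word.
full = split_words("दुनिया" + DANDA, DEVANAGARI_BLOCK)
# With the narrowed range the word and the danda come out as separate tokens.
narrow = split_words("दुनिया" + DANDA, DEVANAGARI_LETTERS)
```

The same narrowing would need to be repeated for each of the other script ranges if they contain script-specific punctuation.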

Copilot · 2026-02-11T17:56:39Z

nemo/collections/tts/g2p/models/i18n_ipa.py

    STRESS_SYMBOLS = ["ˈ", "ˌ"]
    # Regex for roman characters, accented characters, and locale-agnostic numbers/digits
-    CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}\d]")
-    PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}\d]")
+    CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")
+    PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")


CHAR_REGEX/PUNCT_REGEX are expanded to include INDIC_CHARS_ALL, but since INDIC_CHARS_ALL currently includes Devanagari punctuation (e.g., ।), parse_one_word() will no longer recognize those symbols as punctuation-only tokens (CHAR_REGEX.search() will match). This can cause Hindi punctuation to be treated as part of a word and lead to OOV fallbacks instead of dictionary lookups. Recommend switching to an Indic letter/mark set (excluding danda/double danda) for CHAR_REGEX and treating those punctuation marks via the punctuation path.
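
One possible way to follow this recommendation without rewriting the block ranges is to subtract the script punctuation with a negative lookahead. The sketch below assumes a simplified `LATIN` class in place of `LATIN_CHARS_ALL` and a hypothetical `INDIC_PUNCT` set; it covers Devanagari only and is not the code from the PR.

```python
import re

LATIN = r'a-zA-Z'                    # simplified stand-in for LATIN_CHARS_ALL
INDIC_BLOCK = r'\u0900-\u097F'       # full Devanagari block, as in the PR
INDIC_PUNCT = r'\u0964\u0965\u0970'  # hypothetical: danda, double danda, abbreviation sign

# A character counts as a word character only if it is NOT script punctuation.
CHAR_REGEX = re.compile(fr"(?![{INDIC_PUNCT}])[{LATIN}{INDIC_BLOCK}\d]")
# Punctuation is the complement: script punctuation, or anything outside the word classes.
PUNCT_REGEX = re.compile(fr"[{INDIC_PUNCT}]|[^{LATIN}{INDIC_BLOCK}\d]")


def is_punct_only(token):
    """True when the token holds no word character (a CHAR_REGEX.search() style check)."""
    return CHAR_REGEX.search(token) is None


danda_is_punct = is_punct_only("\u0964")
word_is_punct = is_punct_only("नमस्ते\u0964")
punct_marks = PUNCT_REGEX.findall("नमस्ते\u0964!")
```

Keeping the two patterns as exact complements means every character is classified as either a word character or punctuation, never both.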

Copilot · 2026-02-11T17:56:39Z

nemo/collections/tts/g2p/models/i18n_ipa.py

                        or 'a' <= line[0] <= 'z'
                        or 'À' <= line[0] <= 'Ö'
                        or 'Ø' <= line[0] <= 'ö'
                        or 'ø' <= line[0] <= 'ÿ'
+                        or _INDIC_PATTERN.match(line[0])
                        or line[0] == "'"
                    ):
                        parts = line.strip().split(maxsplit=1)


_INDIC_PATTERN is built from INDIC_CHARS_ALL and used to decide whether a dictionary line begins with a “word”. Because the underlying ranges include non-letters (digits/punctuation inside the script blocks), this may incorrectly accept lines that start with script punctuation (or digits) as lexicon entries. Tightening the Indic ranges (letters/marks only, excluding danda) will make dictionary parsing safer and consistent with tokenization.
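
A stricter first-character check could be sketched with `unicodedata`, accepting only letters and marks inside the six script blocks listed in the PR. `INDIC_BLOCKS` and `starts_with_indic_letter` are hypothetical names, and this is an illustration of the idea rather than a drop-in replacement for `_INDIC_PATTERN`.

```python
import unicodedata

# Code point spans of the six script blocks listed in the PR.
INDIC_BLOCKS = [
    (0x0900, 0x097F),  # Devanagari
    (0x0980, 0x09FF),  # Bengali
    (0x0A80, 0x0AFF),  # Gujarati
    (0x0B80, 0x0BFF),  # Tamil
    (0x0C00, 0x0C7F),  # Telugu
    (0x0C80, 0x0CFF),  # Kannada
]


def starts_with_indic_letter(line):
    """True when the first character is an Indic letter or mark (not a digit or punctuation)."""
    if not line:
        return False
    ch = line[0]
    in_block = any(lo <= ord(ch) <= hi for lo, hi in INDIC_BLOCKS)
    # General categories starting with "L" are letters and "M" are combining marks.
    return in_block and unicodedata.category(ch)[0] in ("L", "M")


accepted = starts_with_indic_letter("नमस्ते nəmˈʌsteː")
rejected_danda = starts_with_indic_letter("\u0964 x")
rejected_digit = starts_with_indic_letter("\u0966 x")
```

Latin-initial lines are intentionally not handled here, since the surrounding condition already checks for them.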

Copilot · 2026-02-11T17:56:39Z

tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py

+    def test_ipa_tokenizer_hi_in(self):
+        input_text = "नमस्ते दुनिया"
+        expected_output = "nəmˈʌsteː dˈʊnɪjˌaː"
+        g2p = IpaG2p(phoneme_dict=self.PHONEME_DICT_HI, locale="hi-IN")
+        tokenizer = IPATokenizer(g2p=g2p, locale="hi-IN")
+        chars, tokens = self._parse_text(tokenizer, input_text)
+        assert chars == expected_output


The new Hindi test covers basic word lookup, but it doesn’t cover the main edge case introduced by the regex updates: Indic punctuation separation (e.g., नमस्ते। should still map नमस्ते via the dict and keep । as punctuation). Adding a unit test for this will prevent regressions if the Indic character ranges are refined to exclude danda/double danda.
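
The shape of such a test can be sketched without NeMo. `TOY_DICT`, `TOKEN_RE` and `toy_g2p` are hypothetical stand-ins for the phoneme dictionary and the tokenizer, and the two pronunciations are taken from the expected output of the test above. A real test would call `IpaG2p` and `IPATokenizer` as the existing one does.

```python
import re

# Toy stand-in for the Hindi phoneme dictionary.
TOY_DICT = {"नमस्ते": "nəmˈʌsteː", "दुनिया": "dˈʊnɪjˌaː"}

# Tokens: a danda or double danda, a run of other non-space characters, or whitespace.
TOKEN_RE = re.compile(r"[\u0964\u0965]|[^\u0964\u0965\s]+|\s+")


def toy_g2p(text):
    """Replace dictionary words with their pronunciation and pass other tokens through."""
    return "".join(TOY_DICT.get(tok, tok) for tok in TOKEN_RE.findall(text))


def test_danda_is_kept_as_punctuation():
    # The word before the danda must still be found in the dictionary.
    assert toy_g2p("नमस्ते\u0964") == "nəmˈʌsteː\u0964"
    assert toy_g2p("नमस्ते दुनिया\u0964") == "nəmˈʌsteː dˈʊnɪjˌaː\u0964"


test_danda_is_kept_as_punctuation()
```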

XuesongYang · 2026-02-11T20:54:01Z

Some of Copilot's comments seem reasonable. Could you please take a look and ignore any that are irrelevant? Thanks!

quapham and others added 4 commits January 27, 2026 08:42

Hindi IPAG2P

633adb7

Signed-off-by: quanpham <youngkwan199@gmail.com>

Update Hindi IpaG2p: add Indic character pattern support

385e555

Signed-off-by: quanpham <youngkwan199@gmail.com>

Merge branch 'NVIDIA-NeMo:main' into Hindi_IPA

30379a8

Merge branch 'NVIDIA-NeMo:main' into Hindi_IPA

f43712b

github-actions bot added TTS common labels Feb 9, 2026

Apply isort and black reformatting

1909639

Signed-off-by: quapham <quapham@users.noreply.github.com>

XuesongYang requested review from XuesongYang and Copilot February 9, 2026 18:06

Copilot started reviewing on behalf of XuesongYang February 9, 2026 19:34

XuesongYang added Run CICD skip-linting labels Feb 9, 2026

XuesongYang temporarily deployed to test February 9, 2026 19:36 — with GitHub Actions Inactive

XuesongYang reviewed Feb 9, 2026

Copilot AI reviewed Feb 9, 2026

XuesongYang removed the skip-linting label Feb 11, 2026

XuesongYang requested a review from Copilot February 11, 2026 17:49

Copilot started reviewing on behalf of XuesongYang February 11, 2026 17:49

Copilot AI reviewed Feb 11, 2026

quapham wants to merge 5 commits into NVIDIA-NeMo:main from quapham:Hindi_IPA
