Add support for Indic languages (Hindi) in IPA G2P #15371

Open
quapham wants to merge 5 commits into NVIDIA-NeMo:main from quapham:Hindi_IPA

Conversation

@quapham
Contributor

@quapham quapham commented Feb 9, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

This PR extends the IPA G2P system to support Indic languages (Hindi).

Collection: TTS, common

Changelog

  • Add Unicode character ranges for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Gujarati)
  • Extend word-level regex to support Indic scripts in IPA G2P
  • Update character and punctuation regex to include Indic characters
  • Add Hindi IPA dictionary support

Usage

  • You can potentially add a usage example below
hindi_phoneme:
      _target_: nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer
      locale: hi-IN
      punct: true
      apostrophe: true
      pad_with_space: true
      g2p:
        _target_: nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p
        locale: 'hi-IN'
        phoneme_dict: "scripts/tts_dataset_files/hi_IN/hi_en_prondict-v0.1.dict"
        phoneme_probability: 0.8
        use_chars: true
        use_stresses: false
        grapheme_case: upper
        grapheme_prefix: '#'

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: quapham <quapham@users.noreply.github.com>
Collaborator

Is there any reason why this dictionary contains both English and Hindi? I would expect a Hindi dict to contain only Hindi entries.

Contributor Author

We want to support mixed Hindi and English. Since Hindi does not use Latin characters, we also need to include English characters in the dictionary.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.



Comment on lines +57 to +66
# Indic characters based on https://www.unicode.org/charts/
DEVANAGARI_CHARS = (
    r'\u0900-\u097F'  # Hindi, Marathi, Nepali, Sanskrit https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)
)
BENGALI_CHARS = r'\u0980-\u09FF' # Bengali, Assamese
TAMIL_CHARS = r'\u0B80-\u0BFF' # Tamil
TELUGU_CHARS = r'\u0C00-\u0C7F' # Telugu
KANNADA_CHARS = r'\u0C80-\u0CFF' # Kannada
GUJARATI_CHARS = r'\u0A80-\u0AFF' # Gujarati
INDIC_CHARS_ALL = f"{DEVANAGARI_CHARS}{BENGALI_CHARS}{TAMIL_CHARS}{TELUGU_CHARS}{KANNADA_CHARS}{GUJARATI_CHARS}"

Copilot AI Feb 11, 2026

INDIC_CHARS_ALL uses entire Unicode blocks (e.g., \u0900-\u097F), which include punctuation such as the Devanagari danda (U+0964). Because _WORDS_RE_ANY_LOCALE treats everything in INDIC_CHARS_ALL as part of a “word”, strings like दुनिया। will be tokenized as a single word and won't match phoneme-dict entries for दुनिया. Consider narrowing these ranges to letters/marks (and optionally digits) and explicitly excluding script punctuation like ।/॥ so they tokenize as punctuation separators.
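The tokenization issue described above can be reproduced with a small standalone sketch. It does not use NeMo's `_WORDS_RE_ANY_LOCALE`; `split_words` and `DEVANAGARI_LETTERS_MARKS` are hypothetical names, and the narrowed range (U+0900-U+0963 plus U+0971-U+097F) is one possible choice, not the range the PR adopts.

```python
import re

# Full Devanagari block as used in the PR; it contains danda (U+0964)
# and double danda (U+0965).
DEVANAGARI_BLOCK = r'\u0900-\u097F'

# Hypothetical narrowed class (an assumption, not the PR's code):
# letters and combining marks only, skipping danda, double danda,
# the digits U+0966-U+096F, and the abbreviation sign U+0970.
DEVANAGARI_LETTERS_MARKS = r'\u0900-\u0963\u0971-\u097F'


def split_words(text, char_class):
    """Return runs of word characters, plus each other non-space character on its own."""
    pattern = re.compile(fr"[{char_class}]+|[^\s{char_class}]")
    return pattern.findall(text)


DANDA = "\u0964"
sample = "दुनिया" + DANDA

full_block_tokens = split_words(sample, DEVANAGARI_BLOCK)
narrowed_tokens = split_words(sample, DEVANAGARI_LETTERS_MARKS)

print(full_block_tokens)  # the danda stays attached to the word
print(narrowed_tokens)    # the danda becomes a separate token
```

With the narrowed class the word token is exactly दुनिया, so a dictionary keyed on that spelling can match it.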

Copilot uses AI. Check for mistakes.
Comment on lines 39 to +42
  STRESS_SYMBOLS = ["ˈ", "ˌ"]
  # Regex for roman characters, accented characters, and locale-agnostic numbers/digits
- CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}\d]")
- PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}\d]")
+ CHAR_REGEX = re.compile(fr"[{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")
+ PUNCT_REGEX = re.compile(fr"[^{LATIN_CHARS_ALL}{INDIC_CHARS_ALL}\d]")

Copilot AI Feb 11, 2026

CHAR_REGEX/PUNCT_REGEX are expanded to include INDIC_CHARS_ALL, but since INDIC_CHARS_ALL currently includes Devanagari punctuation (e.g., ।/॥), parse_one_word() will no longer recognize those symbols as punctuation-only tokens (CHAR_REGEX.search() will match). This can cause Hindi punctuation to be treated as part of a word and lead to OOV fallbacks instead of dictionary lookups. Recommend switching to an Indic letter/mark set (excluding danda/double danda) for CHAR_REGEX and treating those punctuation marks via the punctuation path.
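A rough illustration of the punctuation-only check follows. `is_punct_only` is a hypothetical stand-in for the `CHAR_REGEX.search()` test mentioned above, not NeMo's `parse_one_word()`; `LATIN` is a simplified substitute for `LATIN_CHARS_ALL`, and the narrowed Devanagari range is an assumption.

```python
import re

LATIN = r"A-Za-z"  # simplified stand-in for LATIN_CHARS_ALL (assumption)
FULL_BLOCK = r"\u0900-\u097F"                 # whole Devanagari block, as in the PR
LETTERS_MARKS = r"\u0900-\u0963\u0971-\u097F"  # hypothetical letters/marks-only range


def is_punct_only(token, indic_class):
    """True when the token contains no character from the 'word character' class."""
    char_regex = re.compile(fr"[{LATIN}{indic_class}\d]")
    return char_regex.search(token) is None


DANDA = "\u0964"
DOUBLE_DANDA = "\u0965"

# With the full block, danda counts as a word character.
print(is_punct_only(DANDA, FULL_BLOCK))
# With the narrowed class, danda and double danda take the punctuation path.
print(is_punct_only(DANDA, LETTERS_MARKS), is_punct_only(DOUBLE_DANDA, LETTERS_MARKS))
# Ordinary words are unaffected by the narrowing.
print(is_punct_only("नमस्ते", LETTERS_MARKS))
```

Devanagari digits are still treated as word characters in this sketch, because `\d` matches any Unicode decimal digit in Python `str` patterns.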

Comment on lines 193 to 200
or 'a' <= line[0] <= 'z'
or 'À' <= line[0] <= 'Ö'
or 'Ø' <= line[0] <= 'ö'
or 'ø' <= line[0] <= 'ÿ'
or _INDIC_PATTERN.match(line[0])
or line[0] == "'"
):
parts = line.strip().split(maxsplit=1)

Copilot AI Feb 11, 2026

_INDIC_PATTERN is built from INDIC_CHARS_ALL and used to decide whether a dictionary line begins with a “word”. Because the underlying ranges include non-letters (digits/punctuation inside the script blocks), this may incorrectly accept lines that start with script punctuation (or digits) as lexicon entries. Tightening the Indic ranges (letters/marks only, excluding danda) will make dictionary parsing safer and consistent with tokenization.
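One way to tighten the first-character check is to consult Unicode general categories instead of raw block ranges. The sketch below is an assumption about how such a filter could look; `starts_with_indic_letter_or_mark` is a hypothetical helper, and it covers only the Devanagari block for brevity.

```python
import unicodedata


def starts_with_indic_letter_or_mark(line, start=0x0900, end=0x097F):
    """Accept a line only if its first character is a letter or mark in the given block.

    The default block is Devanagari; danda (Po), digits (Nd), and other
    non-letter code points in the block are rejected.
    """
    if not line:
        return False
    first = line[0]
    if not (start <= ord(first) <= end):
        return False
    # General categories starting with "L" are letters, "M" are combining marks.
    return unicodedata.category(first)[0] in ("L", "M")


print(starts_with_indic_letter_or_mark("नमस्ते nəmˈʌsteː"))  # begins with a letter
print(starts_with_indic_letter_or_mark("\u0964 x"))          # begins with danda
print(starts_with_indic_letter_or_mark("\u0967 x"))          # begins with a Devanagari digit
```

A category-based check avoids hand-maintaining range lists per script, at the cost of one `unicodedata` lookup per dictionary line.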

Comment on lines +282 to +288
def test_ipa_tokenizer_hi_in(self):
    input_text = "नमस्ते दुनिया"
    expected_output = "nəmˈʌsteː dˈʊnɪjˌaː"
    g2p = IpaG2p(phoneme_dict=self.PHONEME_DICT_HI, locale="hi-IN")
    tokenizer = IPATokenizer(g2p=g2p, locale="hi-IN")
    chars, tokens = self._parse_text(tokenizer, input_text)
    assert chars == expected_output

Copilot AI Feb 11, 2026

The new Hindi test covers basic word lookup, but it doesn’t cover the main edge case introduced by the regex updates: Indic punctuation separation (e.g., नमस्ते। should still map नमस्ते via the dict and keep । as punctuation). Adding a unit test for this will prevent regressions if the Indic character ranges are refined to exclude danda/double danda.
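The shape of such a regression test can be sketched without NeMo. Everything here is a stand-in: `toy_g2p`, `TOY_DICT`, and the narrowed character class are hypothetical, and a real test would presumably go through `IpaG2p` and `IPATokenizer` as the existing Hindi test does. The pronunciation string is copied from that test's expected output.

```python
import re

# Hypothetical narrowed class: Devanagari letters and marks, without danda/double danda.
DEVANAGARI_LETTERS_MARKS = r"\u0900-\u0963\u0971-\u097F"
TOKEN_RE = re.compile(fr"[{DEVANAGARI_LETTERS_MARKS}]+|[^{DEVANAGARI_LETTERS_MARKS}]")

# Toy dictionary; the pronunciation comes from the existing test's expected output.
TOY_DICT = {"नमस्ते": "nəmˈʌsteː"}


def toy_g2p(text):
    """Look up dictionary words; pass every other token through unchanged."""
    return "".join(TOY_DICT.get(token, token) for token in TOKEN_RE.findall(text))


def test_danda_is_separated_from_word():
    danda = "\u0964"
    # The word is still found in the dictionary and the danda survives as punctuation.
    assert toy_g2p("नमस्ते" + danda) == "nəmˈʌsteː" + danda


test_danda_is_separated_from_word()
```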

@XuesongYang
Collaborator

Some of Copilot's comments seem reasonable. Could you please take a look, and ignore any that are irrelevant? Thanks!
