Possibility of using regex for generating geminates #105

Open
kudanai opened this issue Dec 29, 2021 · 2 comments
Comments

@kudanai

kudanai commented Dec 29, 2021

I'm trying to write post-processing rules for div-Thaa over on this fork.

The rules dictate that, in some situations, the grapheme އް causes the following consonant to be geminated. I can't figure out whether this can be done with a single regex rule using a match group.

For the time being I've added the cases as individual rules here. The rules in question are the ones with <AS> in them.

TL;DR:

  • Is it possible to write gemination rules in regex?
  • If not, do the rules as written here make sense?

Apologies in advance if this is a redundant question and I missed something in the docs.

@dmort27
Owner

dmort27 commented Dec 29, 2021

Here are some comments:

There are two ways of writing geminate consonants in the IPA:

  1. Doubling the consonant (unless it is an affricate, in which case the plosive is doubled)
  2. Using the long mark (ː).

For reasons of parseability with PanPhon, the second solution is the approved Epitran solution (so <އް> could simply be mapped to /ː/). If you need doubling instead, you can achieve this with a regular expression and named groups, e.g.:

(?P<seg>(p|t|k)): -> \g<seg>\g<seg> / _

will change p:, t:, and k: to pp, tt, and kk.
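For comparison, here is a minimal sketch of the same doubling done with Python's own `re` module rather than an Epitran rule file. The helper name `double_geminates` and the small consonant sets are assumptions made for illustration; they are not part of Epitran. The sketch uses the IPA length mark (ː) rather than an ASCII colon, and it also covers the affricate case from the list above, where only the plosive part is doubled.

```python
import re

# Hypothetical helper, not part of Epitran: rewrites "consonant + length mark"
# as a doubled consonant, using named groups in the replacement string.
LONG = "\u02d0"  # IPA length mark (ː)

# Affricates written with a tie bar (U+0361); only the plosive half is doubled.
AFFRICATE = re.compile(r"(?P<stop>[td])(?P<rest>\u0361[ʃʒ])" + LONG)
# Toy set of plain consonants, matching the p/t/k example above.
PLAIN = re.compile(r"(?P<seg>[ptk])" + LONG)


def double_geminates(ipa):
    """Replace length-marked consonants with doubled consonants."""
    # t͡ʃː -> tt͡ʃ and d͡ʒː -> dd͡ʒ (plosive doubled, fricative part kept once)
    ipa = AFFRICATE.sub(r"\g<stop>\g<stop>\g<rest>", ipa)
    # pː -> pp, tː -> tt, kː -> kk
    return PLAIN.sub(r"\g<seg>\g<seg>", ipa)


print(double_geminates("ap" + LONG + "a"))  # appa
```

Running the affricate substitution first keeps the plain-consonant pattern from ever seeing a plosive that belongs to an affricate, because the length mark has already been consumed by then.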

The prefixed \s? in your rules is not doing any good since it doesn't rule anything out—a substring either is or is not preceded by a space. In any case, you should be using Epitran with already tokenized text rather than passages with internal whitespace. Otherwise, your rules look fine.
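The point about the optional `\s?` prefix can be checked with plain Python regular expressions. This is only a sketch with a toy pattern (`b`) and a hypothetical helper `core_matches`; it does not use Epitran itself. It shows that the prefix does not change which occurrences are found, although it does consume a preceding space when a substitution is applied.

```python
import re


def core_matches(pattern, text):
    """Return the spans of the named group 'core' for every match."""
    return [m.span("core") for m in re.finditer(pattern, text)]


text = "ab a b"

# The same two occurrences of "b" are found with or without the prefix.
without_prefix = core_matches(r"(?P<core>b)", text)
with_prefix = core_matches(r"\s?(?P<core>b)", text)
print(without_prefix == with_prefix)  # True

# The only difference: the optional prefix swallows a preceding space.
print(re.sub(r"b", "X", text))      # aX a X
print(re.sub(r"\s?b", "X", text))   # aX aX
```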

@kudanai
Author

kudanai commented Dec 29, 2021

Thank you for the comments.

First, on the \s?: whitespace is a bit tricky in this script. The effect of the following consonant on އް can extend across a token boundary. This probably means I need a better tokeniser than the ones currently available. I will investigate this further and sort it out before I request a merge.

On the geminates, I first tried the rule below, which did not seem to work (I'm not sure whether I'm writing the rule wrong or whether the \g syntax just isn't working for me). So, taking your suggestion to use the long mark, I rewrote it using the swap groups:

## This did not work
<AS>(?P<seg>::consonant::) -> \g<seg>\g<seg> / _

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋa\g<seg>\g<seg>afuŋɡe

but

## This works
(?P<sw1><AS>)\s?(?P<sw2>::consonant::) -> 0 / _
<AS> -> : / (::consonant::) _ (::vowel::)

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋaz:afuŋɡe
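To make the two-step approach easier to follow, here is a rough emulation of it in plain Python `re`. It is a sketch under assumptions: the function name `apply_rules` is made up, the consonant and vowel classes are small toy sets rather than the real `::consonant::` and `::vowel::` definitions, and it does not go through Epitran's rule engine at all.

```python
import re

AS = re.escape("<AS>")            # placeholder token standing in for އް
CONSONANT = "[bdfhklmnprstz]"     # assumed toy inventory, not the real class
VOWEL = "[aeiou]"                 # assumed toy inventory, not the real class


def apply_rules(s):
    """Emulate the swap rule followed by the length-mark rule."""
    # Step 1 (swap): "<AS>" + optional space + consonant -> consonant + "<AS>"
    s = re.sub(
        rf"(?P<sw1>{AS})\s?(?P<sw2>{CONSONANT})",
        r"\g<sw2>\g<sw1>",
        s,
    )
    # Step 2: "<AS>" -> ":" when it sits between a consonant and a vowel
    s = re.sub(rf"(?<={CONSONANT}){AS}(?={VOWEL})", ":", s)
    return s


print(apply_rules("a<AS>za"))   # az:a
```

In this emulation, an `<AS>` that is not followed by a vowel after the swap is left in place, so a later rule would still have to handle it.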

Does this have an impact on affricates? We have two: /d͡ʒ/ and /t͡ʃ/.

Also, I'm hesitant to simply map އް to : because it would complicate the post-processing rules: the language uses a lot of long vowels, and އް can also cause pre-nasalisation or serve as a glottal stop, depending on context.
