Morphology support #1

Open
vgel opened this issue Oct 19, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

vgel (Owner) commented Oct 19, 2020

Right now terminal tokens have to be separate words. Treebender should be able to support morphological rules:

V[ stem: t ] -> walk
V[ stem: t ] -> talk
// stem: f to block walkedededededededed...
V[ tense: past, stem: f ] -> V[ stem: t ] ++ ed  // syntax TBD
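
To illustrate how the `stem` feature blocks repeated suffixation, here is a minimal Python sketch. It is a toy model of the rules above, not Treebender's actual data structures: `LEXICON` and `analyze` are hypothetical names, and feature structures are simplified to plain dicts.

```python
# Hypothetical toy model of the rules above; not Treebender's real API.
LEXICON = {"walk": {"stem": "t"}, "talk": {"stem": "t"}}

def analyze(word):
    """Return a feature dict for a verb form, or None if no rule derives it."""
    if word in LEXICON:
        return dict(LEXICON[word])
    if word.endswith("ed"):
        base = analyze(word[:-2])
        # The suffix rule only accepts a daughter with stem: t ...
        if base is not None and base["stem"] == "t":
            # ... and its result is stem: f, so it cannot feed the rule again.
            return {"tense": "past", "stem": "f"}
    return None

print(analyze("walked"))    # derived once from "walk"
print(analyze("walkeded"))  # rejected: "walked" is already stem: f
```

Because the output of the suffix rule carries `stem: f`, a second application fails the `stem: t` check, which is the blocking effect the comment in the rule describes.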

Questions:

  • What scope do we want here? Are we only supporting basic concatenative morphology (prefixes and suffixes), or will we also try to support allomorphy, sound changes / ablaut, Semitic roots, ...?
    • It's tempting to say we just focus on English, support concatenative morphology only, and let the user fall back to hand-written forms with a flag:
        V[ can_inflect: y ] -> walk
        V[ can_inflect: n ] -> buy
        V[ tense: past, can_inflect: n ] -> V[ can_inflect: y ] ++ ed
        V[ tense: past, can_inflect: n ] -> bought
    • However, lots of common English words have spelling changes at the morpheme boundary, like bake ~ baked, not *bakeed. There's no real way to support that without a more sophisticated tool or tons of duplicate rules.
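
The limitation can be seen in a small Python sketch of the `can_inflect` fallback. The names `VERBS`, `IRREGULAR_PAST`, and `past_tense` are hypothetical and only mimic the rules above; they are not part of Treebender.

```python
# Hypothetical lexicon following the can_inflect sketch above.
VERBS = {"walk": "y", "buy": "n", "bake": "y"}
IRREGULAR_PAST = {"buy": "bought"}

def past_tense(verb):
    """Build a past-tense form by pure concatenation, or look up a listed form."""
    if VERBS[verb] == "y":
        return verb + "ed"       # plain `V ++ ed`, no spelling adjustment
    return IRREGULAR_PAST[verb]  # fallback for can_inflect: n

for verb in VERBS:
    print(verb, "->", past_tense(verb))
```

Here `walk` and `buy` come out as expected, but `bake` becomes `bakeed`. Under this scheme the only fixes are marking `bake` as `can_inflect: n` and listing `baked` by hand, or adding a separate rule for stems ending in `e`.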
    
    

Todo:

  • Remind myself of how the LKB does this
vgel added the enhancement label Oct 19, 2020

vgel (Owner, Author) commented Oct 20, 2020

One way to approach this would actually be to just allow grammar files to define a token-splitting process that runs before parsing.

Something like:

$splitters = [
    /(.+)ed$/ => [\1, -ed]
    /(.+)d$/  => [\1, -ed] // for words like "baked"
    /(.+)s$/  => [\1, -s]
    /(.+)es$/ => [\1, -s]
]

Every splitter that matches a word would apply, alongside an implicit "no expansion" splitter, so a sentence expands into every combination of possible morphological derivations:

"The dogs walked to the beach and baked"
"The dogs walk -ed to the beach and baked"
"The dogs walke -ed to the beach and baked"
"The dog -s walked to the beach and baked"
"The dog -s walk -ed to the beach and baked"
"The dog -s walke -ed to the beach and baked"
"The dogs walked to the beach and bak -ed"
"The dogs walk -ed to the beach and bak -ed"
"The dogs walke -ed to the beach and bak -ed"
"The dog -s walked to the beach and bak -ed"
"The dog -s walk -ed to the beach and bak -ed"
"The dog -s walke -ed to the beach and bak -ed"
"The dogs walked to the beach and bake -ed"
"The dogs walk -ed to the beach and bake -ed"
"The dogs walke -ed to the beach and bake -ed"
"The dog -s walked to the beach and bake -ed"
==> "The dog -s walk -ed to the beach and bake -ed"
"The dog -s walke -ed to the beach and bake -ed"

Obviously this has the potential to blow up combinatorially, but we could fail fast whenever a splitter generates a token that doesn't appear as a terminal in the grammar.
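
The splitting and fail-fast idea above can be sketched in Python. This is only an illustration under some assumptions: `SPLITTERS`, `split_word`, `derivations`, and the `lexicon` parameter are hypothetical names, each pattern is treated as matching the whole word, and the lexicon check is case-insensitive.

```python
import itertools
import re

# Hypothetical splitter table mirroring the $splitters sketch: (pattern, suffix token).
SPLITTERS = [
    (r"(.+)ed", "-ed"),
    (r"(.+)d", "-ed"),  # for words like "baked"
    (r"(.+)s", "-s"),
    (r"(.+)es", "-s"),
]

def split_word(word):
    """Return every tokenization of one word, starting with the implicit no-op."""
    options = [[word]]
    for pattern, suffix in SPLITTERS:
        match = re.fullmatch(pattern, word)
        if match:
            candidate = [match.group(1), suffix]
            if candidate not in options:
                options.append(candidate)
    return options

def derivations(sentence, lexicon=None):
    """Expand a sentence into all derivations, optionally pruning by lexicon."""
    per_word = []
    for word in sentence.split():
        options = split_word(word)
        if lexicon is not None:
            # Fail fast: drop any split that produces a token the grammar lacks.
            options = [o for o in options if all(t.lower() in lexicon for t in o)]
        per_word.append(options)
    # If any word has no surviving option, the product is empty.
    return [" ".join(tok for opt in combo for tok in opt)
            for combo in itertools.product(*per_word)]

sentence = "The dogs walked to the beach and baked"
lexicon = {"the", "dog", "walk", "to", "beach", "and", "bake", "-s", "-ed"}
print(len(derivations(sentence)))
print(derivations(sentence, lexicon))
```

Note that under these patterns "and" also splits into "an -ed", so the unfiltered count here is 36 rather than the 18 listed above. With the lexicon check, pruning happens per word before any combinations are built, and only the derivation marked with `==>` survives.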
