Dictionary structure #55

Open
daxida opened this issue Jun 3, 2024 · 3 comments
Labels
question Further information is requested

Comments

@daxida

daxida commented Jun 3, 2024

I'm sorry if this is not the right place to ask.

I recently found this repository via wiktextract and I would like to do something similar for another API. I was browsing for a while but I could not find a description of the JSON entries that are used, like this one here.

I understand that the dictionary format Yomitan expects is something like:

[
  [
    word,
    "",  # what is this?
    "v vt", # some grammar tag but not sure about the difference with the next one.
    "v",
    0, # what is this?
    list of translations,
    0, # what is this?
    "" # what is this?
  ],
  etc.
]

Could you give me some headers? I've also tried the Yomitan repo but I could not find much information about it. Maybe it's a standard dictionary format that I'm unaware of?

@StefanVukovic99
Collaborator

It's described by a schema in yomitan: https://github.com/themoeway/yomitan/blob/master/ext/data/schemas/dictionary-term-bank-v3-schema.json
Some of the fields are pretty much obsolete, I'd say. There are other schemas in that folder for the IPA, index.json, etc.
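
If reading the schema directly is awkward, a hand-rolled shape check can make the expected layout more concrete. The following is only a sketch based on my reading of the term bank v3 schema (eight positional items with the types listed below); the names in `EXPECTED_TYPES` and the function `check_term_entry` are hypothetical labels, not anything defined by Yomitan, and validating against the real schema file remains the authoritative check.

```python
from numbers import Real

# Position -> (illustrative name, expected Python type).
# Assumption: these types mirror the v3 term bank schema as I understand it.
EXPECTED_TYPES = [
    ("term", str),
    ("reading", str),
    ("definition_tags", (str, type(None))),
    ("rules", str),
    ("score", Real),
    ("definitions", list),
    ("sequence", int),
    ("term_tags", str),
]


def check_term_entry(entry):
    """Return a list of problems found in `entry`; an empty list means none."""
    if not isinstance(entry, list):
        return ["entry is not an array"]
    if len(entry) != len(EXPECTED_TYPES):
        return [f"expected {len(EXPECTED_TYPES)} items, got {len(entry)}"]
    problems = []
    for index, ((name, expected), value) in enumerate(zip(EXPECTED_TYPES, entry)):
        # bool is a subclass of int in Python, but JSON true/false is not a number.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"item {index} ({name}) has type {type(value).__name__}")
    return problems


# Entry shaped like the one in the question, with a placeholder translation.
print(check_term_entry(["word", "", "v vt", "v", 0, ["translation"], 0, ""]))
# Same entry, but with the definitions given as a string instead of an array.
print(check_term_entry(["word", "", "v vt", "v", 0, "translation", 0, ""]))
```

This only checks the outer array; it does not look inside the definitions, which can be richer than plain strings.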

Just to make sure you're not doing more than you need, this is converting something other than kaikki, i.e. this is separate from tatuylonen/wiktextract#651?

@daxida
Author

daxida commented Jun 3, 2024

Thank you for the link.

I'm unfortunately still having trouble parsing that schema. Is it obvious from that what maps to what in the lines that I previously commented?

And thank you for your concern: this is a separate matter. I didn't mention it before because I was afraid of being instantly dismissed for being off topic. I'm toying with the idea of making a Yomitan-compatible dictionary like yours from a website called LingQ.

Their entries are very simple in comparison to that schema:

{
  "pk": 459243703,
  "url": "https://www.lingq.com/api/v3/el/cards/459243703/",
  "term": "εκφώνησής",
  "fragment": "διαδικασία προγραμματισμού της εκφώνησής σας, να",
  "importance": 0,
  "status": 0,
  "extended_status": null,
  "last_reviewed_correct": null,
  "srs_due_date": "2023-09-12T08:26:23.907721",
  "notes": "",
  "audio": null,
  "words": [
    "εκφώνησής"
  ],
  "tags": [],
  "hints": [
    {
      "id": 129173102,
      "locale": "en",
      "text": "of reading (aloud)",
      "term": "εκφώνησής",
      "popularity": 2,
      "is_google_translate": true,
      "flagged": false
    }
  ],
  "transliteration": {
    "latin": [
      "ekfonisis"
    ]
  },
  "gTags": [],
  "wordTags": [],
  "readings": {

  },
  "writings": [
    "εκφώνησής",
    "εκφωνησης"
  ]
}

There are some things like fragments (sort of "example sentence") that I'm still not sure where to put.

@StefanVukovic99
Collaborator

StefanVukovic99 commented Jun 3, 2024

Here are some more details:

[
  "居住者",
  "きょじゅうしゃ",
  "n",
  "",
  604,
  [
    "resident",
    "inhabitant"
  ],
  1717870,
  "P news"
]
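
Since the format is purely positional, it can help to give the eight slots names while working with them. A minimal sketch in Python, assuming the entry above is loaded from JSON; the class `TermEntry` and its field names are my own illustrative labels (the format itself defines only positions), chosen to match the numbered explanation in this comment.

```python
import json
from typing import Any, List, NamedTuple


class TermEntry(NamedTuple):
    # Field names are illustrative; the file format only knows positions.
    term: str
    reading: str
    definition_tags: str
    rules: str
    score: int
    definitions: List[Any]
    sequence: int
    term_tags: str


raw = '["居住者", "きょじゅうしゃ", "n", "", 604, ["resident", "inhabitant"], 1717870, "P news"]'
entry = TermEntry(*json.loads(raw))

print(entry.term, entry.reading)
print(entry.definitions)

# Converting back to a list restores the positional array for serialization.
print(json.dumps(list(entry), ensure_ascii=False))
```

The named tuple is only a convenience for reading and writing code; what ends up in the dictionary file is still the plain array.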

[Image: Screenshot from 2024-06-03 22-21-12]

  1. The term/expression/headword
  2. The "reading" - in Japanese, this is the term in kana, used to disambiguate readings. In other languages it can be used in a similar way, or to display the term with optional diacritics, e.g. in Latin and Farsi:
    [Image: latin occido]
    [Image: farsi]
  3. Definition tags - these are abbreviations referring to the full tags defined in tag_bank_1.json. They can be about the part of speech, but also usage qualifiers (rare, archaic, vulgar...), field (law, biology, astronomy...), or region (British/American and such). When you click on them, the full tag name is shown.
  4. Rule identifiers - these refer to the conditions defined in the "transforms" (aka deinflections) file for that language (see english-transforms.js) and help deinflection be more precise. If a language has no deinflection yet, they are unnecessary.
  5. Score - basically vestigial IMO; frequency dictionaries are now used for sorting instead.
  6. Array of definitions. Note that these can be simple strings, but also "structured" (HTML, which lets you make fancy definitions; you might want to use this to format your example sentences and whatnot) or "deinflection definitions" (these redirect to another dictionary entry and can be used for conjugated forms, alternate written forms...).
  7. Sequence number - the schema describes it as:
{
    "type": "integer",
    "description": "Sequence number for the term. Terms with the same sequence number can be shown together when the \"resultOutputMode\" option is set to \"merge\"."
},

     I don't really know more than that; it's probably safe to ignore, just set it to 0.
  8. These tags also refer to the tag_bank, but they are supposed to be related to the term, not the definition (see the first image). I don't think these see much use these days.
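
Putting the eight positions together with the LingQ card posted earlier, a conversion could look roughly like the sketch below. This is only one possible mapping, not an established one: the function name `lingq_card_to_entry` is hypothetical, and treating each hint's text as a plain-string definition, appending the fragment as an extra "Example:" definition, and leaving the reading and all tags empty are assumptions made for illustration.

```python
import json


def lingq_card_to_entry(card):
    """Map one LingQ card (a dict) onto the eight-position term entry."""
    # Assumption: each hint's text becomes one plain-string definition.
    definitions = [hint["text"] for hint in card.get("hints", [])]
    # Assumption: the fragment is appended as an extra plain-string definition.
    fragment = card.get("fragment")
    if fragment:
        definitions.append(f"Example: {fragment}")
    return [
        card["term"],  # 1. term
        "",            # 2. reading: left empty (assumed same as the term)
        "",            # 3. definition tags: none, the card has no part of speech
        "",            # 4. rule identifiers: empty, no deinflection assumed
        0,             # 5. score
        definitions,   # 6. definitions
        0,             # 7. sequence number
        "",            # 8. term tags
    ]


# Trimmed copy of the card from the earlier comment.
card = {
    "term": "εκφώνησής",
    "fragment": "διαδικασία προγραμματισμού της εκφώνησής σας, να",
    "hints": [{"locale": "en", "text": "of reading (aloud)"}],
}

bank = [lingq_card_to_entry(card)]
print(json.dumps(bank, ensure_ascii=False, indent=2))
```

The fragment could instead go into a structured definition if nicer formatting is wanted; plain strings are used here only to keep the sketch small.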

@StefanVukovic99 added the question label Jul 17, 2024