Help: Code that fixes the issue of inflections on Kindle dictionaries #1

pyccp opened this issue Jun 2, 2022 · 7 comments

pyccp commented Jun 2, 2022

Hello Hannes,

I've been trying to create a decent Kindle dictionary based on Wiktionary [English-Finnish] for some time, and I ran into the issue of inflections clashing with headwords when I tried to create the dictionary the way you did. I even created a thread on MobileRead two days ago about this topic.

I see that you have a solution, an algorithm that fixes this issue. Can you please explain how you solved it? I've read the document titled "The stupid kindle algorithm", where you mention that it can be fixed with three lines of code. Can you please share this code and explain how to use it?

Owner

Vuizur commented Jun 3, 2022

Hello,

My solution is here. Note that you need to replace one function and one constant in pyglossary, as described at the top of the file.
It is also possible that, instead of unidecode, you need another function to strip the diacritics that the Kindle fuzzy search algorithm ignores (I used unidecode because it works for Spanish, but it probably doesn't work for many other languages).
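
For readers who want to try an alternative to unidecode, here is a minimal sketch that only removes combining marks using the standard library. The function name strip_diacritics is hypothetical and not part of pyglossary, and it is an assumption that this matches what the Kindle search actually ignores for a given language, so check it against real look-ups.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, umlauts) but keep the base letters.

    Hypothetical helper: unlike unidecode, it does not transliterate
    characters that have no decomposition (for example "ø" or "ß").
    """
    # NFD splits a precomposed letter such as "ä" into "a" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for the accent marks we drop.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("päätti", "siltä", "sääri"):
    print(word, "->", strip_diacritics(word))
```

With Finnish input this maps päätti to paatti, which is exactly the kind of collision between an inflected form and another headword that this issue is about.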

The best future scenario would be if we could add this to pyglossary directly.

Owner

Vuizur commented Jul 26, 2022

Here is a project that converts a tabfile to a fixed kindle dictionary: https://github.com/Vuizur/pyglossary-kindle-test/tree/master/pyglossary_kindle_test
Install it using poetry install and execute poetry run python ./pyglossary_kindle_test/edit_dictionary.py. In the script file you can specify the kindlegen folder and your tabfile path.

Author

pyccp commented Aug 31, 2022

I created a short txt file for Finnish-English (tab-separated, with | between a headword and its inflections) and tested it. It works brilliantly, though some of the words show up twice in the dictionary. I suspect this is expected behavior. Thank you.

Checked words below showed up twice in the dictionary:


  • paatti <p>ship or boat</p>
  • päättää|päätti <p>to decide; to choose</p>
  • silta <p>bridge</p>
  • se|siltä <p>it</p>
  • saari <p>island</p>
  • sääri|sääret <p>shin</p>
  • tuli <p>fire</p>
  • tulla|tuli <p>to come</p>
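
The doubled entries in this list look like cases where an inflected form, once its diacritics are ignored, equals another headword. Here is a rough sketch for finding such pairs in a tabfile before building the dictionary. It assumes each line is headword|inflection|...<TAB>definition, as in the sample above, and that the Kindle compares words roughly as the fold function does. Both are assumptions, and the names fold, parse_tab_line and find_clashes are made up for this example.

```python
import unicodedata
from collections import defaultdict

# Sample lines taken from the list above (tab between forms and definition).
SAMPLE = [
    "paatti\t<p>ship or boat</p>",
    "päättää|päätti\t<p>to decide; to choose</p>",
    "silta\t<p>bridge</p>",
    "se|siltä\t<p>it</p>",
    "saari\t<p>island</p>",
    "sääri|sääret\t<p>shin</p>",
    "tuli\t<p>fire</p>",
    "tulla|tuli\t<p>to come</p>",
]


def fold(word: str) -> str:
    """Approximate a diacritic-insensitive comparison key (an assumption)."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_tab_line(line: str):
    """Split 'headword|infl1|infl2<TAB>definition' into its parts."""
    forms, definition = line.rstrip("\n").split("\t", 1)
    headword, *inflections = forms.split("|")
    return headword, inflections, definition


def find_clashes(lines):
    """Return (inflection, its headword, other headword) for each clash."""
    entries = [parse_tab_line(line) for line in lines if line.strip()]
    headwords_by_key = defaultdict(set)
    for headword, _, _ in entries:
        headwords_by_key[fold(headword)].add(headword)

    clashes = []
    for headword, inflections, _ in entries:
        for inflection in inflections:
            for other in sorted(headwords_by_key.get(fold(inflection), ())):
                if other != headword:
                    clashes.append((inflection, headword, other))
    return clashes


for inflection, headword, other in find_clashes(SAMPLE):
    print(f"{inflection} (form of {headword}) clashes with headword {other}")
```

For the eight sample lines this reports päätti/paatti, siltä/silta and tuli/tuli. The pair saari/sääri is not reported because there the two headwords themselves fold to the same key, and this sketch only looks at inflections.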

pyccp closed this as completed Aug 31, 2022
Author

pyccp commented Sep 1, 2022

With great regret, anguish and disappointment, I must tell you that after building a Finnish-English dictionary many times from a 143 250-line txt file, the look-ups for inflected forms failed almost entirely.

I checked the txt file to see if anything was wrong with the formatting and also checked the xhtml files, and the inflected forms were recorded inside infl tags.

Look-ups for inflections only worked when the txt file was small. For example, I selected 35 words out of the 143 250, created a 35-line txt file to make a dictionary, and it worked.

I also always get this message at the beginning when I run the poetry run python ./pyglossary_kindle_test/edit_dictionary.py command:

No module named 'pyglossary.plugin_lib.py310'

pyccp reopened this Sep 1, 2022
Owner

Vuizur commented Sep 3, 2022

The error message No module named 'pyglossary.plugin_lib.py310' should not be a problem. In my experience pyglossary still works the same when it is thrown.

If I understand correctly, you tried to use these dictionaries on a Kindle? I think it might have problems with the huge number of inflections. I would try re-running the program as described in the README.md of this repo, with the option try_to_fix_failed_inflections set to False. Maybe the Kindle will work better with that one.

Author

pyccp commented Sep 3, 2022

Yes, I am sending the created dictionaries to my Kindle e-reader to test them. The same 143 250-entry dictionary that I am trying to fix is already on my e-reader, and the device handles it with all inflections. I created that one using mobigen (much faster than kindlegen), and it is about 10 MB. But of course the inflections are messed up because of the Kindle algorithm, and headwords clash with inflections.

I later created sample dictionaries with 100, 1000, and 10 000 entries using the pyglossary-kindle-test repository, and all seemed to work fine. I also created a Spanish-English dictionary using the En-Es.txt file that comes with that repo, and this one, too, worked fine. But not the Finnish dictionary with all entries included.
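
For anyone who wants to repeat this size test, a small sketch for cutting samples of increasing size out of the full tabfile could look like the following. The function name write_sample and the file names in the usage comment are placeholders, and it simply takes the first N lines rather than a random selection.

```python
from itertools import islice
from pathlib import Path


def write_sample(source: Path, target: Path, n_entries: int) -> int:
    """Copy the first n_entries lines of a tabfile and return how many were written."""
    written = 0
    # newline="" keeps the original line endings untouched.
    with source.open(encoding="utf-8", newline="") as src, \
            target.open("w", encoding="utf-8", newline="") as dst:
        for line in islice(src, n_entries):
            dst.write(line)
            written += 1
    return written


# Hypothetical usage; "fi-en.txt" is a placeholder for the full tabfile:
# for size in (100, 1000, 10_000, 50_000):
#     write_sample(Path("fi-en.txt"), Path(f"fi-en-{size}.txt"), size)
```

Building a dictionary from each sample and testing inflected look-ups on the device would narrow down the size at which they start to fail.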

A moment ago I tried the ebook_dictionary_creator repo to create a Finnish dictionary and got this error:

```
Traceback (most recent call last):
  File "C:\Users\user\Desktop\Py Project\Project Dictio\trial\dictio.py", line 5, in <module>
    dict_creator.create_database()
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\ebook_dictionary_creator\e_dictionary_creator\dictionary_creator.py", line 64, in create_database
    create_database.create_database(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\ebook_dictionary_creator\database_creator\create_database.py", line 728, in create_database
    obj = json.loads(line)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 266 (char 265)
Traceback locals:
    self = <json.decoder.JSONDecoder object at 0x0000020371ABFE50>
    s = '{"pos": "noun", "head_templates": [{"name": "head", "args": {"1": "f...
    len(s) = 277
    idx = 0
```
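
The traceback shows json.loads(line) failing on a line that is only 277 characters long and ends inside a string, so one possibility (an assumption, not something confirmed in this thread) is that the input file was cut off, for example by an incomplete download. A small diagnostic sketch that reports which lines of a JSON Lines file fail to parse is shown below. The function name find_bad_json_lines and the path in the usage comment are hypothetical, and this is not part of ebook_dictionary_creator.

```python
import json
from pathlib import Path


def find_bad_json_lines(path: Path, limit: int = 10) -> list[tuple[int, str]]:
    """Return (line number, error message) for lines that are not valid JSON."""
    bad = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as error:
                bad.append((number, str(error)))
                if len(bad) >= limit:
                    break  # stop early, a few examples are enough
    return bad


# Hypothetical usage; replace the path with the downloaded data file:
# for number, message in find_bad_json_lines(Path("finnish-extract.json")):
#     print(f"line {number}: {message}")
```

If the only bad line is the very last one, downloading the file again would be the first thing to try.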

Vuizur self-assigned this Feb 25, 2023
Owner

Vuizur commented Feb 25, 2023

> I later created sample dictionaries with 100, 1000, and 10 000 entries using the pyglossary-kindle-test repository, and all seemed to work fine. I also created a Spanish-English dictionary using the En-Es.txt file that comes with that repo, and this one, too, worked fine. But not the Finnish dictionary with all entries included.

I think Finnish simply has too many inflections for that terrible kindlegen program. If you hit some completely arbitrary limit, it refuses to work correctly, and it gives you no hint on how to fix this or which entry exactly is responsible for the failure. Very bad software, but unfortunately I know of no solution.
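
To get a feeling for how inflection-heavy a tabfile is, a quick count like the following might help when comparing the Finnish file with one that works, such as En-Es.txt. It assumes the headword|inflection<TAB>definition layout used earlier in this thread. The function name inflection_stats is made up, and the numbers say nothing about where kindlegen's limits actually are.

```python
from pathlib import Path


def inflection_stats(tabfile: Path) -> dict[str, int]:
    """Count entries, inflected forms, and the largest number of forms per entry."""
    entries = 0
    inflections = 0
    largest = 0
    with tabfile.open(encoding="utf-8") as handle:
        for line in handle:
            # Everything before the first tab is 'headword|infl1|infl2|...'.
            forms = line.split("\t", 1)[0].strip()
            if not forms:
                continue
            count = len(forms.split("|")) - 1  # forms other than the headword
            entries += 1
            inflections += count
            largest = max(largest, count)
    return {"entries": entries, "inflections": inflections, "max_per_entry": largest}


# Hypothetical usage; the file name is a placeholder:
# print(inflection_stats(Path("fi-en.txt")))
```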

Hmm, I tried creating a Finnish dictionary on my Windows system, and here it worked well. So I would need your system and Python version info to try to replicate it.
