Consider adding more word lists #4

gamag · 2017-12-06T10:27:48Z

Having different sources for the word list might allow us to improve the quality of the dict by removing words that appear in only one of them and might therefore be wrong. This requires, however, that the word lists used were really created from different texts.
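A minimal sketch of this idea, assuming each source is already available as a plain list of words. The function name `confirmed_words`, the `min_sources` threshold, and the sample words are illustrative and not part of the build scripts:

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def confirmed_words(sources: Iterable[Iterable[str]], min_sources: int = 2) -> set[str]:
    """Return the words that occur in at least `min_sources` of the given lists."""
    counts: Counter[str] = Counter()
    for words in sources:
        # De-duplicate inside each source so a list cannot confirm its own words.
        counts.update({w.strip() for w in words if w.strip()})
    return {word for word, n in counts.items() if n >= min_sources}


# Three tiny made-up lists standing in for independent sources.
list_a = ["სახლი", "წიგნი", "წიგნიი"]  # the last entry is a deliberate misspelling
list_b = ["სახლი", "წიგნი", "მთა"]
list_c = ["სახლი", "მთა"]

print(sorted(confirmed_words([list_a, list_b, list_c])))
```

The misspelled entry appears in only one list and is dropped. This only helps if the lists do not share an underlying source, since shared text would make a word confirm itself.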

The following word lists could be analyzed. If they are really from "disjoint" sources and found to improve the quality of our dictionary, we could include them in the default build scripts (if there are no licensing problems).

gusbemacbe (Gustavo Reis) / (on Dropbox): from Google Keyboard, see Different dictionaries #2. This would confirm approx. 13700 words that are currently considered "too unsure", so they could be added to the dict. We have to check for copyright problems first.
akalongman (Avtandil Kikabidze) / geo-words: contains the word list from Kevin Scannell and words from the National Parliamentary Library of Georgia, among others. To use this list, we need to take those words out, either literally or by discounting them in the counts.
0xh3x (Giorgi Jvaridze) / scraped-words: again partly drawn from sources we already have.
sandrinio (Sandro Sukhitashvili) / Scraped / GeoWordsDatabase: sources unclear (maybe they are stated in the data files). In XML and JSON format, so not really convenient to use in bash scripts...
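Taking already-known words out of a candidate list can be sketched as a set difference. This assumes all lists are plain sequences of words; `split_by_known` and the sample data are hypothetical and only illustrate the overlap check:

```python
from __future__ import annotations

from typing import Iterable


def split_by_known(candidate: Iterable[str],
                   known_sources: Iterable[Iterable[str]]) -> tuple[set[str], set[str]]:
    """Split a candidate list into (words new to us, words we already have)."""
    known: set[str] = set().union(*known_sources)
    candidate_set = set(candidate)
    return candidate_set - known, candidate_set & known


# Made-up stand-ins for a candidate list and two sources already in use.
candidate = ["სახლი", "მთა", "ზღვა"]
source_one = ["სახლი"]
source_two = ["მთა", "წიგნი"]

new_words, duplicates = split_by_known(candidate, [source_one, source_two])
print(f"{len(new_words)} new, {len(duplicates)} already known")
```

Only the words in `new_words` add independent evidence. A large `duplicates` share suggests the candidate list was built from a source we already use.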

gusbemacbe · 2017-12-06T16:08:31Z

Would you like me to fork your project and contribute?

Use Sublime Text with a regex on Sandro's dictionary: select all the JSON brackets and commas, remove them, and then press Enter. For the XML, use a regex to select and remove all the string tags. Note that Atom does not handle selecting and removing in big files well.

Afterwards, use a compare plugin in Atom, Sublime Text, or Visual Studio Code to compare the dictionaries.
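As an alternative to editor regexes, the JSON files could be flattened with a short script. The layout of Sandro's files is not described in this thread, so this sketch makes no assumption about it and simply collects every string value it finds; `iter_strings`, `json_to_wordlist`, and the sample document are hypothetical:

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value nested anywhere inside parsed JSON data."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, dict):
        # Only values are collected; keys are treated as field names.
        for value in node.values():
            yield from iter_strings(value)


def json_to_wordlist(text: str) -> list[str]:
    """Return the sorted, de-duplicated strings contained in a JSON document."""
    words = {s.strip() for s in iter_strings(json.loads(text)) if s.strip()}
    return sorted(words)


# Made-up sample; the real files may be structured differently.
sample = '[{"word": "სახლი"}, {"word": "მთა"}, {"word": "სახლი"}]'
print("\n".join(json_to_wordlist(sample)))
```

The result is one word per line, which is easy to feed to bash tools. If the real files also store non-word strings (for example source names), those fields would have to be filtered out first.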

0xh3x · 2017-12-06T17:10:13Z

Regarding 0xh3x (Giorgi Jvaridze) / scraped-words:
It includes words scraped from these sources:

Georgian wiki taken from http://dumps.wikimedia.org/
http://buki.ge/library-list.html
http://lib.ge/
http://www.nplg.gov.ge/dlibrary/
http://forum.ge/

One thing to keep in mind is that the scraping was done 6 years ago. So even if some other word lists draw on the same sources, they might be more recent and contain more words.
blogpost with more info

gamag · 2017-12-06T21:06:26Z

> Would you like me to fork your project and contribute?

If you'd like to help, I'd be happy! However, comparing the word list with my dictionary doesn't make much sense. If you really want to go further than just removing the JSON formatting, you'll need to be able to run the build scripts (see README.md for a short description of the requirements). The comparison then has to be made against Bumbeishvili's words from the database and against Scannell's in words/. (Read the readme file in each directory; there is not much documentation, but it might help a little.) This comparison can be done automatically.

Before you make any effort to include a word list, make sure that it is published under a license that allows us to integrate it into the dict and distribute it under the MIT license (contact the author if needed, as is the case with sandrinio).

Sorry for the long text. I just want to make sure your work is not in vain in the end...

gusbemacbe · 2017-12-07T07:01:30Z

I checked GBoard on my phone. Its "About" page says it is under open-source licences: some components under OFL, others under BSD, and some under Apache. I extracted GBoard and found the licence file, which says that GBoard is under the OFL. I have attached GBoard's licence file for you.

LICENSE_OFL.txt

gamag · 2017-12-07T09:52:43Z

OFL = Open Font License - so probably GBoard contains some fonts under this license. The software itself seems to be proprietary, using libraries under BSD and Apache.

gusbemacbe · 2017-12-07T16:57:08Z

Ah, I was wrong. So I mean that all the libraries are under BSD and Apache licences. But I'll contact the Google team and pass their answer on to you, OK?

ottoshmidt · 2017-12-14T13:06:38Z

Greetings, a good initiative! I'll try to get involved in refining the dictionary.
One question: in the dictionary (ka_GE.dic) there are entries such as "ჰექტარი/NSNN". What does NSNN mean here? Is it needed?

gusbemacbe · 2017-12-14T13:17:53Z

> Greetings, a good initiative! I'll try to get involved in refining the dictionary.
> One question: in the dictionary (ka_GE.dic) there are entries such as "ჰექტარი/NSNN". What does NSNN mean here? Is it needed?

Yes, it is needed! "/NSNN" is the dictionary's "affix" (see ka.aff).

ottoshmidt · 2017-12-14T13:35:45Z

> Yes, it is needed! "/NSNN" is the dictionary's "affix" (see ka.aff).

I need to figure those things out.

gamag · 2017-12-17T15:00:58Z

man 5 hunspell describes how affix compression works.
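To make the entry format concrete, here is a minimal sketch that splits a .dic entry into its word and its flags. Whether "NSNN" means four one-character flags or two two-character flags depends on the FLAG setting in the .aff file, which this thread does not show, so the `flag_mode` parameter is an assumption the caller has to supply; `parse_dic_entry` is a hypothetical helper, not part of Hunspell:

```python
from __future__ import annotations


def parse_dic_entry(line: str, flag_mode: str = "short") -> tuple[str, list[str]]:
    """Split a Hunspell .dic entry such as 'word/FLAGS' into (word, flags).

    flag_mode mirrors the FLAG option of the .aff file:
    "short" = one character per flag (the default when FLAG is absent),
    "long"  = two characters per flag, "num" = comma-separated numbers.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty dictionary line")
    # Anything after the first whitespace (morphological fields) is ignored here.
    word, _, flags = fields[0].partition("/")
    if flag_mode == "short":
        flag_list = list(flags)
    elif flag_mode == "long":
        flag_list = [flags[i:i + 2] for i in range(0, len(flags), 2)]
    elif flag_mode == "num":
        flag_list = [f for f in flags.split(",") if f]
    else:
        raise ValueError(f"unknown flag mode: {flag_mode}")
    return word, flag_list


print(parse_dic_entry("ჰექტარი/NSNN"))
print(parse_dic_entry("ჰექტარი/NSNN", flag_mode="long"))
```

Each flag names a rule group in the .aff file that generates inflected forms from the stem, which is why removing the flags would shrink the set of words the dictionary accepts. The sketch does not handle escaped slashes inside words.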

ottoshmidt · 2017-12-19T14:03:26Z

spell.on.ge - have you been in contact with them? Maybe you could join forces?

gamag mentioned this issue Dec 6, 2017

Different dictionaries #2

Closed

gamag changed the title ~~Consider adding more word listst~~ Consider adding more word lists Dec 6, 2017
