The model trained for a non-English language is converting the single lower case "i" into upper case "I" #61

preniqivjosa · 2020-07-17T09:15:15Z

Hi,
I am using the punctuator2 library to train a model for the Albanian language, an Indo-European language with a Latin-derived alphabet.

I use 206,000 articles from an Albanian magazine, so my corpus is large enough to train the model.
I have successfully trained the model and I am satisfied with the results. However, when I test the model on random text, it converts every standalone lowercase "i" into uppercase "I". In Albanian, a standalone "i" within a sentence is a conjunction that should be written in lowercase. This made me think that the model is somehow using something pre-trained or hardcoded for English that I am not aware of.

I checked the code (data.py, models.py and main.py) but could not find anything hardcoded for this, apart from the "We.pcl" file referenced in the code, which does not exist on my path since I do not use it.
Do you have any suggestions or ideas about why this is happening?

ottokart · 2020-07-17T12:56:49Z

Hi,

Are you using the convert_to_readable.py or demo_play_with_model.py scripts? These two convert the first letter of the first word in each sentence to uppercase ("Title"-case, or .title() in Python).
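
For illustration, sentence-initial capitalization of this kind can be sketched as below. This is a minimal sketch, not code taken from punctuator2: the helper name `capitalize_sentence_starts`, the sentence-boundary regex, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical helper for illustration only; not part of punctuator2.
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"([.!?]\s+)")


def capitalize_sentence_starts(text: str) -> str:
    """Uppercase only the first letter of each sentence.

    All other words, including a standalone lowercase "i", are left as they are.
    """
    # With a capturing group, re.split keeps the delimiters:
    # sentence bodies land at even indices, delimiters at odd indices.
    parts = SENTENCE_END.split(text)
    return "".join(
        part[:1].upper() + part[1:] if index % 2 == 0 else part
        for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    sample = "libri i ri doli sot. i pari u shit shpejt."
    print(capitalize_sentence_starts(sample))
```

With this approach a mid-sentence "i" stays lowercase and only a sentence-initial "i" becomes "I". By contrast, calling Python's `str.title()` on a whole sentence capitalizes every word in it, so a test script that does that would also turn each standalone "i" into "I".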

preniqivjosa · 2020-07-17T13:51:16Z

Hi @ottokart,
Thank you for the reply!
I was using a different script created for testing the model, but the problem is solved when using demo_play_with_model.py.

preniqivjosa changed the title ~~The model trained for non-English language is converting lower case "i" into upper case "I"~~ The model trained for a non-English language is converting lower case "i" into upper case "I" Jul 17, 2020

preniqivjosa changed the title ~~The model trained for a non-English language is converting lower case "i" into upper case "I"~~ The model trained for a non-English language is converting the single lower case "i" into upper case "I" Jul 17, 2020
