Best practices for structuring tsv data prior to conversion #549

Victor20x · 2024-03-15T02:24:59Z

If I have raw data that I can get into tsv format in a spreadsheet program, and want to convert to slob and StarDict formats, is there a way to tell pyglossary that some items are alternative headwords and not free-standing entries?

Also, is there a standard practice for separating lines of the definition or is this just preference? I can experiment of course but if you have a second line (let's say the headword is a noun and the language requires unit words for nouns, which should appear somewhere in the entry) does this go on a new row of the tsv, or if not is it best to separate it with a carriage return or linefeed or maybe by putting the lines in html divs?

Thanks

ilius (Maintainer) · 2024-03-15T07:50:54Z

> If I have raw data that I can get into tsv format in a spreadsheet program, and want to convert to slob and StarDict formats, is there a way to tell pyglossary that some items are alternative headwords and not free-standing entries?

This is how we process CSV:

Column 1: main headword
Column 2: definition
Column 3 (optional): alternative headwords separated by ,

If you export from a spreadsheet, it should automatically quote everything that needs quoting (for example, if there are several alternative headwords, or if your definition contains a comma or newlines).
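
If you generate the file with a script instead of a spreadsheet, the same three-column layout can be sketched with Python's standard `csv` module. This is only an illustration: the sample entries and the `entries_to_csv` helper are made up for the example, and the default minimal quoting is assumed to match what a spreadsheet export produces.

```python
import csv
import io

# Hypothetical sample data: (main headword, definition, alternative headwords)
entries = [
    ("colour", "The property of reflecting light, e.g. red or blue.", ["color"]),
    ("go", "To move from one place to another.", ["goes", "went", "gone"]),
]


def entries_to_csv(rows):
    """Serialize rows as: main headword, definition, comma-joined alternates."""
    buf = io.StringIO()
    # The default dialect quotes a field only when it needs quoting
    # (for example when it contains a comma).
    writer = csv.writer(buf, lineterminator="\n")
    for headword, definition, alternates in rows:
        writer.writerow([headword, definition, ",".join(alternates)])
    return buf.getvalue()


csv_text = entries_to_csv(entries)
print(csv_text)
```

Note that the third column is quoted only when it holds more than one alternative headword, because that is when it contains a comma.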

> Also, is there a standard practice for separating lines of the definition or is this just preference? I can experiment of course but if you have a second line (let's say the headword is a noun and the language requires unit words for nouns, which should appear somewhere in the entry) does this go on a new row of the tsv, or if not is it best to separate it with a carriage return or linefeed or maybe by putting the lines in html divs?

Try pasting multi-line text into a spreadsheet cell and saving as CSV.
I just tried with LibreOffice Calc: it inserts the newlines into the CSV, but because the field is quoted, they won't mess up the file.
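
The effect of the quoting can be checked with a small round trip through Python's standard `csv` module. This is a sketch with a made-up entry (the headword, the "unit word" line and the alternate are all hypothetical); it only shows that a quoted field may span several physical lines and still be read back as a single record.

```python
import csv
import io

# Hypothetical entry whose definition has a second line for the unit word.
row = ["book", "a set of printed pages\nunit word: volume", "books"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
csv_text = buf.getvalue()

# The raw text occupies two physical lines...
physical_lines = csv_text.splitlines()

# ...but a CSV reader treats the quoted field as one value,
# so the file still contains exactly one record.
records = list(csv.reader(io.StringIO(csv_text, newline="")))

print(len(physical_lines), len(records))
```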

Another thing: if you use common HTML tags, pyglossary will detect them and change the entry format to HTML (when you convert to StarDict or slob). Otherwise it assumes the entry is plaintext.
This is the list of tags that are recognized: https://github.com/ilius/pyglossary/blob/master/pyglossary/entry.py#L157
We should probably document this somewhere.
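
To give a rough idea of what tag-based detection looks like, here is a simplified sketch. It is not pyglossary's code: the tag subset, the regular expression and the `guess_format` helper are all assumptions made for illustration, and the linked `entry.py` remains the authoritative list.

```python
import re

# ASSUMPTION: an illustrative subset of tags, not pyglossary's actual list.
ILLUSTRATIVE_TAGS = ("b", "i", "br", "div", "p", "span")

# Matches an opening or self-closing tag such as <br>, <br/> or <div class="x">.
_tag_pattern = re.compile(
    r"<(?:" + "|".join(ILLUSTRATIVE_TAGS) + r")(?:\s[^>]*)?/?>",
    re.IGNORECASE,
)


def guess_format(definition: str) -> str:
    """Return "html" if the text contains one of the illustrative tags, else "plain"."""
    return "html" if _tag_pattern.search(definition) else "plain"


print(guess_format("a set of printed pages<br>unit word: volume"))
print(guess_format("a set of printed pages"))
```

Under this heuristic, separating definition lines with a tag like `<br>` or `<div>` would mark the entry as HTML, while a bare newline would leave it as plaintext.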

If you have long definitions or translations, a spreadsheet may not be the best tool for editing them.
And the quoting makes the CSV hard to edit manually.
I would suggest looking into dictfile:
https://pgaskin.net/dictutil/dictgen/#dictfile-format

Victor20x Mar 19, 2024
Author

Great, many thanks.
