My initial concern was to find out how a tmesis is encoded in conllup. So I searched my favourite LASLA APN file for an occurrence of "ante <quam>" (the usual way to code a form of "antequam" split into two words; I have to use the < and > equivalents here because these signs do not show up in the preview if I use their ASCII form). It occurs, for instance, in line 137 of Caesar_BellumCivile_CaesBC1.APN (from the repository https://dataverse.uliege.be/dataset.xhtml?persistentId=doi:10.58119/ULG/QJJ0SA):
A03&0007# ante <quam> 1,2,2 0
The second half appears in line 144 as:
A03&0007ANTEQVAM <ante> quam 1,2,2 T
Then I went to sentence 7 in Caesar_BellumCivile_CaesBC1.conllup. The relevant part is: "... ante de ea re ad senatum referri quam dilectus tota Italia habiti...". The corresponding words "ante" and "quam" have the analyses:
15 ante ante ADV M Degree=Pos _ _ _ CitationHierarchy=Liber_1,Capitulum_2,Paragraphus_2 i CaesBC1-A-03-7 140 lilaLemma:89165 TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000140
...
22 quam quam SCONJ T _ _ _ _ CitationHierarchy=Liber_1,Capitulum_2,Paragraphus_2 i CaesBC1-A-03-7 147 lilaLemma:90912 TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000147
Surprisingly, the positions given for the LASLA file are 140 and 147, instead of 137 and 144. This is due to a second type of error, described below. In these two lines, we see that the third field (LEMMA) is now "ante" and "quam". The lilaLemma:90912 attached to "quam" does correspond to "antequam", but the spurious lemma "ante" = lilaLemma:89165 is an insult to the philologist who annotated this text at LASLA.
The shift in item numbering is due to the erroneous lemmatisation of one "re publica" and two "rei publicae" that appear in the file before the "antequam" I was looking for. As a matter of fact, "rei publicae" is a single item in APN/BPN:
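The mismatch can be checked mechanically. Below is a minimal sketch that reads the lilaLemma identifier and the token number at the end of the TokenURI from one conllup line and compares the latter with the APN line number. The function name `token_summary` is hypothetical, and splitting on whitespace is an assumption that holds for the lines pasted here, where no field contains a space.

```python
import re

# Line copied from the conllup excerpt above (token 15 of sentence 7).
LINE_ANTE = (
    "15 ante ante ADV M Degree=Pos _ _ _ "
    "CitationHierarchy=Liber_1,Capitulum_2,Paragraphus_2 i CaesBC1-A-03-7 140 "
    "lilaLemma:89165 "
    "TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/"
    "CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000140"
)


def token_summary(line):
    """Return form, lemma, lilaLemma id and TokenURI number of one line.

    Assumption: fields are whitespace-separated and contain no spaces.
    """
    fields = line.split()
    lila = re.search(r"lilaLemma:(\d+)", line)
    uri = re.search(r"TokenURI=\S*_t_(\d+)", line)
    return {
        "form": fields[1],
        "lemma": fields[2],
        "lila": int(lila.group(1)) if lila else None,
        "position": int(uri.group(1)) if uri else None,
    }


summary = token_summary(LINE_ANTE)
apn_line = 137  # line of "ante <quam>" in the APN file, as reported above
print(summary["lemma"], summary["lila"], summary["position"] - apn_line)
```

For the "ante" line this reports lemma "ante", lilaLemma 89165 and a shift of 3 with respect to the APN line.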
A03&0004RESPVBLICA rei publicae 1,1,2 A551
A03=0004QVE que 1,1,2 S
But it becomes two items in the conllup :
5 rei res NOUN A5 Case=Dat|Gender=Fem|InflClass=IndEurE|Number=Sing _ _ _ CitationHierarchy=Liber_1,Capitulum_1,Paragraphus_2 n5 CaesBC1-A-03-4 38 lilaLemma:121868 TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000038
6-7 publicaeque _ _ _ _ _ _ _ _ _ _ _
6 publicae publicus ADJ C1 Case=Dat|Degree=Pos|Gender=Fem|InflClass=IndEurA|Number=Sing _ _ _ CitationHierarchy=Liber_1,Capitulum_1,Paragraphus_2 n6 CaesBC1-A-03-4 39 lilaLemma:120358 TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000039
7 que que CCONJ S _ _ _ _ CitationHierarchy=Liber_1,Capitulum_1,Paragraphus_2 i CaesBC1-A-03-4 40 lilaLemma:131416 TokenURI=http://lila-erc.eu/data/corpora/Lasla/id/corpus/CaesarBellum%20Civile/Caesar_BellumCivile_CaesBC1.BPN_t_0000040
Here, no hint at all of the presence of the lemma "respublica" : just a noun "res" and an adjective "publicus".
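The numbering shift follows directly from this splitting: every APN item whose form has two words adds one extra token in the conllup, whereas the half of a tmesis written between < and > adds none. A small sketch of that arithmetic, with hypothetical function names and a toy list of forms (not the real file order):

```python
def count_words(form):
    """Number of tokens an APN form yields once split on spaces.

    Words in angle brackets (the other half of a tmesis) are not counted.
    """
    words = [w for w in form.split()
             if not (w.startswith("<") and w.endswith(">"))]
    return len(words)


def apn_to_conllup_positions(apn_forms, start=1):
    """Pair each APN item number with the number of its first conllup token."""
    pairs = []
    shift = 0
    for apn_pos, form in enumerate(apn_forms, start=start):
        pairs.append((apn_pos, apn_pos + shift))
        shift += count_words(form) - 1  # a two-word form adds one token
    return pairs


# Toy sequence, for illustration only.
toy = ["rei publicae", "que", "ante <quam>", "de", "<ante> quam"]
print(apn_to_conllup_positions(toy))
```

With three split items before the tmesis (one "re publica" and two "rei publicae"), the same reasoning gives 137 + 3 = 140 and 144 + 3 = 147.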
As v.2 of CoNLL-U allows space characters in the FORM field (https://universaldependencies.org/v2/conll-u.html), the proper encoding should probably have been:
5-6 rei publicaeque _ _ _
5 rei publicae respublica NOUN A5
6 que que CCONJ S
I have truncated the lines mainly because I don't know which "respublica" identifier I should choose: LiLa has two of them, 76484 (first decl.) and 122962 (invariable !?!).
What the hell are these conllup files? The title and the filenames suggest that they are a conversion of the original BPN files, but they are not! A mere unchecked relemmatisation of a reconstructed text?
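To see how many places would need the encoding proposed above, one could list every spot where a token lemmatised "res" is immediately followed by one lemmatised "publicus". This is only a sketch: the function name is hypothetical, the input is assumed to be (ID, FORM, LEMMA) triples already extracted from one sentence, and a genuine "res" followed by a genuine "publicus" would be reported too, so the hits still need checking against the APN/BPN file.

```python
def find_respublica_splits(rows):
    """Return (ID of first token, joined form) for each res + publicus pair.

    rows: list of (token_id, form, lemma) tuples in sentence order.
    Multiword range lines such as "6-7" are skipped.
    """
    words = [r for r in rows if "-" not in r[0]]
    hits = []
    for first, second in zip(words, words[1:]):
        if first[2] == "res" and second[2] == "publicus":
            hits.append((first[0], first[1] + " " + second[1]))
    return hits


# ID, FORM and LEMMA taken from the four conllup lines quoted above.
rows = [
    ("5", "rei", "res"),
    ("6-7", "publicaeque", "_"),
    ("6", "publicae", "publicus"),
    ("7", "que", "que"),
]
print(find_respublica_splits(rows))
```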
Yours,
Philippe.
PS : "knowledge" is better with the "n" and "w" in the right order (twice in the readme file).