Commit a480804

Update FAQ
1 parent ae05d86 commit a480804

1 file changed: +73 -35 lines changed

docs/faq.rst

Lines changed: 73 additions & 35 deletions
@@ -4,7 +4,7 @@ FAQ
 Is Wn related to the NLTK's `nltk.corpus.wordnet` module?
 ---------------------------------------------------------
 
-Only in spirit. There was an effort to develop the NLTK's module as a
+Only in spirit. There was an effort to develop the `NLTK`_\ 's module as a
 standalone package (see https://github.com/nltk/wordnet/), but
 development had slowed. Wn has the same broad goals and a similar API
 as that standalone package, but fundamental architectural differences
@@ -19,51 +19,89 @@ Is Wn compatible with the NLTK's module?
 The API is intentionally similar, but not exactly the same (for
 instance see the next question), and there are differences in the ways
 that results are retrieved, particularly for non-English wordnets. See
-:doc:`guides/nltk-migration` for more information.
+:doc:`guides/nltk-migration` for more information. Also see
+:ref:`princeton-wordnet`.
 
 Where are the ``Lemma`` objects? What are ``Word`` and ``Sense`` objects?
 -------------------------------------------------------------------------
 
-While senses are essentially links between words (also called "lexical
-entries") and synsets, they may contain metadata and be the source or
-target of sense relations, so in some ways they are more like nodes
-than edges when the wordnet is viewed as a graph. The NLTK chose to
-conflate words and senses into a single object called a ``Lemma``, but
-Wn keeps them separate. Wn also has an unrelated concept called a
-"lemma", but it is merely the canonical form of a word.
+Unlike the `WNDB`_ data format of the original WordNet, the
+`WN-LMF`_ XML format grants words (called *lexical entries* in WN-LMF
+and a :class:`~wn.Word` object in Wn) and word senses
+(:class:`~wn.Sense` in Wn) explicit, first-class status alongside
+synsets. While senses are essentially links between words and
+synsets, they may contain metadata and be the source or target of
+sense relations, so in some ways they are more like nodes than edges
+when the wordnet is viewed as a graph. The `NLTK`_\ 's module, using
+the WNDB format, combines the information of a word and a sense into a
+single object called a ``Lemma``. Wn also has an unrelated concept
+called a :meth:`~wn.Word.lemma`, but it is merely the canonical form
+of a word.
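
The word/sense/synset split described in the added paragraph can be illustrated with a small standalone model. This is only a sketch using plain dataclasses: the class names mirror the concepts in the text, but the fields, IDs, and the ``antonym`` relation are assumptions made for illustration and are not Wn's actual classes or storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    id: str


@dataclass
class Word:
    id: str
    forms: List[str]  # assumption: the first form is the canonical one

    def lemma(self) -> str:
        # The "lemma" is merely the canonical form of the word.
        return self.forms[0]


@dataclass
class Sense:
    # A sense links one word to one synset, but it is a node in its own
    # right: it can carry metadata and be the source or target of relations.
    id: str
    word: Word
    synset: Synset
    metadata: Dict[str, str] = field(default_factory=dict)
    relations: Dict[str, List["Sense"]] = field(default_factory=dict)


hot = Word("w-hot", ["hot", "hotter", "hottest"])
cold = Word("w-cold", ["cold", "colder", "coldest"])
hot_sense = Sense("s-hot-1", hot, Synset("ss-hot"))
cold_sense = Sense("s-cold-1", cold, Synset("ss-cold"))

# A sense relation holds between two senses, not between words or synsets.
hot_sense.relations.setdefault("antonym", []).append(cold_sense)
```

An object that conflates ``Word`` and ``Sense`` would have nowhere natural to attach the sense-level relation above without also duplicating the word-level forms.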
+
+.. _princeton-wordnet:
+
+Where is the Princeton WordNet data?
+------------------------------------
+
+The original English wordnet, named simply *WordNet* but often
+referred to as the *Princeton WordNet* to better distinguish it from
+other projects, is specifically the data distributed by Princeton in
+the `WNDB`_ format. The `Open Multilingual Wordnet <OMW_>`_ (OMW)
+packages an export of the WordNet data as the *OMW English Wordnet
+based on WordNet 3.0* which is used by Wn (with the lexicon ID
+``omw-en``). It also has a similar export for WordNet 3.1 data
+(``omw-en31``). Both of these are highly compatible with the original
+data and can be used as drop-in replacements.
+
+Prior to Wn version 0.9 (and, correspondingly, prior to the `OMW
+data`_ version 1.4), the ``pwn:3.0`` and ``pwn:3.1`` English wordnets
+distributed by OMW were incorrectly called the *Princeton WordNet*
+(for WordNet 3.0 and 3.1, respectively). From Wn version 0.9 (and from
+version 1.4 of the OMW data), these are called the *OMW English
+Wordnet based on WordNet 3.0/3.1* (``omw-en:1.4`` and
+``omw-en31:1.4``, respectively). These lexicons are intentionally
+compatible with the original WordNet data, and the 1.4 versions are
+even more compatible than the previous ``pwn:3.0`` and ``pwn:3.1``
+lexicons, so it is strongly recommended to use them over the previous
+versions.
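
The renaming described above amounts to a fixed mapping from the old lexicon specifiers to the new ones. The sketch below encodes that mapping in plain Python; ``split_specifier`` and ``upgrade_specifier`` are hypothetical helper names invented here for illustration and are not part of Wn's API. Only the four specifier strings come from the text.

```python
from typing import Optional, Tuple

# Old specifier -> recommended replacement, as stated in the FAQ text.
LEGACY_TO_CURRENT = {
    "pwn:3.0": "omw-en:1.4",
    "pwn:3.1": "omw-en31:1.4",
}


def split_specifier(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'id:version' into (id, version); version is None if absent."""
    lexicon_id, _, version = spec.partition(":")
    return lexicon_id, version or None


def upgrade_specifier(spec: str) -> str:
    """Return the recommended specifier, leaving unknown ones untouched."""
    return LEGACY_TO_CURRENT.get(spec, spec)
```

A script that still hard-codes ``pwn:3.0`` could pass its specifier through ``upgrade_specifier`` before selecting a lexicon, on the assumption that the newer OMW data has been installed.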
+
+.. _OMW data: https://github.com/omwn/omw-data
 
 Why don't all wordnets share the same synsets?
 ----------------------------------------------
 
-The `Open Multilingual Wordnet <https://lr.soh.ntu.edu.sg/omw/omw>`_
-(OMW) contains wordnets for many languages created using the *expand*
-methodology [VOSSEN1998]_, where non-English wordnets provide words on
-top of the Princeton WordNet's synset structure. This allows new
-wordnets to be built in much less time than starting from scratch, but
-with a few drawbacks, such as that words cannot be added if they do
-not have a synset in the Princeton WordNet, and that it is difficult
-to version the wordnets independently (e.g., for reproducibility of
-experiments involving wordnet data) as all are interconnected. Wn,
-therefore, creates new synsets for each wordnet added to its database,
-and synsets then specify which resource they belong to. Queries can
-specify which resources may be examined. Also see
-:doc:`guides/interlingual`.
+The `Open Multilingual Wordnet <OMW_>`_ (OMW) contains wordnets for
+many languages created using the *expand* methodology [VOSSEN1998]_,
+where non-English wordnets provide words on top of the English
+wordnet's synset structure. This allows new wordnets to be built in
+much less time than starting from scratch, but with a few drawbacks,
+such as that words cannot be added if they do not have a synset in the
+English wordnet, and that it is difficult to version the wordnets
+independently (e.g., for reproducibility of experiments involving
+wordnet data) as all are interconnected. Wn, therefore, creates new
+synsets for each wordnet added to its database, and synsets then
+specify which resource they belong to. Queries can specify which
+resources may be examined. Also see :doc:`guides/interlingual`.
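
The design described above, where each wordnet gets its own synsets tagged with the resource they belong to, can be sketched with a toy in-memory table. Everything here is an assumption made for illustration: the ``toy-en`` and ``toy-es`` lexicons, the synset IDs, and the ``synsets`` function are invented and do not reflect Wn's database schema or query API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Synset:
    id: str
    lexicon: str  # the resource this synset belongs to
    words: Tuple[str, ...]


# Each lexicon owns its synsets, so the Spanish toy lexicon can hold a
# concept ("sobremesa") that has no synset in the English one.
DATABASE = [
    Synset("en-0001", "toy-en:1.0", ("dog",)),
    Synset("es-0001", "toy-es:1.0", ("perro",)),
    Synset("es-0002", "toy-es:1.0", ("sobremesa",)),
]


def synsets(lexicons: Optional[Iterable[str]] = None) -> List[Synset]:
    """Return synsets, optionally restricted to the given lexicons."""
    if lexicons is None:
        return list(DATABASE)
    allowed = set(lexicons)
    return [s for s in DATABASE if s.lexicon in allowed]
```

Because every synset carries its lexicon, a query can be restricted to one resource (``synsets(["toy-es:1.0"])``), and each lexicon can be versioned without touching the others. The cost is duplicated structure, which the next question discusses.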
 
 Why does Wn's database get so big?
 ----------------------------------
 
-The Princeton WordNet 3.0 takes about 104 MiB of disk space in Wn's
-database, which is only about 6 MiB more than it takes as a `WN-LMF
-XML <https://globalwordnet.github.io/schemas/>`_ file. The NLTK,
-however, uses the obsolete WNDB format which is more compact,
-requiring only 35 MiB of disk space. The difference with the Open
-Multilingual Wordnet 1.3 is more striking: it takes about 466 MiB of
-disk space in the database, but only 49 MiB in the NLTK. Part of the
-difference here is that the OMW files in the NLTK are simple
-tab-separated-value files listing only the words added to each synset
-for each language. In addition, Wn creates new synsets for each
-wordnet added (see the previous question). One more reason is that Wn
-creates various indexes in the database for efficient lookup.
-
+The *OMW English Wordnet based on WordNet 3.0* takes about 114 MiB of
+disk space in Wn's database, which is only about 8 MiB more than it
+takes as a `WN-LMF`_ XML file. The `NLTK`_, however, uses the obsolete
+`WNDB`_ format which is more compact, requiring only 35 MiB of disk
+space. The difference with the Open Multilingual Wordnet 1.4 is more
+striking: it takes about 659 MiB of disk space in the database, but
+only 49 MiB in the NLTK. Part of the difference here is that the OMW
+files in the NLTK are simple tab-separated-value files listing only
+the words added to each synset for each language. In addition, Wn
+creates new synsets for each wordnet added (see the previous
+question). One more reason is that Wn creates various indexes in the
+database for efficient lookup.
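
The sizes quoted in the added paragraph can be turned into ratios with a few lines of arithmetic. The figures below are the approximate values from the text; the variable names are chosen here for illustration, and the results are only as precise as those rounded inputs.

```python
# Approximate sizes in MiB, as quoted in the FAQ text.
omw_en_db = 114      # OMW English Wordnet (WordNet 3.0) in Wn's database
xml_overhead = 8     # how much larger the database is than the WN-LMF XML
omw_en_nltk = 35     # the same data in WNDB format, as used by the NLTK
omw_all_db = 659     # Open Multilingual Wordnet 1.4 in Wn's database
omw_all_nltk = 49    # the OMW data as distributed with the NLTK

# Implied size of the WN-LMF XML file.
omw_en_xml = omw_en_db - xml_overhead

# How many times larger Wn's database is than the NLTK's files.
en_ratio = omw_en_db / omw_en_nltk
omw_ratio = omw_all_db / omw_all_nltk

print(f"WN-LMF XML: about {omw_en_xml} MiB")
print(f"English wordnet: about {en_ratio:.1f}x the NLTK size")
print(f"OMW 1.4: about {omw_ratio:.1f}x the NLTK size")
```

On these figures the English wordnet alone is roughly three times larger in Wn's database than in the NLTK, while the full OMW collection is roughly thirteen times larger, which is the "more striking" difference the text refers to.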
 
+.. _NLTK: https://www.nltk.org/
+.. _OMW: http://github.com/omwn
 .. [VOSSEN1998] Piek Vossen. 1998. *Introduction to EuroWordNet.* Computers and the Humanities, 32(2): 73--89.
+.. _Open English Wordnet 2021: https://en-word.net/
+.. _WNDB: https://wordnet.princeton.edu/documentation/wndb5wn
+.. _WN-LMF: https://globalwordnet.github.io/schemas/
