This repository has been archived by the owner on Oct 8, 2022. It is now read-only.
NAME
Lingua::RU::OpenCorpora::Tokenizer - tokenizer for OpenCorpora project
SYNOPSIS
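use Lingua::RU::OpenCorpora::Tokenizer;
my $tokenizer = Lingua::RU::OpenCorpora::Tokenizer->new({});
my $tokens = $tokenizer->tokens($text);
my $bounds = $tokenizer->bounds($text);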
DESCRIPTION
This module tokenizes input text in the Russian language.
Note that it uses a probabilistic algorithm rather than trying to parse
the language. It also uses pre-calculated data freely provided by the
OpenCorpora project.
NOTE: OpenCorpora periodically provides updates for this data. Check out
the "opencorpora-update-tokenizer" script that comes with this distribution.
The algorithm is as follows:
1. Split the text into chars.
2. Iterate over the chars from left to right.
3. For every char, get its context (see CONTEXT).
4. Look up the likelihood for the context in the vectors file (see
"VECTORS FILE") or use the default value of 0.5.
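The steps above can be sketched in plain Perl as shown below. This is an illustration only, not the module's implementation: the char_class and context_key helpers and the %likelihood_for table are hypothetical stand-ins for the real context calculation and the vectors file, whose formats are not described in this README.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for the vectors file: context key => likelihood.
my %likelihood_for = (
    'letter|space' => 1,
    'letter|punct' => 1,
    'punct|space'  => 1,
);

# Hypothetical char classifier used to build a context key.
sub char_class {
    my ($char) = @_;
    return 'end'    unless defined $char;
    return 'space'  if $char =~ /\s/;
    return 'letter' if $char =~ /\w/;
    return 'punct';
}

# Hypothetical context: the class of the current char and of the next one.
sub context_key {
    my ($chars, $i) = @_;
    return join '|', char_class($chars->[$i]), char_class($chars->[$i + 1]);
}

sub sketch_bounds {
    my ($text) = @_;
    my @chars = split //, $text;                        # step 1
    my @bounds;
    for my $i (0 .. $#chars) {                          # step 2
        my $key        = context_key(\@chars, $i);      # step 3
        my $likelihood = $likelihood_for{$key} // 0.5;  # step 4
        push @bounds, [$i, $likelihood];
    }
    return \@bounds;
}

# Print "position<TAB>likelihood" for every char of a short sample.
my $bounds = sketch_bounds('Да, он.');
printf "%d\t%s\n", @$_ for @$bounds;
```

Contexts missing from the table fall back to 0.5, that is, "no evidence either way", which is why only contexts listed in the table reach a likelihood of 1 in this sketch.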
CONTEXT
See Lingua::RU::OpenCorpora::Tokenizer::Context.
VECTORS FILE
Contains a list of vectors, each with a likelihood value: the chance
that a given vector marks a token boundary.
Built by the OpenCorpora project from a semi-automatically annotated corpus.
HYPHENS FILE
Contains a list of hyphenated Russian words. Used in vector
calculations.
Built by the OpenCorpora project from a semi-automatically annotated corpus.
EXCEPTIONS FILE
Contains a list of char sequences that are not subject to tokenization.
Built by the OpenCorpora project from a semi-automatically annotated corpus.
PREFIXES FILE
Contains a list of common prefixes of compound words.
Built by the OpenCorpora project from a semi-automatically annotated corpus.
NOTE: all files are stored as GZip archives and are not supposed to be
edited manually.
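If you only want to inspect such a file, it can be read without unpacking it to disk. The sketch below uses the core IO::Uncompress::Gunzip module. The read_gz_lines helper is hypothetical, and the one-entry-per-line layout of the sample file is an assumption made for the example, not a description of the real data format.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Read a gzipped text file line by line and return its non-empty lines.
# Assumption: the file holds one entry per line.
sub read_gz_lines {
    my ($path) = @_;
    my $gz = IO::Uncompress::Gunzip->new($path)
        or die "Cannot read $path: $GunzipError";
    my @lines;
    while (defined(my $line = $gz->getline)) {
        chomp $line;
        push @lines, $line if length $line;
    }
    $gz->close;
    return \@lines;
}

# Build a small hypothetical data file so the sketch runs on its own.
my ($fh, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $fh;
my $sample = "entry-one\nentry-two\n";
gzip(\$sample => $path) or die "gzip failed: $GzipError";

print "$_\n" for @{ read_gz_lines($path) };
```

Reading is all this sketch does on purpose: the files are generated by OpenCorpora, so local changes would be lost on the next update.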
METHODS
new($args)
Constructs and initializes a new tokenizer object.
Takes a hashref as an argument with the following keys:
data_dir
Path to a directory with OpenCorpora data. Optional. Defaults to
the distribution directory (see File::ShareDir).
prefixes, hyphens, exceptions, vectors
Data objects. Optional. You can provide any of those (or none of
them). Default is to create an object from the data that comes with
the distribution.
tokens($text [, $options])
Takes text as input and splits it into tokens. Returns a reference to an
array of tokens.
You can also pass a hashref with options as a second argument. Current
options:
threshold
Minimum likelihood value for a token boundary. Boundaries with a lower
likelihood are excluded from consideration.
The default value is 1, which makes the tokenizer split only when
it is confident.
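To see how the threshold changes the result, the sketch below splits a string using a hand-written list of [position, likelihood] pairs. The split_by_bounds helper and the sample pairs are hypothetical and are not real module output. The sketch also assumes that a boundary position is the index of the last char of a token and that whitespace-only tokens are dropped.

```perl
use strict;
use warnings;

# Split $text after every position whose likelihood reaches $threshold.
# Assumption: a boundary position is the index of the last char of a token.
sub split_by_bounds {
    my ($text, $bounds, $threshold) = @_;
    $threshold = 1 unless defined $threshold;    # documented default
    my @tokens;
    my $start = 0;
    for my $bound (@$bounds) {
        my ($pos, $likelihood) = @$bound;
        next if $likelihood < $threshold;        # below threshold: ignored
        push @tokens, substr($text, $start, $pos - $start + 1);
        $start = $pos + 1;
    }
    push @tokens, substr($text, $start) if $start < length $text;

    # Trim whitespace and drop tokens that were whitespace only.
    my @clean;
    for my $token (@tokens) {
        $token =~ s/^\s+|\s+$//g;
        push @clean, $token if length $token;
    }
    return \@clean;
}

# Hypothetical bounds for the text below.
my $text   = 'e-mail me';
my $bounds = [ [0, 0.3], [1, 0.3], [5, 1], [6, 1], [8, 1] ];

print join('|', @{ split_by_bounds($text, $bounds) }), "\n";       # e-mail|me
print join('|', @{ split_by_bounds($text, $bounds, 0.3) }), "\n";  # e|-|mail|me
```

With the default threshold of 1 the uncertain boundaries around the hyphen are skipped. Lowering the threshold to 0.3 accepts them and produces more, smaller tokens.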
tokens_bounds($text)
Takes text as input and finds the bounds of tokens in the text. It does
not split the text into tokens; it only marks where token boundaries
could be.
Returns an arrayref of arrayrefs. Each inner arrayref consists of two
elements: the boundary position in the text and its likelihood.
bounds($text)
Convenience alias for "tokens_bounds()".
TO DO
get rid of gzipped files
KNOWN BUGS
version 0.07 introduced a small regression in F1 score (using
OpenCorpora data)
SEE ALSO
Lingua::RU::OpenCorpora::Tokenizer::Updater
<http://mathlingvo.ru/nlpseminar/archive/s_49>
AUTHOR
OpenCorpora.org team <http://opencorpora.org>
LICENSE
This program is free software, you can redistribute it under the same
terms as Perl itself.