From 1dc92a606f874a4fe52603803364cc1d90f952fb Mon Sep 17 00:00:00 2001
From: taku CRF++ is a simple, customizable, and open source
- implementation of Conditional Random Fields (CRFs)
- for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as
- Named Entity Recognition, Information Extraction and Text Chunking.
-
- Both the training file and the test file need to be in a
- particular format for CRF++ to work properly.
- Generally speaking, training and test file must consist of
- multiple tokens. In addition, a token
- consists of multiple (but fixed-numbers) columns. The
- definition of tokens depends on tasks, however, in
- most of typical cases, they simply correspond to
- words. Each token must be represented in one line,
- with the columns separated by white space (spaces or
- tabular characters). A sequence of token becomes a
- sentence. To identify the boundary between
- sentences, an empty line is put. You can give as many columns as you like, however the
- number of columns must be fixed through all tokens.
- Furthermore, there are some kinds of "semantics" among the
- columns. For example, 1st column is 'word', second column
- is 'POS tag' third column is 'sub-category of POS' and so
- on. The last column represents a true answer tag which is going
- to be trained by CRF. Here's an example of such a file: (data for CoNLL shared
- task) There are 3 columns for each token. The following data is invalid, since the number of
- columns of second and third are 2. (They have no POS
- column.) The number of columns should be fixed.
- As CRF++ is designed as a general purpose tool, you have to
- specify the feature templates in advance. This file describes
- which features are used in training and testing.
-
- Each line in the template file denotes one template.
- In each template, special macro %x[row,col] will be
- used to specify a token in the input data. row specfies the
- relative position from the current focusing token
- and col specifies the absolute position of the column.
- Here you can find some examples for the replacements
-CRF++: Yet Another CRF toolkit
-
- Introduction
-
- Table of contents
-
-
-
- Features
-
-
-
-
- News
-
-
- 2013-02-13: CRF++ 0.58 Released
-
-
-
-
-
- 2012-03-25
-
-
-
- 2012-02-24
-
-
-
-
- 2012-02-15: CRF++ 0.55
-
-
-
- 2010-05-16: CRF++ 0.54 Released
-
-
-
- 2009-05-06: CRF++ 0.5 Released
-
-
-
- 2009-04-19: CRF++ 0.52
-
-
-
- 2007-07-12: CRF++ 0.51
-
-
-
- 2007-12-09: CRF++ 0.50
-
-
-
- 2007-08-18: CRF++ 0.49
-
-
-
- 2007-07-07: CRF++ 0.48 Released
-
-
-
- 2007-03-07: CRF++ 0.47 Released
-
-
-
- 2007-02-12: CRF++ 0.46 Released
-
-
-
- 2006-11-26: CRF++ 0.45
-
-
-
- 2006-08-18: CRF++ 0.44
-
-
- 2006-08-07: CRF++ 0.43
-
-
-
- 2006-03-31: CRF++ 0.42
-
-
-
- 2006-03-30: CRF++ 0.41
-
-
-
- 2006-03-21: CRF++ 0.40
-
-
-
- 2005-10-29: CRF++ 0.3
-
-
-
- 2005-07-04: CRF++ 0.2
- Released
-
-
-
- 2005-05-28: CRF++ 0.1
- Released
-
-
- Download
-
-
-
-
- Source
-
-
-
- Binary package for MS-Windows
-
-
-
-
- Installation
-
-
-
-
-
-
-
-% ./configure
-% make
-% su
-# make install
-
- You can change default install path by using --prefix
- option of configure script.
- Try --help option for finding out other options.
- Usage
-
- Training and Test file formats
-
-
-He PRP B-NP
-reckons VBZ B-VP
-the DT B-NP
-current JJ I-NP
-account NN I-NP
-deficit NN I-NP
-will MD B-VP
-narrow VB I-VP
-to TO B-PP
-only RB B-NP
-# # I-NP
-1.8 CD I-NP
-billion CD I-NP
-in IN B-PP
-September NNP B-NP
-. . O
-
-He PRP B-NP
-reckons VBZ B-VP
-..
-
-
-
-
-
-
-He PRP B-NP
-reckons B-VP
-the B-NP
-current JJ I-NP
-account NN I-NP
-..
-
-
- Preparing feature templates
-
-
-Input: Data
-He PRP B-NP
-reckons VBZ B-VP
-the DT B-NP << CURRENT TOKEN
-current JJ I-NP
-account NN I-NP
-
-
-
-
-
-
-template
-expanded feature
-
-
-%x[0,0]
-the
-
-
-%x[0,1]
-DT
-
-
-%x[-1,0]
-reckons
-
-
-%x[-2,1]
-PRP
-
-
-%x[0,0]/%x[0,1]
-the/DT
-
-
-ABC%x[0,1]123
-ABCDT123
-
Note also that there are two types of templates. - The types are specified with the first character of templates. -
-- This is a template to describe unigram features. - When you give a template "U01:%x[0,1]", CRF++ automatically - generates a set of feature functions (func1 ... funcN) like: -
- --func1 = if (output = B-NP and feature="U01:DT") return 1 else return 0 -func2 = if (output = I-NP and feature="U01:DT") return 1 else return 0 -func3 = if (output = O and feature="U01:DT") return 1 else return 0 -.... -funcXX = if (output = B-NP and feature="U01:NN") return 1 else return 0 -funcXY = if (output = O and feature="U01:NN") return 1 else return 0 -...- -
- The number of feature functions generated by a template amounts to - (L * N), where L is the number of output classes and N is the - number of unique string expanded from the given template. -
- -- This is a template to describe bigram features. - With this template, a combination of the current output token and previous output token - (bigram) is automatically generated. Note that this type of template generates a total of - (L * L * N) distinct features, where L is the - number of output classes and N is the number - of unique features generated by the templates. - When the number of classes is large, this type of templates would produce - a tons of distinct features that would cause inefficiency both - in training/testing. -
- -- The words unigram/bigram are confusing, since a macro for unigram-features - does allow you to write word-level bigram like %x[-1,0]%x[0,0]. Here, - unigram and bigram features mean uni/bigrams of output tags.
--You also need to put an identifier in templates when relative positions of -tokens must be distinguished. -
--In the following case, the macro "%x[-2,1]" and "%x[1,1]" will be replaced -into "DT". But they indicates different "DT". -
--The DT B-NP -pen NN I-NP -is VB B-VP << CURRENT TOKEN -a DT B-NP -- -
To distinguish both two, put an unique identifier (U01: or U02:) in the -template:
--U01:%x[-2,1] -U02:%x[1,1] --
-In this case both two templates are regarded as different ones, as -they are expanded into different features, "U01:DT" and "U02:DT". -You can use any identifier whatever you like, but -it is useful to use numerical numbers to manage them, because they simply -correspond to feature IDs. -
- --If you want to use "bag-of-words" feature, in other words, -not to care the relative position of features, You don't need to -put such identifiers. -
- -Here is the template example for CoNLL 2000 shared task and Base-NP chunking -task. Only one bigram template ('B') is used. This means that -only combinations of previous output token and current token are -used as bigram features. The lines starting from # or empty lines are -discarded as comments
--# Unigram -U00:%x[-2,0] -U01:%x[-1,0] -U02:%x[0,0] -U03:%x[1,0] -U04:%x[2,0] -U05:%x[-1,0]/%x[0,0] -U06:%x[0,0]/%x[1,0] - -U10:%x[-2,1] -U11:%x[-1,1] -U12:%x[0,1] -U13:%x[1,1] -U14:%x[2,1] -U15:%x[-2,1]/%x[-1,1] -U16:%x[-1,1]/%x[0,1] -U17:%x[0,1]/%x[1,1] -U18:%x[1,1]/%x[2,1] - -U20:%x[-2,1]/%x[-1,1]/%x[0,1] -U21:%x[-1,1]/%x[0,1]/%x[1,1] -U22:%x[0,1]/%x[1,1]/%x[2,1] - -# Bigram -B -- - - - -
Use crf_learn command: -
-% crf_learn template_file train_file model_file --
-where template_file and train_file -are the files you need to prepare in advance. -crf_learn generates the trained model file in -model_file. -
- -crf_learn outputs the following information.
--CRF++: Yet Another CRF Tool Kit -Copyright(C) 2005 Taku Kudo, All rights reserved. - -reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. -Done! 1.94 s - -Number of sentences: 823 -Number of features: 1075862 -Number of thread(s): 1 -Freq: 1 -eta: 0.00010 -C: 1.00000 -shrinking size: 20 -Algorithm: CRF - -iter=0 terr=0.99103 serr=1.00000 obj=54318.36623 diff=1.00000 -iter=1 terr=0.35260 serr=0.98177 obj=44996.53537 diff=0.17161 -iter=2 terr=0.35260 serr=0.98177 obj=21032.70195 diff=0.53257 -iter=3 terr=0.23879 serr=0.94532 obj=13642.32067 diff=0.35138 -iter=4 terr=0.15324 serr=0.88700 obj=8985.70071 diff=0.34134 -iter=5 terr=0.11605 serr=0.80680 obj=7118.89846 diff=0.20775 -iter=6 terr=0.09305 serr=0.72175 obj=5531.31015 diff=0.22301 -iter=7 terr=0.08132 serr=0.68408 obj=4618.24644 diff=0.16507 -iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953 -- -
-There are 4 major parameters to control the training condition -
Here is the example where these two parameters are used.
--% crf_learn -f 3 -c 1.5 template_file train_file model_file --
Since version 0.45, CRF++ supports single-best MIRA training. -MIRA training is used when -a MIRA option is set. -
-% crf_learn -a MIRA template train.data model -CRF++: Yet Another CRF Tool Kit -Copyright(C) 2005 Taku Kudo, All rights reserved. - -reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. -Done! 1.92 s - -Number of sentences: 823 -Number of features: 1075862 -Number of thread(s): 1 -Freq: 1 -eta: 0.00010 -C: 1.00000 -shrinking size: 20 -Algorithm: MIRA - -iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000 -iter=1 terr=0.04710 serr=0.49818 act=823 uact=0 obj=35.42289 kkt=7.60929 -iter=2 terr=0.02352 serr=0.30741 act=823 uact=0 obj=41.86775 kkt=5.74464 -iter=3 terr=0.01836 serr=0.25881 act=823 uact=0 obj=47.29565 kkt=6.64895 -iter=4 terr=0.01106 serr=0.17011 act=823 uact=0 obj=50.68792 kkt=3.81902 -iter=5 terr=0.00610 serr=0.10085 act=823 uact=0 obj=52.58096 kkt=3.98915 -iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000 -... -- -
There are some parameters to control the MIRA training condition
-Use crf_test command: -
-% crf_test -m model_file test_files ... --
-where model_file is the file crf_learncreates. -In the testing, you don't need to specify the template file, -because the model file has the same information for the template. -test_file is the test data you want to assign sequential tags. -This file has to be written in the same format as training file. -
- - --Here is an output of crf_test:
- --% crf_test -m model test.data -Rockwell NNP B B -International NNP I I -Corp. NNP I I -'s POS B B -Tulsa NNP I I -unit NN I I -.. -- -
The last column is given (estimated) tag. -If the 3rd column is true answer tag , you can evaluate the accuracy -by simply seeing the difference between the 3rd and 4th columns.
- - -The -v option sets verbose level. default -value is 0. By increasing the level, you can have an -extra information from CRF++
- --% crf_test -v1 -m model test.data| head -# 0.478113 -Rockwell NNP B B/0.992465 -International NNP I I/0.979089 -Corp. NNP I I/0.954883 -'s POS B B/0.986396 -Tulsa NNP I I/0.991966 -... --
-The first line "# 0.478113" shows the conditional probably for the output. -Also, each output tag has a probability represented like "B/0.992465". -
- -You can also have marginal probabilities for all other candidates.
--% crf_test -v2 -m model test.data -# 0.478113 -Rockwell NNP B B/0.992465 B/0.992465 I/0.00144946 O/0.00608594 -International NNP I I/0.979089 B/0.0105273 I/0.979089 O/0.0103833 -Corp. NNP I I/0.954883 B/0.00477976 I/0.954883 O/0.040337 -'s POS B B/0.986396 B/0.986396 I/0.00655976 O/0.00704426 -Tulsa NNP I I/0.991966 B/0.00787494 I/0.991966 O/0.00015949 -unit NN I I/0.996169 B/0.00283111 I/0.996169 O/0.000999975 -.. --
-With the -n option, you can obtain N-best results -sorted by the conditional probability of CRF. -With n-best output mode, CRF++ first gives one additional line like "# N prob", where N means that -rank of the output starting from 0 and prob denotes the conditional -probability for the output.
- -Note that CRF++ sometimes -discards enumerating N-best results if it cannot find candidates any -more. This is the case when you give CRF++ a short -sentence.
- -CRF++ uses a combination of forward Viterbi and backward A* search. This combination -yields the exact list of n-best results.
- -Here is the example of the N-best results.
--% crf_test -n 20 -m model test.data -# 0 0.478113 -Rockwell NNP B B -International NNP I I -Corp. NNP I I -'s POS B B -... - -# 1 0.194335 -Rockwell NNP B B -International NNP I I -- - - -
- In the example directories, you can find three case studies, baseNP - chunking, Text Chunking, and Japanese named entity recognition, to use CRF++. -
- -- In each directory, please try the following commands -
- -% crf_learn template train model - % crf_test -m model test- -
$Id: index.html,v 1.23 2003/01/06 13:11:21 taku-ku Exp - $;
- - - taku@chasen.org - - - - diff --git a/doc/doxygen/annotated.html b/doxygen/annotated.html similarity index 100% rename from doc/doxygen/annotated.html rename to doxygen/annotated.html diff --git a/doc/doxygen/classCRFPP_1_1Model-members.html b/doxygen/classCRFPP_1_1Model-members.html similarity index 100% rename from doc/doxygen/classCRFPP_1_1Model-members.html rename to doxygen/classCRFPP_1_1Model-members.html diff --git a/doc/doxygen/classCRFPP_1_1Model.html b/doxygen/classCRFPP_1_1Model.html similarity index 100% rename from doc/doxygen/classCRFPP_1_1Model.html rename to doxygen/classCRFPP_1_1Model.html diff --git a/doc/doxygen/classCRFPP_1_1Tagger-members.html b/doxygen/classCRFPP_1_1Tagger-members.html similarity index 100% rename from doc/doxygen/classCRFPP_1_1Tagger-members.html rename to doxygen/classCRFPP_1_1Tagger-members.html diff --git a/doc/doxygen/classCRFPP_1_1Tagger.html b/doxygen/classCRFPP_1_1Tagger.html similarity index 100% rename from doc/doxygen/classCRFPP_1_1Tagger.html rename to doxygen/classCRFPP_1_1Tagger.html diff --git a/doc/doxygen/classes.html b/doxygen/classes.html similarity index 100% rename from doc/doxygen/classes.html rename to doxygen/classes.html diff --git a/doc/doxygen/crfpp_8h-source.html b/doxygen/crfpp_8h-source.html similarity index 100% rename from doc/doxygen/crfpp_8h-source.html rename to doxygen/crfpp_8h-source.html diff --git a/doc/doxygen/crfpp_8h.html b/doxygen/crfpp_8h.html similarity index 100% rename from doc/doxygen/crfpp_8h.html rename to doxygen/crfpp_8h.html diff --git a/doc/doxygen/crfpp_8h_source.html b/doxygen/crfpp_8h_source.html similarity index 100% rename from doc/doxygen/crfpp_8h_source.html rename to doxygen/crfpp_8h_source.html diff --git a/doc/doxygen/doxygen.css b/doxygen/doxygen.css similarity index 100% rename from doc/doxygen/doxygen.css rename to doxygen/doxygen.css diff --git a/doc/doxygen/doxygen.png b/doxygen/doxygen.png similarity index 100% rename from doc/doxygen/doxygen.png rename 
to doxygen/doxygen.png diff --git a/doc/doxygen/files.html b/doxygen/files.html similarity index 100% rename from doc/doxygen/files.html rename to doxygen/files.html diff --git a/doc/doxygen/functions.html b/doxygen/functions.html similarity index 100% rename from doc/doxygen/functions.html rename to doxygen/functions.html diff --git a/doc/doxygen/functions_func.html b/doxygen/functions_func.html similarity index 100% rename from doc/doxygen/functions_func.html rename to doxygen/functions_func.html diff --git a/doc/doxygen/globals.html b/doxygen/globals.html similarity index 100% rename from doc/doxygen/globals.html rename to doxygen/globals.html diff --git a/doc/doxygen/globals_func.html b/doxygen/globals_func.html similarity index 100% rename from doc/doxygen/globals_func.html rename to doxygen/globals_func.html diff --git a/doc/doxygen/globals_type.html b/doxygen/globals_type.html similarity index 100% rename from doc/doxygen/globals_type.html rename to doxygen/globals_type.html diff --git a/doc/doxygen/index.html b/doxygen/index.html similarity index 100% rename from doc/doxygen/index.html rename to doxygen/index.html diff --git a/doc/doxygen/namespaceCRFPP.html b/doxygen/namespaceCRFPP.html similarity index 100% rename from doc/doxygen/namespaceCRFPP.html rename to doxygen/namespaceCRFPP.html diff --git a/doc/doxygen/namespacemembers.html b/doxygen/namespacemembers.html similarity index 100% rename from doc/doxygen/namespacemembers.html rename to doxygen/namespacemembers.html diff --git a/doc/doxygen/namespacemembers_func.html b/doxygen/namespacemembers_func.html similarity index 100% rename from doc/doxygen/namespacemembers_func.html rename to doxygen/namespacemembers_func.html diff --git a/doc/doxygen/namespaces.html b/doxygen/namespaces.html similarity index 100% rename from doc/doxygen/namespaces.html rename to doxygen/namespaces.html diff --git a/doc/doxygen/tab_b.gif b/doxygen/tab_b.gif similarity index 100% rename from doc/doxygen/tab_b.gif rename to 
doxygen/tab_b.gif diff --git a/doc/doxygen/tab_l.gif b/doxygen/tab_l.gif similarity index 100% rename from doc/doxygen/tab_l.gif rename to doxygen/tab_l.gif diff --git a/doc/doxygen/tab_r.gif b/doxygen/tab_r.gif similarity index 100% rename from doc/doxygen/tab_r.gif rename to doxygen/tab_r.gif diff --git a/doc/doxygen/tabs.css b/doxygen/tabs.css similarity index 100% rename from doc/doxygen/tabs.css rename to doxygen/tabs.css diff --git a/index.html b/index.html index 802992c..6e5ea17 100644 --- a/index.html +++ b/index.html @@ -1 +1,831 @@ -Hello world + + + + + +CRF++ is a simple, customizable, and open source + implementation of Conditional Random Fields (CRFs) + for segmenting/labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as + Named Entity Recognition, Information Extraction, and Text Chunking. + +
+% ./configure +% make +% su +# make install ++ You can change the default install path with the --prefix + option of the configure script.
Both the training file and the test file need to be in a + particular format for CRF++ to work properly. + Generally speaking, training and test files must consist of + multiple tokens. In addition, each token + consists of multiple columns, and the number of columns is fixed. The + definition of a token depends on the task, but in + most typical cases tokens simply correspond to + words. Each token must be represented on one line, + with the columns separated by white space (spaces or + tab characters). A sequence of tokens forms a + sentence. An empty line marks the boundary between + sentences.
+ +You can give as many columns as you like, but the + number of columns must be the same for all tokens. + Furthermore, the columns carry some kind of "semantics". + For example, the first column is 'word', the second column + is 'POS tag', the third column is 'sub-category of POS', and so + on.
+ +The last column holds the true answer tag that the CRF + is trained to predict.
+ +Here's an example of such a file: (data for CoNLL shared + task)
++He PRP B-NP +reckons VBZ B-VP +the DT B-NP +current JJ I-NP +account NN I-NP +deficit NN I-NP +will MD B-VP +narrow VB I-VP +to TO B-PP +only RB B-NP +# # I-NP +1.8 CD I-NP +billion CD I-NP +in IN B-PP +September NNP B-NP +. . O + +He PRP B-NP +reckons VBZ B-VP +.. ++ +
There are 3 columns for each token.
+ +The following data is invalid, since the second and third + tokens have only 2 columns. (They have no POS + column.) The number of columns must be fixed.
++He PRP B-NP +reckons B-VP +the B-NP +current JJ I-NP +account NN I-NP +.. ++ +
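The fixed-column rule can be checked before training. The following is a minimal sketch in Python, not part of CRF++; the function name `check_columns` is hypothetical, and it assumes that the first token of the file has the intended number of columns.

```python
def check_columns(lines):
    """Return (line_number, column_count) pairs for tokens whose column
    count differs from that of the first token.

    Empty lines are sentence boundaries and are skipped.
    """
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        columns = line.split()  # columns are separated by spaces or tabs
        if not columns:
            continue
        if expected is None:
            expected = len(columns)
        elif len(columns) != expected:
            bad.append((number, len(columns)))
    return bad


# The invalid sample from the text: tokens 2 and 3 lack the POS column.
sample = [
    "He PRP B-NP",
    "reckons B-VP",
    "the B-NP",
    "current JJ I-NP",
    "account NN I-NP",
]
print(check_columns(sample))  # [(2, 2), (3, 2)]
```

An empty result means every token has the same number of columns as the first one.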
+ As CRF++ is designed as a general-purpose tool, you have to + specify the feature templates in advance in a template file. This file describes + which features are used in training and testing. +
+ ++ Each line in the template file denotes one template. + In each template, the special macro %x[row,col] is + used to specify a token in the input data. row specifies the + relative position from the current token in focus, + and col specifies the absolute position of the column. +
+ +Here you can find some examples for the replacements
++Input: Data +He PRP B-NP +reckons VBZ B-VP +the DT B-NP << CURRENT TOKEN +current JJ I-NP +account NN I-NP ++ +
+
template | +expanded feature | +
%x[0,0] | +the | +
%x[0,1] | +DT | +
%x[-1,0] | +reckons | +
%x[-2,1] | +PRP | +
%x[0,0]/%x[0,1] | +the/DT | +
ABC%x[0,1]123 | +ABCDT123 | +
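The replacements in the table can be reproduced with a few lines of Python. This is only a sketch of the macro semantics described here, not the CRF++ implementation; the `expand` function and the `_OUT_` placeholder for positions outside the sentence are hypothetical (CRF++ uses its own boundary markers).

```python
import re

# Matches %x[row,col], where row may be negative and col is non-negative.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, sentence, position):
    """Expand every %x[row,col] macro in `template` for the token at
    index `position` of `sentence` (a list of column lists)."""
    def replace(match):
        row = position + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))             # col is an absolute column index
        if 0 <= row < len(sentence):
            return sentence[row][col]
        return "_OUT_"  # assumption: placeholder outside the sentence
    return MACRO.sub(replace, template)


sentence = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],      # current token (index 2)
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, sentence, 2))
```

Text outside a macro, such as the `/` separator or the `ABC` and `123` affixes, is copied through unchanged.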
Note also that there are two types of templates. + The types are specified with the first character of templates. +
++ This is a template to describe unigram features. + When you give a template "U01:%x[0,1]", CRF++ automatically + generates a set of feature functions (func1 ... funcN) like: +
+ ++func1 = if (output = B-NP and feature="U01:DT") return 1 else return 0 +func2 = if (output = I-NP and feature="U01:DT") return 1 else return 0 +func3 = if (output = O and feature="U01:DT") return 1 else return 0 +.... +funcXX = if (output = B-NP and feature="U01:NN") return 1 else return 0 +funcXY = if (output = O and feature="U01:NN") return 1 else return 0 +...+ +
+ The number of feature functions generated by a template amounts to + (L * N), where L is the number of output classes and N is the + number of unique strings expanded from the given template. +
+ ++ This is a template to describe bigram features. + With this template, a combination of the current output token and previous output token + (bigram) is automatically generated. Note that this type of template generates a total of + (L * L * N) distinct features, where L is the + number of output classes and N is the number + of unique features generated by the templates. + When the number of classes is large, this type of template produces + a very large number of distinct features, which makes both + training and testing inefficient. +
+ ++ The words unigram/bigram can be confusing, since a macro for unigram features + does allow you to write a word-level bigram like %x[-1,0]%x[0,0]. Here, + unigram and bigram features mean uni/bigrams of output tags.
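To see how quickly bigram templates grow compared with unigram templates, the two counts (L * N) and (L * L * N) can be evaluated directly. The numbers below are illustrative assumptions, not measurements from any data set, and the function names are hypothetical.

```python
def unigram_feature_count(num_classes, num_unique_strings):
    """L * N feature functions for a unigram (U) template."""
    return num_classes * num_unique_strings


def bigram_feature_count(num_classes, num_unique_strings):
    """L * L * N feature functions for a bigram (B) template."""
    return num_classes * num_classes * num_unique_strings


# Illustrative values only: 23 output classes, 10000 unique expanded strings.
L, N = 23, 10000
print(unigram_feature_count(L, N))  # 230000
print(bigram_feature_count(L, N))   # 5290000
```

With these assumed values the bigram count is L times larger than the unigram count, which is why a bigram template combined with a macro is costly when there are many classes.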
++You also need to put an identifier in templates when relative positions of +tokens must be distinguished. +
++In the following case, the macros "%x[-2,1]" and "%x[1,1]" are both replaced +with "DT", but they refer to different "DT" tokens. +
++The DT B-NP +pen NN I-NP +is VB B-VP << CURRENT TOKEN +a DT B-NP ++ +
To distinguish the two, put a unique identifier (U01: or U02:) in the +template:
++U01:%x[-2,1] +U02:%x[1,1] ++
+In this case the two templates are regarded as different, as +they are expanded into different features, "U01:DT" and "U02:DT". +You can use any identifier you like, but +it is useful to use numbers to manage them, because they simply +correspond to feature IDs. +
+ ++If you want to use a "bag-of-words" feature, in other words, +one that ignores the relative position of features, you don't need to +put such identifiers. +
+ +Here is the template example for the CoNLL 2000 shared task and the Base-NP chunking +task. Only one bigram template ('B') is used. This means that +only combinations of the previous output token and the current token are +used as bigram features. Lines starting with # and empty lines are +discarded as comments.
++# Unigram +U00:%x[-2,0] +U01:%x[-1,0] +U02:%x[0,0] +U03:%x[1,0] +U04:%x[2,0] +U05:%x[-1,0]/%x[0,0] +U06:%x[0,0]/%x[1,0] + +U10:%x[-2,1] +U11:%x[-1,1] +U12:%x[0,1] +U13:%x[1,1] +U14:%x[2,1] +U15:%x[-2,1]/%x[-1,1] +U16:%x[-1,1]/%x[0,1] +U17:%x[0,1]/%x[1,1] +U18:%x[1,1]/%x[2,1] + +U20:%x[-2,1]/%x[-1,1]/%x[0,1] +U21:%x[-1,1]/%x[0,1]/%x[1,1] +U22:%x[0,1]/%x[1,1]/%x[2,1] + +# Bigram +B ++ + + + +
Use the crf_learn command: +
+% crf_learn template_file train_file model_file ++
+where template_file and train_file +are the files you need to prepare in advance. +crf_learn generates the trained model file in +model_file. +
+ +crf_learn outputs the following information.
++CRF++: Yet Another CRF Tool Kit +Copyright(C) 2005 Taku Kudo, All rights reserved. + +reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. +Done! 1.94 s + +Number of sentences: 823 +Number of features: 1075862 +Number of thread(s): 1 +Freq: 1 +eta: 0.00010 +C: 1.00000 +shrinking size: 20 +Algorithm: CRF + +iter=0 terr=0.99103 serr=1.00000 obj=54318.36623 diff=1.00000 +iter=1 terr=0.35260 serr=0.98177 obj=44996.53537 diff=0.17161 +iter=2 terr=0.35260 serr=0.98177 obj=21032.70195 diff=0.53257 +iter=3 terr=0.23879 serr=0.94532 obj=13642.32067 diff=0.35138 +iter=4 terr=0.15324 serr=0.88700 obj=8985.70071 diff=0.34134 +iter=5 terr=0.11605 serr=0.80680 obj=7118.89846 diff=0.20775 +iter=6 terr=0.09305 serr=0.72175 obj=5531.31015 diff=0.22301 +iter=7 terr=0.08132 serr=0.68408 obj=4618.24644 diff=0.16507 +iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953 ++ +
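The per-iteration lines in this log are plain `key=value` pairs, so they are easy to collect for plotting or for monitoring convergence. The following Python sketch assumes only that format; `parse_iteration` is a hypothetical helper, not something shipped with CRF++.

```python
def parse_iteration(line):
    """Turn 'iter=0 terr=0.99103 ...' into {'iter': 0, 'terr': 0.99103, ...}."""
    record = {}
    for field in line.split():
        key, _, value = field.partition("=")
        # The iteration counter is an integer; every other field is numeric.
        record[key] = int(value) if key == "iter" else float(value)
    return record


line = "iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953"
print(parse_iteration(line))
```

Only lines that start with `iter=` should be passed to this function; the header lines of the log use a different layout.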
+There are 4 major parameters to control the training condition +
Here is an example that sets two of these parameters, -f and -c.
++% crf_learn -f 3 -c 1.5 template_file train_file model_file ++
Since version 0.45, CRF++ supports single-best MIRA training. +MIRA training is used when the -a MIRA option is set. +
+% crf_learn -a MIRA template train.data model +CRF++: Yet Another CRF Tool Kit +Copyright(C) 2005 Taku Kudo, All rights reserved. + +reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. +Done! 1.92 s + +Number of sentences: 823 +Number of features: 1075862 +Number of thread(s): 1 +Freq: 1 +eta: 0.00010 +C: 1.00000 +shrinking size: 20 +Algorithm: MIRA + +iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000 +iter=1 terr=0.04710 serr=0.49818 act=823 uact=0 obj=35.42289 kkt=7.60929 +iter=2 terr=0.02352 serr=0.30741 act=823 uact=0 obj=41.86775 kkt=5.74464 +iter=3 terr=0.01836 serr=0.25881 act=823 uact=0 obj=47.29565 kkt=6.64895 +iter=4 terr=0.01106 serr=0.17011 act=823 uact=0 obj=50.68792 kkt=3.81902 +iter=5 terr=0.00610 serr=0.10085 act=823 uact=0 obj=52.58096 kkt=3.98915 +iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000 +... ++ +
There are some parameters to control the MIRA training condition
+Use the crf_test command: +
+% crf_test -m model_file test_files ... ++
+where model_file is the file crf_learn creates. +In testing, you don't need to specify the template file, +because the model file already contains the template information. +test_file is the test data to which you want to assign sequential tags. +This file has to be written in the same format as the training file. +
+ + ++Here is an output of crf_test:
+ ++% crf_test -m model test.data +Rockwell NNP B B +International NNP I I +Corp. NNP I I +'s POS B B +Tulsa NNP I I +unit NN I I +.. ++ +
The last column is the estimated tag. +If the 3rd column is the true answer tag, you can evaluate the accuracy +by simply comparing the 3rd and 4th columns.
+ + +The -v option sets the verbose level. The default +value is 0. By increasing the level, you can get +extra information from CRF++.
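Comparing the true tag with the estimated tag can be automated. The sketch below computes token-level accuracy from plain crf_test output (no -v or -n options), assuming the true answer tag is the second-to-last column and the estimated tag is the last one; `token_accuracy` is a hypothetical name.

```python
def token_accuracy(output_lines):
    """Fraction of tokens whose last two columns (true tag, estimated tag) agree."""
    correct = 0
    total = 0
    for line in output_lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip empty lines (sentence boundaries)
        total += 1
        if columns[-2] == columns[-1]:
            correct += 1
    return correct / total if total else 0.0


output = [
    "Rockwell NNP B B",
    "International NNP I I",
    "Corp. NNP I I",
    "'s POS B B",
    "Tulsa NNP I I",
    "unit NN I I",
]
print(token_accuracy(output))  # 1.0
```

This measures per-token agreement only; it does not score whole chunks or whole sentences.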
+ ++% crf_test -v1 -m model test.data| head +# 0.478113 +Rockwell NNP B B/0.992465 +International NNP I I/0.979089 +Corp. NNP I I/0.954883 +'s POS B B/0.986396 +Tulsa NNP I I/0.991966 +... ++
+The first line "# 0.478113" shows the conditional probability for the output. +Also, each output tag has a probability represented like "B/0.992465". +
+ +You can also get marginal probabilities for all other candidates.
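Output at verbose level 1 can be split into the sequence probability and the per-token tag probabilities. The following Python sketch handles one sentence of -v1 output in the layout shown above; `parse_verbose` is a hypothetical helper, and it is not meant for -v2 or -n output, whose lines have additional columns.

```python
def parse_verbose(lines):
    """Parse one sentence of `crf_test -v1` output.

    Returns (sequence_probability, tokens), where tokens is a list of
    (word, estimated_tag, tag_probability) tuples.
    """
    sequence_probability = None
    tokens = []
    for line in lines:
        columns = line.split()
        if len(columns) == 2 and columns[0] == "#":
            # Header line such as "# 0.478113".
            sequence_probability = float(columns[1])
        elif len(columns) >= 3 and "/" in columns[-1]:
            # Last column looks like "B/0.992465".
            tag, _, probability = columns[-1].rpartition("/")
            tokens.append((columns[0], tag, float(probability)))
    return sequence_probability, tokens


lines = [
    "# 0.478113",
    "Rockwell NNP B B/0.992465",
    "International NNP I I/0.979089",
    "Corp. NNP I I/0.954883",
]
probability, tokens = parse_verbose(lines)
print(probability)
for word, tag, p in tokens:
    print(word, tag, p)
```

The header test requires exactly two columns so that a data token whose word is "#" is not mistaken for the header.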
++% crf_test -v2 -m model test.data +# 0.478113 +Rockwell NNP B B/0.992465 B/0.992465 I/0.00144946 O/0.00608594 +International NNP I I/0.979089 B/0.0105273 I/0.979089 O/0.0103833 +Corp. NNP I I/0.954883 B/0.00477976 I/0.954883 O/0.040337 +'s POS B B/0.986396 B/0.986396 I/0.00655976 O/0.00704426 +Tulsa NNP I I/0.991966 B/0.00787494 I/0.991966 O/0.00015949 +unit NN I I/0.996169 B/0.00283111 I/0.996169 O/0.000999975 +.. ++
+With the -n option, you can obtain N-best results +sorted by the conditional probability of CRF. +With n-best output mode, CRF++ first gives one additional line like "# N prob", where N is the +rank of the output starting from 0 and prob denotes the conditional +probability for the output.
+ +Note that CRF++ sometimes +stops enumerating N-best results if it cannot find any more candidates. +This is the case when you give CRF++ a short +sentence.
+ +CRF++ uses a combination of forward Viterbi and backward A* search. This combination +yields the exact list of n-best results.
+ +Here is an example of the N-best results.
++% crf_test -n 20 -m model test.data +# 0 0.478113 +Rockwell NNP B B +International NNP I I +Corp. NNP I I +'s POS B B +... + +# 1 0.194335 +Rockwell NNP B B +International NNP I I ++ + + +
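N-best output can be grouped by rank using the "# N prob" header lines. The Python sketch below is one possible way to do this under the layout shown above; `parse_nbest` is a hypothetical helper, and its header test (three columns, "#", then an integer rank) is a heuristic that could misfire on data whose own columns happen to look like a header.

```python
def parse_nbest(lines):
    """Group `crf_test -n` output into (rank, probability, token_rows) tuples."""
    results = []
    for line in lines:
        columns = line.split()
        if len(columns) == 3 and columns[0] == "#" and columns[1].isdigit():
            # Header line "# N prob" starts a new candidate.
            results.append((int(columns[1]), float(columns[2]), []))
        elif columns and results:
            # Token line belonging to the most recent candidate.
            results[-1][2].append(columns)
    return results


lines = [
    "# 0 0.478113",
    "Rockwell NNP B B",
    "International NNP I I",
    "",
    "# 1 0.194335",
    "Rockwell NNP B B",
]
for rank, probability, rows in parse_nbest(lines):
    print(rank, probability, len(rows))
```

Each candidate keeps its token rows as lists of columns, so the estimated tag of a token is the last element of its row.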
+ In the example directories, you can find three case studies that use CRF++: baseNP + chunking, Text Chunking, and Japanese named entity recognition. +
+ ++ In each directory, please try the following commands +
+ +% crf_learn template train model + % crf_test -m model test+ +
$Id: index.html,v 1.23 2003/01/06 13:11:21 taku-ku Exp + $;
+ + + taku@chasen.org + + + + diff --git a/winmain.h b/winmain.h deleted file mode 100644 index 74d3a02..0000000 --- a/winmain.h +++ /dev/null @@ -1,69 +0,0 @@ -// -// CRF++ -- Yet Another CRF toolkit -// -// $Id: common.h 1588 2007-02-12 09:03:39Z taku $; -// -// Copyright(C) 2005-2007 Taku Kudo