diff --git a/doc/crfpp.cfg b/crfpp.cfg
similarity index 100%
rename from doc/crfpp.cfg
rename to crfpp.cfg
diff --git a/doc/default.css b/default.css
similarity index 100%
rename from doc/default.css
rename to default.css
diff --git a/doc/index.html b/doc/index.html
deleted file mode 100644
index 6e5ea17..0000000
--- a/doc/index.html
+++ /dev/null
@@ -1,831 +0,0 @@
- - - - - - CRF++: Yet Another CRF toolkit - - - - -

CRF++: Yet Another CRF toolkit

- -

Introduction

- -

CRF++ is a simple, customizable, and open source - implementation of Conditional Random Fields (CRFs) - for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as - Named Entity Recognition, Information Extraction and Text Chunking. - -

Table of contents

- - -

Features

- - - -

News

- - -

Download

- - - -

Installation

- - - -

Usage

- -

Training and Test file formats

- -

Both the training file and the test file need to be in a - particular format for CRF++ to work properly. - Generally speaking, training and test file must consist of - multiple tokens. In addition, a token - consists of multiple (but fixed-numbers) columns. The - definition of tokens depends on tasks, however, in - most of typical cases, they simply correspond to - words. Each token must be represented in one line, - with the columns separated by white space (spaces or - tabular characters). A sequence of token becomes a - sentence. To identify the boundary between - sentences, an empty line is put.

- -

You can give as many columns as you like, however the - number of columns must be fixed through all tokens. - Furthermore, there are some kinds of "semantics" among the - columns. For example, 1st column is 'word', second column - is 'POS tag' third column is 'sub-category of POS' and so - on.

- -

The last column represents a true answer tag which is going - to be trained by CRF.

- -

Here's an example of such a file: (data for CoNLL shared - task)

-
-He        PRP  B-NP
-reckons   VBZ  B-VP
-the       DT   B-NP
-current   JJ   I-NP
-account   NN   I-NP
-deficit   NN   I-NP
-will      MD   B-VP
-narrow    VB   I-VP
-to        TO   B-PP
-only      RB   B-NP
-#         #    I-NP
-1.8       CD   I-NP
-billion   CD   I-NP
-in        IN   B-PP
-September NNP  B-NP
-.         .    O
-
-He        PRP  B-NP
-reckons   VBZ  B-VP
-..
-
- -

There are 3 columns for each token.

- - - -

The following data is invalid, since the number of - columns of second and third are 2. (They have no POS - column.) The number of columns should be fixed.

-
-He        PRP  B-NP
-reckons   B-VP
-the       B-NP
-current   JJ   I-NP
-account   NN   I-NP
-..
-
- -

Preparing feature templates

-

- As CRF++ is designed as a general purpose tool, you have to - specify the feature templates in advance. This file describes - which features are used in training and testing. -

- - - - - -

Training (encoding)

- -

Use crf_learn command: -

-% crf_learn template_file train_file model_file
-
-

-where template_file and train_file -are the files you need to prepare in advance. -crf_learn generates the trained model file in -model_file. -

- -

crf_learn outputs the following information.

-
-CRF++: Yet Another CRF Tool Kit
-Copyright(C) 2005 Taku Kudo, All rights reserved.
-
-reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. 
-Done! 1.94 s
-
-Number of sentences: 823
-Number of features:  1075862
-Number of thread(s): 1
-Freq:                1
-eta:                 0.00010
-C:                   1.00000
-shrinking size:      20
-Algorithm:           CRF
-
-iter=0 terr=0.99103 serr=1.00000 obj=54318.36623 diff=1.00000
-iter=1 terr=0.35260 serr=0.98177 obj=44996.53537 diff=0.17161
-iter=2 terr=0.35260 serr=0.98177 obj=21032.70195 diff=0.53257
-iter=3 terr=0.23879 serr=0.94532 obj=13642.32067 diff=0.35138
-iter=4 terr=0.15324 serr=0.88700 obj=8985.70071 diff=0.34134
-iter=5 terr=0.11605 serr=0.80680 obj=7118.89846 diff=0.20775
-iter=6 terr=0.09305 serr=0.72175 obj=5531.31015 diff=0.22301
-iter=7 terr=0.08132 serr=0.68408 obj=4618.24644 diff=0.16507
-iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953
-
- - - -

-There are 4 major parameters to control the training condition -

- -

Here is the example where these two parameters are used.

-
-% crf_learn -f 3 -c 1.5 template_file train_file model_file
-
-

Since version 0.45, CRF++ supports single-best MIRA training. -MIRA training is used when -a MIRA option is set. -

-% crf_learn -a MIRA template train.data model
-CRF++: Yet Another CRF Tool Kit
-Copyright(C) 2005 Taku Kudo, All rights reserved.
-
-reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. 
-Done! 1.92 s
-
-Number of sentences: 823
-Number of features:  1075862
-Number of thread(s): 1
-Freq:                1
-eta:                 0.00010
-C:                   1.00000
-shrinking size:      20
-Algorithm:           MIRA
-
-iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000
-iter=1 terr=0.04710 serr=0.49818 act=823 uact=0 obj=35.42289 kkt=7.60929
-iter=2 terr=0.02352 serr=0.30741 act=823 uact=0 obj=41.86775 kkt=5.74464
-iter=3 terr=0.01836 serr=0.25881 act=823 uact=0 obj=47.29565 kkt=6.64895
-iter=4 terr=0.01106 serr=0.17011 act=823 uact=0 obj=50.68792 kkt=3.81902
-iter=5 terr=0.00610 serr=0.10085 act=823 uact=0 obj=52.58096 kkt=3.98915
-iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000
-...
-
- - - -

There are some parameters to control the MIRA training condition

- - -

Testing (decoding)

- -

Use crf_test command: -

-% crf_test -m model_file test_files ...
-
-

-where model_file is the file crf_learncreates. -In the testing, you don't need to specify the template file, -because the model file has the same information for the template. -test_file is the test data you want to assign sequential tags. -This file has to be written in the same format as training file. -

- - -

-Here is an output of crf_test:

- -
-% crf_test -m model test.data
-Rockwell        NNP     B       B
-International   NNP     I       I
-Corp.   NNP     I       I
-'s      POS     B       B
-Tulsa   NNP     I       I
-unit    NN      I       I
-..
-
- -

The last column is given (estimated) tag. -If the 3rd column is true answer tag , you can evaluate the accuracy -by simply seeing the difference between the 3rd and 4th columns.

- - - - - -

Tips

- - -

Case studies

-

- In the example directories, you can find three case studies, baseNP - chunking, Text Chunking, and Japanese named entity recognition, to use CRF++. -

- -

- In each directory, please try the following commands -

- -
 % crf_learn template train model
- % crf_test  -m model test 
- -

To Do

- - -

References

- -
- -

$Id: index.html,v 1.23 2003/01/06 13:11:21 taku-ku Exp - $;

- -
- taku@chasen.org -
- - -
diff --git a/doc/doxygen/annotated.html b/doxygen/annotated.html
similarity index 100%
rename from doc/doxygen/annotated.html
rename to doxygen/annotated.html
diff --git a/doc/doxygen/classCRFPP_1_1Model-members.html b/doxygen/classCRFPP_1_1Model-members.html
similarity index 100%
rename from doc/doxygen/classCRFPP_1_1Model-members.html
rename to doxygen/classCRFPP_1_1Model-members.html
diff --git a/doc/doxygen/classCRFPP_1_1Model.html b/doxygen/classCRFPP_1_1Model.html
similarity index 100%
rename from doc/doxygen/classCRFPP_1_1Model.html
rename to doxygen/classCRFPP_1_1Model.html
diff --git a/doc/doxygen/classCRFPP_1_1Tagger-members.html b/doxygen/classCRFPP_1_1Tagger-members.html
similarity index 100%
rename from doc/doxygen/classCRFPP_1_1Tagger-members.html
rename to doxygen/classCRFPP_1_1Tagger-members.html
diff --git a/doc/doxygen/classCRFPP_1_1Tagger.html b/doxygen/classCRFPP_1_1Tagger.html
similarity index 100%
rename from doc/doxygen/classCRFPP_1_1Tagger.html
rename to doxygen/classCRFPP_1_1Tagger.html
diff --git a/doc/doxygen/classes.html b/doxygen/classes.html
similarity index 100%
rename from doc/doxygen/classes.html
rename to doxygen/classes.html
diff --git a/doc/doxygen/crfpp_8h-source.html b/doxygen/crfpp_8h-source.html
similarity index 100%
rename from doc/doxygen/crfpp_8h-source.html
rename to doxygen/crfpp_8h-source.html
diff --git a/doc/doxygen/crfpp_8h.html b/doxygen/crfpp_8h.html
similarity index 100%
rename from doc/doxygen/crfpp_8h.html
rename to doxygen/crfpp_8h.html
diff --git a/doc/doxygen/crfpp_8h_source.html b/doxygen/crfpp_8h_source.html
similarity index 100%
rename from doc/doxygen/crfpp_8h_source.html
rename to doxygen/crfpp_8h_source.html
diff --git a/doc/doxygen/doxygen.css b/doxygen/doxygen.css
similarity index 100%
rename from doc/doxygen/doxygen.css
rename to doxygen/doxygen.css
diff --git a/doc/doxygen/doxygen.png b/doxygen/doxygen.png
similarity index 100%
rename from doc/doxygen/doxygen.png
rename to doxygen/doxygen.png
diff --git a/doc/doxygen/files.html b/doxygen/files.html
similarity index 100%
rename from doc/doxygen/files.html
rename to doxygen/files.html
diff --git a/doc/doxygen/functions.html b/doxygen/functions.html
similarity index 100%
rename from doc/doxygen/functions.html
rename to doxygen/functions.html
diff --git a/doc/doxygen/functions_func.html b/doxygen/functions_func.html
similarity index 100%
rename from doc/doxygen/functions_func.html
rename to doxygen/functions_func.html
diff --git a/doc/doxygen/globals.html b/doxygen/globals.html
similarity index 100%
rename from doc/doxygen/globals.html
rename to doxygen/globals.html
diff --git a/doc/doxygen/globals_func.html b/doxygen/globals_func.html
similarity index 100%
rename from doc/doxygen/globals_func.html
rename to doxygen/globals_func.html
diff --git a/doc/doxygen/globals_type.html b/doxygen/globals_type.html
similarity index 100%
rename from doc/doxygen/globals_type.html
rename to doxygen/globals_type.html
diff --git a/doc/doxygen/index.html b/doxygen/index.html
similarity index 100%
rename from doc/doxygen/index.html
rename to doxygen/index.html
diff --git a/doc/doxygen/namespaceCRFPP.html b/doxygen/namespaceCRFPP.html
similarity index 100%
rename from doc/doxygen/namespaceCRFPP.html
rename to doxygen/namespaceCRFPP.html
diff --git a/doc/doxygen/namespacemembers.html b/doxygen/namespacemembers.html
similarity index 100%
rename from doc/doxygen/namespacemembers.html
rename to doxygen/namespacemembers.html
diff --git a/doc/doxygen/namespacemembers_func.html b/doxygen/namespacemembers_func.html
similarity index 100%
rename from doc/doxygen/namespacemembers_func.html
rename to doxygen/namespacemembers_func.html
diff --git a/doc/doxygen/namespaces.html b/doxygen/namespaces.html
similarity index 100%
rename from doc/doxygen/namespaces.html
rename to doxygen/namespaces.html
diff --git a/doc/doxygen/tab_b.gif b/doxygen/tab_b.gif
similarity index 100%
rename from doc/doxygen/tab_b.gif
rename to doxygen/tab_b.gif
diff --git a/doc/doxygen/tab_l.gif b/doxygen/tab_l.gif
similarity index 100%
rename from doc/doxygen/tab_l.gif
rename to doxygen/tab_l.gif
diff --git a/doc/doxygen/tab_r.gif b/doxygen/tab_r.gif
similarity index 100%
rename from doc/doxygen/tab_r.gif
rename to doxygen/tab_r.gif
diff --git a/doc/doxygen/tabs.css b/doxygen/tabs.css
similarity index 100%
rename from doc/doxygen/tabs.css
rename to doxygen/tabs.css
diff --git a/index.html b/index.html
index 802992c..6e5ea17 100644
--- a/index.html
+++ b/index.html
@@ -1 +1,831 @@
-Hello world
+ + + + + + CRF++: Yet Another CRF toolkit + + + + +

CRF++: Yet Another CRF toolkit

+ +

Introduction

+ +

CRF++ is a simple, customizable, and open source + implementation of Conditional Random Fields (CRFs) + for segmenting/labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as + Named Entity Recognition, Information Extraction and Text Chunking. + +

Table of contents

+ + +

Features

+ + + +

News

+ + +

Download

+ + + +

Installation

+ + + +

Usage

+ +

Training and Test file formats

+ +

Both the training file and the test file need to be in a + particular format for CRF++ to work properly. + Generally speaking, training and test files must consist of + multiple tokens. In addition, each token + consists of multiple columns, and the number of columns is fixed. The + definition of a token depends on the task; in + most typical cases, tokens simply correspond to + words. Each token must be written on one line, + with the columns separated by white space (spaces or + tab characters). A sequence of tokens forms a + sentence. An empty line marks the boundary between + sentences.

+ +

You can give as many columns as you like; however, the + number of columns must be the same for all tokens. + Furthermore, the columns carry some kind of "semantics". + For example, the first column is 'word', the second column + is 'POS tag', the third column is 'sub-category of POS', and so + on.

+ +

The last column holds the true answer tag that the CRF + is trained to predict.

+ +

Here's an example of such a file (data from the CoNLL shared + task):

+
+He        PRP  B-NP
+reckons   VBZ  B-VP
+the       DT   B-NP
+current   JJ   I-NP
+account   NN   I-NP
+deficit   NN   I-NP
+will      MD   B-VP
+narrow    VB   I-VP
+to        TO   B-PP
+only      RB   B-NP
+#         #    I-NP
+1.8       CD   I-NP
+billion   CD   I-NP
+in        IN   B-PP
+September NNP  B-NP
+.         .    O
+
+He        PRP  B-NP
+reckons   VBZ  B-VP
+..
+
+ +

There are 3 columns for each token.

+ + + +

The following data is invalid, since the second and third + lines have only 2 columns. (They have no POS + column.) The number of columns must be the same on every line.

+
+He        PRP  B-NP
+reckons   B-VP
+the       B-NP
+current   JJ   I-NP
+account   NN   I-NP
+..
+
+ +
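The format rules above can be checked mechanically before training. The following is a minimal sketch in Python, not part of CRF++ itself; the helper names `read_sentences` and `check_columns` are hypothetical, and it assumes the column count of the first token is the expected one.

```python
def read_sentences(text):
    """Split CRF++-style data into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Columns are separated by white space (spaces or tabs).
        current.append(line.split())
    if current:
        sentences.append(current)
    return sentences


def check_columns(sentences):
    """List (sentence index, token index, column count) for every token
    whose column count differs from that of the first token (assumption:
    the first token defines the expected count)."""
    expected = None
    problems = []
    for si, sentence in enumerate(sentences):
        for ti, token in enumerate(sentence):
            if expected is None:
                expected = len(token)
            elif len(token) != expected:
                problems.append((si, ti, len(token)))
    return problems


valid = "He PRP B-NP\nreckons VBZ B-VP\n\nHe PRP B-NP\n"
invalid = "He PRP B-NP\nreckons B-VP\nthe B-NP\ncurrent JJ I-NP\n"

print(check_columns(read_sentences(valid)))
print(check_columns(read_sentences(invalid)))
```

For the invalid sample, the second and third tokens are reported because they have 2 columns instead of 3.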

Preparing feature templates

+

+ As CRF++ is designed as a general-purpose tool, you have to + specify the feature templates in advance. The template file describes + which features are used in training and testing. +

+ + + + + +

Training (encoding)

+ +

Use the crf_learn command: +

+% crf_learn template_file train_file model_file
+
+

+where template_file and train_file +are the files you need to prepare in advance. +crf_learn generates the trained model file in +model_file. +

+ +

crf_learn outputs the following information.

+
+CRF++: Yet Another CRF Tool Kit
+Copyright(C) 2005 Taku Kudo, All rights reserved.
+
+reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. 
+Done! 1.94 s
+
+Number of sentences: 823
+Number of features:  1075862
+Number of thread(s): 1
+Freq:                1
+eta:                 0.00010
+C:                   1.00000
+shrinking size:      20
+Algorithm:           CRF
+
+iter=0 terr=0.99103 serr=1.00000 obj=54318.36623 diff=1.00000
+iter=1 terr=0.35260 serr=0.98177 obj=44996.53537 diff=0.17161
+iter=2 terr=0.35260 serr=0.98177 obj=21032.70195 diff=0.53257
+iter=3 terr=0.23879 serr=0.94532 obj=13642.32067 diff=0.35138
+iter=4 terr=0.15324 serr=0.88700 obj=8985.70071 diff=0.34134
+iter=5 terr=0.11605 serr=0.80680 obj=7118.89846 diff=0.20775
+iter=6 terr=0.09305 serr=0.72175 obj=5531.31015 diff=0.22301
+iter=7 terr=0.08132 serr=0.68408 obj=4618.24644 diff=0.16507
+iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953
+
+ + + +
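When monitoring a long training run, it can help to turn the per-iteration lines into structured records. The sketch below is an illustration in Python, not a CRF++ feature; `parse_progress` is a hypothetical helper that only assumes the `key=value` layout visible in the output above and does not interpret what each key means.

```python
import re

# Matches key=value pairs such as "iter=0" or "obj=54318.36623".
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_progress(line):
    """Parse one progress line into a dict.

    Values containing a dot become floats, all others ints.
    Lines without key=value pairs yield an empty dict.
    """
    record = {}
    for key, value in PAIR.findall(line):
        record[key] = float(value) if "." in value else int(value)
    return record


sample = "iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953"
print(parse_progress(sample))
```

Header lines such as `Number of sentences: 823` contain no `=` and are therefore skipped by returning an empty dict.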

+There are 4 major parameters that control the training conditions. +

+ +

Here is an example where two of these parameters, -f and -c, are used.

+
+% crf_learn -f 3 -c 1.5 template_file train_file model_file
+
+

Since version 0.45, CRF++ supports single-best MIRA training. +MIRA training is used when the -a MIRA option is set. +

+% crf_learn -a MIRA template train.data model
+CRF++: Yet Another CRF Tool Kit
+Copyright(C) 2005 Taku Kudo, All rights reserved.
+
+reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. 
+Done! 1.92 s
+
+Number of sentences: 823
+Number of features:  1075862
+Number of thread(s): 1
+Freq:                1
+eta:                 0.00010
+C:                   1.00000
+shrinking size:      20
+Algorithm:           MIRA
+
+iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000
+iter=1 terr=0.04710 serr=0.49818 act=823 uact=0 obj=35.42289 kkt=7.60929
+iter=2 terr=0.02352 serr=0.30741 act=823 uact=0 obj=41.86775 kkt=5.74464
+iter=3 terr=0.01836 serr=0.25881 act=823 uact=0 obj=47.29565 kkt=6.64895
+iter=4 terr=0.01106 serr=0.17011 act=823 uact=0 obj=50.68792 kkt=3.81902
+iter=5 terr=0.00610 serr=0.10085 act=823 uact=0 obj=52.58096 kkt=3.98915
+iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000
+...
+
+ + + +

There are some parameters to control the MIRA training condition

+ + +

Testing (decoding)

+ +

Use the crf_test command: +

+% crf_test -m model_file test_files ...
+
+

+where model_file is the file crf_learn creates. +When testing, you don't need to specify the template file, +because the model file already contains the template information. +Each of the test_files is test data to which you want to assign sequential tags. +These files have to be written in the same format as the training file. +

+ + +

+Here is an example of crf_test output:

+ +
+% crf_test -m model test.data
+Rockwell        NNP     B       B
+International   NNP     I       I
+Corp.   NNP     I       I
+'s      POS     B       B
+Tulsa   NNP     I       I
+unit    NN      I       I
+..
+
+ +

The last column is the tag estimated by CRF++. +If the 3rd column is the true answer tag, you can evaluate the accuracy +by simply comparing the 3rd and 4th columns.
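The comparison described above is easy to script. The following Python sketch is an illustration only; `tag_accuracy` is a hypothetical helper, and it assumes the true tag is the second-to-last column and the estimated tag is the last column, as in the output shown.

```python
def tag_accuracy(output_text):
    """Return the fraction of tokens whose last two columns agree."""
    total = correct = 0
    for line in output_text.splitlines():
        columns = line.split()
        if len(columns) < 2:
            continue  # skip empty lines that separate sentences
        total += 1
        if columns[-2] == columns[-1]:
            correct += 1
    return correct / total if total else 0.0


# The crf_test output lines shown in the example above.
output = """Rockwell        NNP     B       B
International   NNP     I       I
Corp.   NNP     I       I
's      POS     B       B
Tulsa   NNP     I       I
unit    NN      I       I
"""

print(tag_accuracy(output))  # 1.0, since every estimated tag matches
```

This measures per-token accuracy only; it does not score whole chunks or sentences.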

+ + + + + +

Tips

+ + +

Case studies

+

+ The example directories contain three case studies that use CRF++: baseNP + chunking, Text Chunking, and Japanese named entity recognition. +

+ +

+ In each directory, try the following commands: +

+ +
 % crf_learn template train model
+ % crf_test  -m model test 
+ +

To Do

+ + +

References

+ +
+ +

$Id: index.html,v 1.23 2003/01/06 13:11:21 taku-ku Exp + $;

+ +
+ taku@chasen.org +
+ + +
diff --git a/winmain.h b/winmain.h
deleted file mode 100644
index 74d3a02..0000000
--- a/winmain.h
+++ /dev/null
@@ -1,69 +0,0 @@
-//
-// CRF++ -- Yet Another CRF toolkit
-//
-// $Id: common.h 1588 2007-02-12 09:03:39Z taku $;
-//
-// Copyright(C) 2005-2007 Taku Kudo
-//
-#if defined(_WIN32) || defined(__CYGWIN__)
-
-#include
-#include
-
-namespace {
-class CommandLine {
- public:
-  CommandLine(int argc, wchar_t **argv) : argc_(argc), argv_(0) {
-    argv_ = new char * [argc_];
-    for (int i = 0; i < argc_; ++i) {
-      const std::string arg = WideToUtf8(argv[i]);
-      argv_[i] = new char[arg.size() + 1];
-      ::memcpy(argv_[i], arg.data(), arg.size());
-      argv_[i][arg.size()] = '\0';
-    }
-  }
-  ~CommandLine() {
-    for (int i = 0; i < argc_; ++i) {
-      delete [] argv_[i];
-    }
-    delete [] argv_;
-  }
-
-  int argc() const { return argc_; }
-  char **argv() const { return argv_; }
-
- private:
-  static std::string WideToUtf8(const std::wstring &input) {
-    const int output_length = ::WideCharToMultiByte(CP_UTF8, 0,
-                                                    input.c_str(), -1, NULL, 0,
-                                                    NULL, NULL);
-    if (output_length == 0) {
-      return "";
-    }
-
-    char *input_encoded = new char[output_length + 1];
-    const int result = ::WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1,
-                                             input_encoded,
-                                             output_length + 1, NULL, NULL);
-    std::string output;
-    if (result > 0) {
-      output.assign(input_encoded);
-    }
-    delete [] input_encoded;
-    return output;
-  }
-
-  int argc_;
-  char **argv_;
-};
-}  // namespace
-
-#define main(argc, argv) wmain_to_main_wrapper(argc, argv)
-
-int wmain_to_main_wrapper(int argc, char **argv);
-
-int wmain(int argc, wchar_t **argv) {
-  CommandLine cmd(argc, argv);
-  return wmain_to_main_wrapper(cmd.argc(), cmd.argv());
-}
-#endif