This repository contains the HUE dataset, Hanja pretrained language models, and the baseline code for fine-tuning them. See our paper for more details about HUE and the Hanja PLMs.
HUE is composed of 4 tasks:
- Chronological Attribution (CA)
- Topic Classification (TC)
- Named Entity Recognition (NER)
- Summary Retrieval (SR)
HUE aims to encourage the training of Hanja language models that help analyze Korean historical documents written in Hanja, an extinct language.
You can download all 12 models here, or download each one individually from the table below:
| Model Name | Size |
|---|---|
| AnchiBERT + AJD/DRS | 379.2 MB |
| mBERT + AJD/DRS | 379.2 MB |
Make sure you have installed the packages listed in environment.yml. If you use conda, you can create an environment from this file with the following command:

```
conda env create -f environment.yml
```
Code, data, and models should be placed according to the following directory tree:
```
HUE
├── code
│   ├── HUE_fine-tuning_Chronological_Attribution.ipynb
│   ├── HUE_fine-tuning_Named_Entity_Recognition.ipynb
│   ├── HUE_fine-tuning_Summary_Retrieval.ipynb
│   └── HUE_fine-tuning_Topic_Classification.ipynb
├── dataset
│   ├── HUE_Chronological_Attribution
│   │   ├── HUE_Chronological_Attribution.csv
│   │   ├── HUE_Chronological_Attribution_dev.csv
│   │   ├── HUE_Chronological_Attribution_test.csv
│   │   └── HUE_Chronological_Attribution_train.csv
│   ├── HUE_Named_Entity_Recognition
│   │   ├── HUE_Named_Entity_Recognition_dev.csv
│   │   ├── HUE_Named_Entity_Recognition_test.csv
│   │   └── HUE_Named_Entity_Recognition_train.csv
│   ├── HUE_Summary_Retrieval
│   │   ├── HUE_Summary_Retrieval_dev.csv
│   │   ├── HUE_Summary_Retrieval_test.csv
│   │   └── HUE_Summary_Retrieval_train.csv
│   └── HUE_Topic_Classification
│       ├── HUE_Topic_Classification.csv
│       ├── HUE_Topic_Classification_dev.csv
│       ├── HUE_Topic_Classification_test.csv
│       └── HUE_Topic_Classification_train.csv
├── model
│   ├── AnchiBERT+AJD-DRS
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.txt
│   └── mBERT+AJD-DRS
│       ├── config.json
│       ├── pytorch_model.bin
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── vocab.txt
└── tokenizer
    ├── AnchiBERT+AJD-DRS
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   └── vocab.txt
    └── mBERT+AJD-DRS
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── vocab.txt
```
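After downloading, you may want a quick sanity check that everything landed in the right place. The sketch below derives the expected model and tokenizer file paths from the directory tree above; the directory and file names come from the tree, while `expected_paths` and `verify_layout` are hypothetical helper names, not part of the released code.

```python
from pathlib import Path

# Model names and per-directory file lists, taken from the tree above.
MODELS = ["AnchiBERT+AJD-DRS", "mBERT+AJD-DRS"]
TOKENIZER_FILES = ["special_tokens_map.json", "tokenizer_config.json", "vocab.txt"]
MODEL_FILES = ["config.json", "pytorch_model.bin"] + TOKENIZER_FILES

def expected_paths(root="HUE"):
    """List every model/tokenizer file the directory tree says should exist."""
    paths = []
    for name in MODELS:
        paths += [Path(root, "model", name, f) for f in MODEL_FILES]
        paths += [Path(root, "tokenizer", name, f) for f in TOKENIZER_FILES]
    return paths

def verify_layout(root="HUE"):
    """Return the expected files that are missing under `root` (empty = OK)."""
    return [p for p in expected_paths(root) if not p.exists()]
```

Since each model directory holds a `config.json`, `pytorch_model.bin`, and the tokenizer files, pointing Hugging Face's `AutoModel.from_pretrained` and `AutoTokenizer.from_pretrained` at that local path should be enough to load a checkpoint.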