This project uses nearly 10 million Vietnamese sentences collected by the research team at news-corpus.
To download the dataset (corpus-title.txt, 578 MB), please visit the following link: Google Drive.
# Generate text without diacritics from raw text
# This script was run once in version 0.0.0
nohup python main.py collection > logs/data_collection_0_0_1.log 2>&1 &
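The collection script itself is not listed here; as a minimal sketch, the core step (stripping Vietnamese diacritics from raw text) could be done with Unicode decomposition plus an explicit mapping for đ/Đ. The function name below is illustrative, not the project's actual API.

```python
import unicodedata

def remove_diacritics(text: str) -> str:
    """Strip Vietnamese diacritics, e.g. 'Hà Nội' -> 'Ha Noi'."""
    # Decompose precomposed characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category 'Mn').
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # 'đ'/'Đ' are standalone letters, not combining marks, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")

print(remove_diacritics("Hôm nay Hà Nội trời mưa to quá."))
# Hom nay Ha Noi troi mua to qua.
```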
# Generate X and y datasets from text without diacritics and raw text (approx. 10 minutes)
# Note: during preprocessing, uppercase letters were not converted to lowercase,
# so uppercase characters fell outside the input dictionary and were encoded as 0.
# (This has since been fixed in the encoder function, which now lowercases every
# character so that all inputs stay within the dictionary.)
# Initial testing on a few cases suggests the issue does not significantly affect model performance at this stage.
nohup python main.py processing > logs/data_processing_0_0_1.log 2>&1 &
# Processed 9,487,416 samples. Saved: X -> X_transformer.pt, y -> y_transformer.pt
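The encoder function mentioned in the note above is not reproduced in this section. As a rough sketch, a fixed-length character encoder with the lowercasing fix could look like the following; the vocabulary construction, padding scheme, and helper names are assumptions, while the sequence length of 150 comes from the tensor shapes in the training log.

```python
import torch

MAX_LEN = 150  # matches the [*, 150] shapes reported below
PAD_ID = 0     # assumed: index 0 is reserved for padding / unknown characters

def build_vocab(alphabet: str) -> dict:
    # Reserve 0 for padding/unknown; real characters start at index 1.
    return {ch: i + 1 for i, ch in enumerate(alphabet)}

def encode(sentence: str, vocab: dict) -> torch.Tensor:
    # Lowercase first, so uppercase input hits the dictionary instead of
    # silently falling back to 0 (the issue described in the note above).
    sentence = sentence.lower()[:MAX_LEN]
    ids = [vocab.get(ch, PAD_ID) for ch in sentence]
    ids += [PAD_ID] * (MAX_LEN - len(ids))  # pad to the fixed length
    return torch.tensor(ids, dtype=torch.long)

# X would be built by encoding the un-accented text and y by encoding the
# original text, each with its own character dictionary.
```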
# Train the model: memory usage stabilized below 30 GB.
# The first epoch took approximately 30 hours.
nohup python main.py building > logs/model_building_0_0_1.log 2>&1 &
# Data loaded: X -> torch.Size([9487416, 150]), y -> torch.Size([9487416, 150])
# Epoch 1/20
# Train Loss: 0.3221 | Train Acc: 0.0972
# Val Loss: 0.1924 | Val Acc: 0.1020
# Saved best train model with loss 0.32209213210086346.
# Saved best val model with loss 0.1923793297414913.
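The `building` entry point is not shown in this section. As a minimal sketch, and assuming the tensors are loaded whole and split 8M/2M as described below, a character-level training epoch could look roughly like this (batch size, optimizer, and the ignore_index choice are assumptions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split
from main import Transformer  # the project's model class (assumed importable)

X = torch.load("X_transformer.pt")  # [9487416, 150] encoded un-accented characters
y = torch.load("y_transformer.pt")  # [9487416, 150] encoded target characters

dataset = TensorDataset(X, y)
train_set, val_set = random_split(dataset, [8_000_000, len(dataset) - 8_000_000])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = Transformer()
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumed: 0 is the padding id
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for xb, yb in train_loader:                      # one epoch
    optimizer.zero_grad()
    logits = model(xb)                           # [batch, 150, 75]
    loss = criterion(logits.reshape(-1, logits.size(-1)), yb.reshape(-1))
    loss.backward()
    optimizer.step()
```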
Below are demo results from the model after one epoch of training (approximately 30 hours). In the interactive demo, option 1 takes an accented sentence and checks the prediction recovered from its un-accented form, while option 2 takes a sentence typed without diacritics and the model predicts its accented form.
- Training dataset: 8 million sentences.
- Validation dataset: 2 million sentences.
- Model parameters: 1,087,051.
python main.py use_app
# Mô hình đã được tải thành công!
#
# Chọn tùy chọn:
# 1: Nhập một câu có dấu và kiểm tra dự đoán khôi phục dấu từ câu không dấu.
# 2: Nhập một câu không dấu để mô hình dự đoán thêm dấu.
# q: Thoát.
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: Hom nay Ha Noi troi mua to qua.
# Dự đoán thêm dấu: Hôm nay Hà Nội trời mưa to quá.
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: Ten cua toi la Pham Ngoc Hai.
# Dự đoán thêm dấu: Tên của tôi là Phạm Ngọc Hải.
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: Tai buoc xu ly du lieu
# Dự đoán thêm dấu: Tại bước xử lý dữ liệu
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: O buoc xu ly du lieu nay, toi quen mat chua viet thuong cac ky tu va boi vay nhung ky tu viet hoa bi ma hoa thanh gia tri 0.
# Dự đoán thêm dấu: ở bước xử lý dữ liệu này, tôi quên mặt chưa việt thường các ký từ và bởi vây những ký từ việt hóa bí mã hóa thanh giá trị 0.
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: Do quen mat them ham viet thuong ky tu lai
# Dự đoán thêm dấu: đỏ quen mất thêm hầm việt thường ký từ lại
#
# Chọn (1/2/q): 2
# Nhập câu không dấu: Do func lower.() khong duoc dung, boi vay khi train, data ma ky tu viet hoa thi se ma hoa thanh 0
# Dự đoán thêm dấu: độ func lower.() không được dùng, bởi váy khi train, data mà ký từ việt hóa thi sẽ mà hóa thanh 0
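Under the hood, option 2 presumably encodes the un-accented sentence, runs it through the encoder, and takes the most likely of the 75 output classes at each character position. The sketch below illustrates that idea; `encode`, `id_to_char`, and the function name are illustrative, not the project's actual helpers.

```python
import torch

@torch.no_grad()
def add_diacritics(model, sentence: str, vocab: dict, id_to_char: dict) -> str:
    """Predict the accented form of a sentence typed without diacritics."""
    model.eval()
    x = encode(sentence, vocab).unsqueeze(0)     # [1, 150], see the encode() sketch above
    logits = model(x)                            # [1, 150, 75]
    pred_ids = logits.argmax(dim=-1).squeeze(0)  # most likely output class per position
    return "".join(id_to_char.get(int(i), "") for i in pred_ids[: len(sentence)])

# add_diacritics(model, "Hom nay Ha Noi troi mua to qua.", vocab, id_to_char)
# -> "Hôm nay Hà Nội trời mưa to quá."   (first demo exchange above)
```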
As the demo shows, the model performs very well on sentences of moderate length with clear structure, reconstructing diacritics accurately. This reflects the advantage of a large dataset (nearly 10 million sentences).
With only about 1 million parameters, the results are already impressive. For longer, more complex, or structurally ambiguous sentences, however, the model still struggles to add diacritics correctly. This is understandable, since the training data consists of news headlines, which are deliberately short, attention-grabbing, and punchy.
In addition, the model has been trained for only one epoch (approximately 30 hours). While sentence-level accuracy remains limited, the loss has already decreased substantially, indicating good character-level predictions.
Further updates on model training with additional epochs will be provided in the near future.
python main.py model_info
Result:
Transformer(
  (embedding): Embedding(137, 128)
  (encoder_layer): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
    )
    (linear1): Linear(in_features=128, out_features=256, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=256, out_features=128, bias=True)
    (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
  )
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-6): 7 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=256, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=256, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (output_layer): Linear(in_features=128, out_features=75, bias=True)
)
Number of parameters: 1087051
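For reference, the printed structure can be reproduced with standard PyTorch modules roughly as below. The number of attention heads is not visible in the printout, so nhead=8 is an assumption (any divisor of 128 prints identically). Keeping the template encoder_layer as a module attribute, as the printout suggests, means its weights are counted as well, which would be consistent with the reported 1,087,051 parameters.

```python
import torch
from torch import nn

class Transformer(nn.Module):
    """Character-level encoder matching the module structure printed above."""
    def __init__(self, vocab_size=137, d_model=128, nhead=8,  # nhead is assumed
                 dim_feedforward=256, num_layers=7, num_classes=75, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # The template layer is kept as an attribute and then copied 7 times
        # inside the TransformerEncoder, mirroring the printout above.
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.output_layer = nn.Linear(d_model, num_classes)

    def forward(self, x):                    # x: [batch, seq_len] of character ids
        h = self.encoder(self.embedding(x))  # [batch, seq_len, d_model]
        return self.output_layer(h)          # [batch, seq_len, num_classes]

print(sum(p.numel() for p in Transformer().parameters()))  # 1087051
```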