[Corpus] - llm-jp-corpus v4 Japanese deduplication #118

Open
YumaTsuta opened this issue Feb 3, 2025 · 0 comments
Labels
pretrain Experiment of model pretrain

Comments

@YumaTsuta
Collaborator

Overview

Deduplicate the Japanese datasets in llm-jp-corpus v4.

Details

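Each Japanese dataset in llm-jp-corpus v4 is first deduplicated individually; deduplication is then applied again across all of the datasets combined.
Corpus overview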
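
The two-stage procedure described above (deduplicate within each dataset, then across all of them) can be sketched as follows. This is only an illustration under stated assumptions: the issue does not say which deduplication method is used, so exact matching on a hash of NFKC-normalized text stands in for it here, and the names `text_key`, `dedup`, and `dedup_corpus` are hypothetical rather than taken from the project's code.

```python
import hashlib
import unicodedata
from typing import Dict, Iterable, List, Optional, Set


def text_key(text: str) -> str:
    """Hash of the NFKC-normalized, whitespace-collapsed text.

    Assumption: two documents count as duplicates when this key matches.
    """
    normalized = " ".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup(docs: Iterable[str], seen: Optional[Set[str]] = None) -> List[str]:
    """Keep the first occurrence of each document; `seen` may be shared across calls."""
    if seen is None:
        seen = set()
    kept = []
    for doc in docs:
        key = text_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def dedup_corpus(datasets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Stage 1: deduplicate within each dataset. Stage 2: deduplicate across datasets."""
    per_dataset = {name: dedup(docs) for name, docs in datasets.items()}
    # A shared `seen` set means a document is kept only in the first dataset
    # (in dict order) that contains it.
    seen: Set[str] = set()
    return {name: dedup(docs, seen) for name, docs in per_dataset.items()}
```

In this sketch the order of the datasets decides which copy survives the cross-dataset stage, so a real run would need to choose that order deliberately. With exact hashing the first stage mainly shrinks the input to the second stage; it can also run independently per dataset, for example one dataset per node.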

Resources

  • Compute
    • Cluster: Sakura (Ishikari)
    • Node type: cpu
    • Number of nodes: 9
  • Code
  • Input data:
    • {name}: {physical path}
  • Output data:
    • Destination: {cluster}:/data/experiments/{number}
    • Data breakdown:
      • {name}: xxx TB (including buffer capacity)
  • W&B logs:
  • Start date: YYYY-MM-DD
  • Planned end date: YYYY-MM-DD (including buffer period)
@YumaTsuta YumaTsuta added the pretrain Experiment of model pretrain label Feb 3, 2025