[Corpus] - CommonCrawl data cleaning for the LLM-jp corpus #101

Open
ununtrium opened this issue Dec 19, 2024 · 0 comments
Labels
pretrain Experiment of model pretrain

Comments

@ununtrium

Overview

The requested resources will be used to clean CommonCrawl data for the LLM-jp corpus.

Details

To enlarge the LLM-jp pretraining corpus, the latest CommonCrawl dumps will be processed with Uzushio filtering.

Resources

  • Compute
    • Cluster: Sakura
    • Node type: CPU
    • Number of nodes: 1-2
  • Output data:
    • Destination: llm-jp:/data/experiments/0101_llm-jp-corpus-filtering/
  • Start date: 2024-12-19
  • Planned end date: N/A
@ununtrium ununtrium added the pretrain Experiment of model pretrain label Dec 19, 2024
@ununtrium ununtrium changed the title from "[Pretraining] - Please enter a title" to "[Corpus] - CommonCrawl data cleaning for the LLM-jp corpus" Dec 19, 2024
@shuheikurita shuheikurita self-assigned this Dec 19, 2024