From 6daf7b204355e2f6f49b8949bf8795ccefdeb703 Mon Sep 17 00:00:00 2001 From: Kaito Sugimoto Date: Tue, 31 Dec 2024 15:40:13 +0900 Subject: [PATCH] add Llama 3.1 Swallow 70B v0.3 & UBKE-LUKE (#420) --- README.md | 15 ++++++++------- en/README.md | 15 ++++++++------- fr/README.md | 15 ++++++++------- 3 files changed, 24 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index d58db7f..e45005d 100644 --- a/README.md +++ b/README.md @@ -95,7 +95,7 @@ | | ベースのLLM | 学習テキスト | 開発元 | ライセンス / 利用規約 | |:---|:---:|:---:|:---:|:---:| -| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.ja.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1)) | Llama 3.1 (**70b**) | 事前学習: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallowプロジェクト | Llama 3.1 Community License
(Instructモデルは Gemma Terms of Use も適用) | +| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.ja.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1), [70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3)) | Llama 3.1 (**70b**) | 事前学習: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallowプロジェクト | Llama 3.1 Community License
(Instructモデルは Gemma Terms of Use も適用) | | [cyberagent/Llama-3.1-70B-Japanese-Instruct-2407](https://huggingface.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407) | Llama 3.1 (**70b**) | 不明 | サイバーエージェント | Llama 3.1 Community License | | [Llama 3 Swallow 70B](https://swallow-llm.github.io/llama3-swallow.ja.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1)) | Llama 3 (**70b**) | 事前学習: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1 [^17] | Swallowプロジェクト | Llama 3 Community License | | [turing-motors/Llama-3-heron-brain-70B-v0.3](https://huggingface.co/turing-motors/Llama-3-heron-brain-70B-v0.3) | Llama 3 (**70b**) | Llama 3 Swallow 70B に対して追加学習(詳細不明) | Turing | Llama 3 Community License | @@ -263,19 +263,20 @@ | | ドメイン | アーキテクチャ | 学習テキスト | 開発元 | ライセンス | HuggingFace ですぐ使える? | |:---|:---:|:---:|:---:|:---:|:---:|:---:| -| [日本語ニュースBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | ビジネス | BERT (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | CC BY 4.0 | △ | -| [日本語ニュースXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | ビジネス | XLNet (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | ? | ※ 非公式の HuggingFace 向けに変換されたモデルが[公開されている](https://huggingface.co/hajime9652/xlnet-japanese) | -| [日本語ニュースALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | ビジネス | ALBERT (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | ? | △ | | [日本語ブログELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | 口語 | ELECTRA (small) | 日本語ブログコーパス(3億5,400万文) | 北見工大 桝井・プタシンスキ研 | CC BY-SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) | | [日本語話し言葉BERT](https://tech.retrieva.jp/entry/2021/04/01/114943) | 話し言葉 | BERT (base) | 東北大BERTに対して日本語話し言葉コーパス(CSJ)を用いて追加学習
(DAPTモデルでは国会議事録データも使用) | レトリバ | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) | +| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | 学術 | RoBERTa (base) | CiNii の日本語論文 (約628万文) | 愛媛大 人工知能研究室 | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) | +| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | 政治 | BERT (base) | Wikipedia, 国会会議録, 地方議会会議録 | 地方議会会議録コーパスプロジェクト | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] | +| [UBKE-LUKE](https://tech.uzabase.com/entry/2024/12/24/173942) | 経済 | LUKE (base) | 日本語 Wikipedia, 有価証券報告書, 経済ニュース記事 | ユーザベース | CC BY-NC | [◯](https://huggingface.co/uzabase/UBKE-LUKE) | | [日本語金融BERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | 金融 | BERT (small, base) [^9] | 日本語 Wikipedia
+ 日本語金融コーパス (約2,700万文 (5.2GB)) | 東大 和泉研 | CC BY-SA 4.0 |◯ ([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) | | [日本語金融ELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | 金融 | ELECTRA (small) | 日本語 Wikipedia (約2,000万文 (2.9GB))
+ 日本語金融コーパス (約2,700万文 (5.2GB)) | 東大 和泉研 | CC BY-SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) | +| [日本語ニュースBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | ビジネス | BERT (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | CC BY 4.0 | △ | +| [日本語ニュースXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | ビジネス | XLNet (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | ? | ※ 非公式の HuggingFace 向けに変換されたモデルが[公開されている](https://huggingface.co/hajime9652/xlnet-japanese) | +| [日本語ニュースALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | ビジネス | ALBERT (base) | 日本語ビジネスニュース記事(300万記事) | ストックマーク | ? | △ | +| [みんぱくBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | 文化財 | BERT (base) | 東北大BERTに対して国立民族学博物館の文化財データで追加学習 | 兵庫県立大学 大島研 | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) | | [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | 医療 | BERT (base) | 日本語診療記録(約1億2,000万行) | 東大病院
医療AI開発学講座 | CC BY-NC-SA 4.0 | △ | | [medBERTjp](https://github.com/ou-medinfo/medbertjp) | 医療 | BERT (base) | 日本語 Wikipedia
+ 日本語医療コーパス(『今日の診療プレミアム』Web版) | 阪大病院
医療情報学研究室 | CC BY-NC-SA 4.0 | △ | | [JMedRoBERTa](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf) | 医療 | RoBERTa (base) | 日本語医学論文 (約1,100万文 (1.8GB)) | NII 相澤研 | CC BY-NC-SA 4.0 | ◯ ([万病WordPiece](https://huggingface.co/alabnii/jmedroberta-base-manbyo-wordpiece), [SentencePiece](https://huggingface.co/alabnii/jmedroberta-base-sentencepiece)) [^10] | -| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | 学術 | RoBERTa (base) | CiNii の日本語論文 (約628万文) | 愛媛大 人工知能研究室 | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) | -| [みんぱくBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | 文化財 | BERT (base) | 東北大BERTに対して国立民族学博物館の文化財データで追加学習 | 兵庫県立大学 大島研 | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) | -| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | 政治 | BERT (base) | Wikipedia, 国会会議録, 地方議会会議録 | 地方議会会議録コーパスプロジェクト | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] | ## 埋め込み (Embeddings) 作成に特化したモデル [^21] diff --git a/en/README.md b/en/README.md index 868d40e..789d551 100644 --- a/en/README.md +++ b/en/README.md @@ -94,7 +94,7 @@ Please point out any errors on the [issues page](https://github.com/llm-jp/aweso | | Base Model | Training Data | Developer | License / Terms of Use | |:---|:---:|:---:|:---:|:---:| -| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1)) | Llama 3.1 (**70b**) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model) | +| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1), [70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3)) | Llama 3.1 (**70b**) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model) | | [cyberagent/Llama-3.1-70B-Japanese-Instruct-2407](https://huggingface.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407) | Llama 3.1 (**70b**) | undisclosed | CyberAgent | Llama 3.1 Community License | | [Llama 3 Swallow 70B](https://swallow-llm.github.io/llama3-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1)) | Llama 3 (**70b**) | Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1 [^17] | Swallow Project | Llama 3 Community License | | [turing-motors/Llama-3-heron-brain-70B-v0.3](https://huggingface.co/turing-motors/Llama-3-heron-brain-70B-v0.3) | Llama 3 (**70b**) | additionally trained on Llama 3 Swallow 70B (details undisclosed) | Turing | Llama 3 Community License | @@ -261,19 +261,20 @@ Please point out any errors on the [issues page](https://github.com/llm-jp/aweso | | Domain | Architecture | Training Data | Developer | License | HuggingFace? | |:---|:---:|:---:|:---:|:---:|:---:|:---:| -| [JapaneseNewsBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | Business | BERT (base) | Japanese Business Articles (3M articles) | Stockmark | CC BY 4.0 | △ | -| [JapaneseNewsXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | Business | XLNet (base) | Japanese Business Articles (3M articles) | Stockmark | ? | [◯](https://huggingface.co/hajime9652/xlnet-japanese)
※ Unofficial release | -| [JapaneseNewsALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | Business | ALBERT (base) | Japanese Business Articles (3M articles) | Stockmark | ? | △ | | [JapaneseBlogELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | Colloquial language | ELECTRA (small) | Japanese Blog Corpus (354M sentences) | Kitami Institute of Technology Masui-Ptaszynski Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) | | [JapaneseSpokenLanguageBERT](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) | Spoken language | BERT (base) | Additional training for TohokuUniversityBERT using Corpus of Spontaneous Japanese (CSJ)
(The DAPT model also uses minutes of the National Diet) | Retrieva | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) |
+| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Ehime University AI Lab | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) |
+| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | Politics | BERT (base) | Wikipedia, Minutes of the National Diet, Minutes of the Local Assembly | Japanese Local Assembly Minutes Corpus Project | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] |
+| [UBKE-LUKE](https://tech.uzabase.com/entry/2024/12/24/173942) | Economics | LUKE (base) | Japanese Wikipedia, Securities Reports, Economic News Articles | Uzabase | CC BY-NC | [◯](https://huggingface.co/uzabase/UBKE-LUKE) |
| [JapaneseFinancialBERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | Finance | BERT (small, base)[^9] | Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 |◯<br>
([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) | | [JapaneseFinancialELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | Finance | ELECTRA (small) | Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) | +| [JapaneseNewsBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | Business | BERT (base) | Japanese Business Articles (3M articles) | Stockmark | CC BY 4.0 | △ | +| [JapaneseNewsXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | Business | XLNet (base) | Japanese Business Articles (3M articles) | Stockmark | ? | [◯](https://huggingface.co/hajime9652/xlnet-japanese)
※ Unofficial release |
+| [JapaneseNewsALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | Business | ALBERT (base) | Japanese Business Articles (3M articles) | Stockmark | ? | △ |
+| [MinpakuBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | Cultural Heritage | BERT (base) | Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERT | University of Hyogo Ohshima Lab | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) |
| [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | Medicine | BERT (base) | Japanese Medical Records (120M lines) | University of Tokyo Hospital<br>
Medical AI Development Course | CC BY‑NC‑SA 4.0 | △ | | [medBERTjp](https://github.com/ou-medinfo/medbertjp) | Medicine | BERT (base) | Japanese Wikipedia, Japanese Medical Corpus ("今日の診療プレミアム/Today's Care Premium" Web Version) | Osaka University Hospital
Medical Informatics Lab | CC BY‑NC‑SA 4.0 | △ | | [JMedRoBERTa](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf) | Medicine | RoBERTa (base) | Japanese Medical Papers (11M sentences/1.8GB) | NII Aizawa Lab | CC BY‑NC‑SA 4.0 | ◯
([ManbyoWordPiece](https://huggingface.co/alabnii/jmedroberta-base-manbyo-wordpiece), [SentencePiece](https://huggingface.co/alabnii/jmedroberta-base-sentencepiece))[^10] | -| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Ehime University AI Lab | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) | -| [MinpakuBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | Cultural Heritage | BERT (base) | Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERT | University of Hyogo Ohshima Lab | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) | -| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | Politics | BERT (base) | Wikipedia, Minutes of the National Diet, Minutes of the Local Assembly | Japanese Local Assembly Minutes Corpus Project | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] | ## Sentence and Document Embeddings [^21] diff --git a/fr/README.md b/fr/README.md index efaac1f..583c8cc 100644 --- a/fr/README.md +++ b/fr/README.md @@ -94,7 +94,7 @@ N'hésitez pas à signaler les erreurs sur la page [issues](https://github.com/l | | Base du Model | Données d'entraînement | Développeur | Licence / Conditions d'utilisation | |:---|:---:|:---:|:---:|:---:| -| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1)) | Llama 3.1 (**70b**) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model) | +| [Llama 3.1 Swallow 70B](https://swallow-llm.github.io/llama3.1-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1), [70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3)) | Llama 3.1 (**70b**) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model) | | [cyberagent/Llama-3.1-70B-Japanese-Instruct-2407](https://huggingface.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407) | Llama 3.1 (**70b**) | undisclosed | CyberAgent | Llama 3.1 Community License | | [Llama 3 Swallow 70B](https://swallow-llm.github.io/llama3-swallow.en.html)
([70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-v0.1), [70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1)) | Llama 3 (**70b**) | Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1 [^17] | Swallow Project | Llama 3 Community License | | [turing-motors/Llama-3-heron-brain-70B-v0.3](https://huggingface.co/turing-motors/Llama-3-heron-brain-70B-v0.3) | Llama 3 (**70b**) | additionally trained on Llama 3 Swallow 70B (details undisclosed) | Turing | Llama 3 Community License | @@ -261,19 +261,20 @@ N'hésitez pas à signaler les erreurs sur la page [issues](https://github.com/l | | Domaine | Architecture | Données d'entraînement | Développeur | Licence | HuggingFace? | |:---|:---:|:---:|:---:|:---:|:---:|:---:| -| [JapaneseNewsBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | Affaires | BERT (base) | Articles sur l'économie en japonais(3M articles) | Stockmark | CC BY 4.0 | △ | -| [JapaneseNewsXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | Affaires | XLNet (base) | Articles sur l'économie en japonais (3M articles) | Stockmark | ? | [◯](https://huggingface.co/hajime9652/xlnet-japanese)
※ Version non officielle | -| [JapaneseNewsALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | Affaires | ALBERT (base) | Articles sur l'économie en japonais (3M articles) | Stockmark | ? | △ | | [JapaneseBlogELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | Langue familière | ELECTRA (small) | Corpus de blogs en japonais (354M sentences) | Université de technologie de Kitami - Laboratoire de Masui-Ptaszynski | CC BY‑SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) | | [JapaneseSpokenLanguageBERT](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) | Langue parlée | BERT (base) | Formation supplémentaire pour TohokuUniversityBERT en utilisant le Corpus of Spontaneous Japanese (CSJ)
(Dans le modèle DAPT, les comptes rendus de la Diète nationale sont également utilisés) | Retrieva | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) |
+| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Université d'Ehime Laboratoire IA | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) |
+| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | Politique | BERT (base) | Wikipédia, Procès-verbaux de la Diète nationale, Procès-verbaux de l'Assemblée locale | Projet de Corpus des Procès-Verbaux des Assemblées Locales Japonaises | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] |
+| [UBKE-LUKE](https://tech.uzabase.com/entry/2024/12/24/173942) | Économie | LUKE (base) | Wikipédia en japonais, Rapports financiers, Articles de nouvelles économiques | Uzabase | CC BY-NC | [◯](https://huggingface.co/uzabase/UBKE-LUKE) |
| [JapaneseFinancialBERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | Finance | BERT (small, base)[^9] | Wikipédia en japonais, Japanese Financial Corpus (27M sentences/5.2GB) | Université de Tokyo Izumi Lab | CC BY‑SA 4.0 |◯<br>
([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) | | [JapaneseFinancialELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | Finance | ELECTRA (small) | Wikipédia en japonais (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | Université de Tokyo Izumi Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) | +| [JapaneseNewsBERT](https://qiita.com/mkt3/items/3c1278339ff1bcc0187f) | Affaires | BERT (base) | Articles sur l'économie en japonais(3M articles) | Stockmark | CC BY 4.0 | △ | +| [JapaneseNewsXLNet](https://qiita.com/mkt3/items/4d0ae36f3f212aee8002) | Affaires | XLNet (base) | Articles sur l'économie en japonais (3M articles) | Stockmark | ? | [◯](https://huggingface.co/hajime9652/xlnet-japanese)
※ Version non officielle | +| [JapaneseNewsALBERT](https://qiita.com/mkt3/items/b41dcf0185e5873f5f75) | Affaires | ALBERT (base) | Articles sur l'économie en japonais (3M articles) | Stockmark | ? | △ | +| [MinpakuBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | Patrimoine culturel | BERT (base) | Formation supplémentaire avec les données du patrimoine culturel du Musée national d'ethnologie sur Tohoku University BERT | Université de Hyogo Ohshima Lab | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) | | [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | Médecine | BERT (base) | Dossiers médicaux en japonais (120M lignes) | Université de Tokyo Hôpital
Cours de développement en IA pour la médecine | CC BY‑NC‑SA 4.0 | △ | | [medBERTjp](https://github.com/ou-medinfo/medbertjp) | Médecine | BERT (base) | Wikipédia en japonais, Corpus médical en japonais ("今日の診療プレミアム/Today's Care Premium" Web Version) | Université d'Osaka Hôpital
Laboratoire d'information médicale | CC BY‑NC‑SA 4.0 | △ | | [JMedRoBERTa](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf) | Médecine | RoBERTa (base) | Japanese Medical Papers (11M sentences/1.8GB) | NII Aizawa Lab | CC BY‑NC‑SA 4.0 | ◯
([ManbyoWordPiece](https://huggingface.co/alabnii/jmedroberta-base-manbyo-wordpiece), [SentencePiece](https://huggingface.co/alabnii/jmedroberta-base-sentencepiece))[^10] | -| [AcademicRoBERTa](https://github.com/EhimeNLP/AcademicRoBERTa) | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Université d'Ehime Laboratoire IA | Apache 2.0 | [◯](https://huggingface.co/EhimeNLP/AcademicRoBERTa) | -| [MinpakuBERT](https://proceedings-of-deim.github.io/DEIM2022/papers/F43-4.pdf) | Patrimoine culturel | BERT (base) | Formation supplémentaire avec les données du patrimoine culturel du Musée national d'ethnologie sur Tohoku University BERT | Université de Hyogo Ohshima Lab | MIT | ◯ ([minpaku-v1](https://huggingface.co/ohshimalab/bert-base-minpaku-v1), [minpaku-v3](https://huggingface.co/ohshimalab/bert-base-minpaku-v3), [minpaku-v3-no-additional-token](https://huggingface.co/ohshimalab/bert-base-minpaku-v3-no-additional-token)) | -| [local-politics-BERT](http://local-politics.jp/%e5%85%ac%e9%96%8b%e7%89%a9/local-politics-bert/) | Politique | BERT (base) | Procès-verbaux de la Diète nationale, Procès-verbaux de l'Assemblée locale | Projet de Corpus des Procès-Verbaux des Assemblées Locales Japonaises | CC BY-SA 4.0 | ◯ ([SC-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch), [SC-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch), [SC-2M-wiki](https://huggingface.co/local-politics-jp/bert-base-japanese-wikipedia-scratch-2m), [SC-2M-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-scratch-2m), [SC-2M-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-scratch-2m), [FP-min](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-further), [FP-minwiki](https://huggingface.co/local-politics-jp/bert-base-japanese-minutes-wikipedia-further)) [^18] | ## Plongement lexical par mots et par documents [^21]
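
All of the "◯" entries added in this patch load directly from the Hugging Face Hub. Below is a minimal sketch (not part of the patch itself) of how the newly listed Llama-3.1-Swallow-70B-Instruct-v0.3 checkpoint might be used with the standard `transformers` chat-template API; the prompt text and generation settings are illustrative assumptions, and the 70B weights require multi-GPU or offloaded inference.

```python
# Minimal usage sketch for one of the models added in this patch.
# Assumes: pip install torch transformers accelerate
# Note: a 70B model in bf16 needs on the order of 140GB of accelerator
# memory; device_map="auto" shards it across whatever devices exist.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # shard across available GPUs/CPU
)

# Build a chat prompt with the model's own template (illustrative content).
messages = [{"role": "user", "content": "日本の首都はどこですか?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The encoder models in the domain-specific tables (for example UBKE-LUKE or AcademicRoBERTa) should load analogously through `AutoModel.from_pretrained` / `AutoTokenizer.from_pretrained`, though some checkpoints may need a model-specific class or extra dependencies; check each model card.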