RobotsMali-AI
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 107 additions & 76 deletions
diff --git a/docs/README.old.md b/docs/README.old.md
Lines changed: 105 additions & 0 deletions
diff --git a/docs/requirements.txt b/docs/requirements.txt
Lines changed: 51 additions & 0 deletions
diff --git a/jeli/LICENSE b/jeli/LICENSE
Lines changed: 19 additions & 0 deletions
diff --git a/jeli/TODO.md b/jeli/TODO.md
Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,4 @@
 *.wav
 *.WAV
 *.py
+cleanup.sh
@@ -1,87 +1,106 @@
-# Jeli ASR & Corpus
+# Jeli ASR & Dataset
 
 ## What is Jeli-ASR
-Jeli-ASR is a multidimentional package that was developed with the aim to empower the usage of the Bambara Language. Starting in an initiative to the develop the Bambara Language, and its cultural values. The package is consisted of an ASR model under ongoing development, and a mini corpus of griots narration in [audio](https://zenodo.org/record/6997806), its transcription in eaf which is [ELAN format](https://archive.mpi.nl/tla/elan/download), and a package tool that can yield the transcription in raw text format or json.
+Jeli-ASR is a multidimensional open-source package consisting of a dataset and an ASR model. The dataset consists of transcriptions of 30 hours of griot stories and narrations, together with their translations. The corresponding [audio](https://zenodo.org/record/7094702) is hosted on Zenodo. The ASR model is an ongoing effort to build an automatic speech recognition model for Bambara.
+
+## Dataset
+The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset in the [Data-Card](./docs/DataCard.pdf). It contains about 28k utterances and clips (and counting). There are two speech sub-datasets: Griots Narrations and Street Interviews.
+
+### Griots Narrations
+These are recordings of 30 griots (23 male, 7 female) talking about various subjects in a controlled environment. *The subjects are culture-oriented*.
+
+### Street Interviews
+Alongside the griots' narrations, a smaller sample of individuals was interviewed about the importance of Bambara in technology. These interviews were conducted on the street, with background noise.
+
+**N.B.**: Not all of these recordings have been transcribed.
 
 ## ASR - Model
-[TODO]
+### Kaldi
+### Wav2Vec
+### Espnet
 
-## Corpus
-The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset on the [Data-Card](). Refer to the following list of recordings and the general meta information about the recordings:
+<!-- ### Keras Transformer -->
 
-### Griots Narrations
+## jelipkg toolkit (Jeli => Griot in Bambara)
+<code>jelipkg</code> is a sub-package that serves as an entry point to the corpus. It is a Python package that lets you browse and download the corpus at your convenience. The textual data can be downloaded in either raw text or JSON format, and the audio either in batch or as individual clips (utterances).
 
-| Recording ID | Theme | Dialect | Utterance Count | Spkr. Gender |
-|:------------:|:-----:|:-------:|:---------------:|:------------:|
-| griots_r1 | L'histoire d'une fille | Bamako | 980 | M |
-| griots_r2 | L'histoire d'un grand marabo | Ségou | 1030 | M |
-| griots_r3 | Les forgérons | Bamako | 805 | M |
-| griots_r4 | Les Noms Authentiques | Bamako | 764 | M |
-| griots_r5 | Les Coulibaly | Bamako | 981 | M |
-| griots_r6 | Les Diarra | Ségou | 1122 | M |
-| griots_r7 | L'histoire du roi Razaly | Bamako | 1407 | M |
-| griots_r8 | L'histoire des fils d'Abraham | Bamako | 1126 | F |
-| griots_r9 | Les ''Niamala'' hommes de caste |  Bamako | 821 | M |
-| griots_r10 | L'éducaion d'hier et d'aujourd'hui | Bamako | 1078 | F |
-| griots_r11 | Garba Mama | Bamako | 970 | M |
-| griots_r12 | La Bataille de Kaana | Bamako | 997 | M |
-| griots_r13 | Diokala | Bamako | 964 | M |
-| griots_r14 | Nos ancetres | Malinké Siby | 1136 | M |
-| griots_r15 | L'histoire d'El Hadj Oumar Tall | Bamako | 844 | M |
-| griots_r16 | Les Massassi du Karta 'Bɔ' | Bamako | 941 | M |
-| griots_r17 | Histoire de Samory |  Malinké kangaba | 773 | M |
-| griots_r18 | Le griot | Malinké de kangaba | 809 | M |
-| griots_r19 | La vie d'avant en milieu Bamanan | Bamako | 611 | F |
-| griots_r20 | Les Maabo | Ségou | 1102 | M |
-| griots_r21 | L'histoire de Djonkoloni | Bamako | 859 | M |
-| griots_r22 | Various | Malinké de Siby | 926 | F |
-| griots_r23 | L'histoire de Bɔ | Ségou | 1319 | M |
-| griots_r24 | L'éducaion d'hier et d'aujourd'hui | Bamako | 942 | F |
-| griots_r25 | L'hisoire de la jeune fille Niamakolo | Bamako | 828 | F |
-| griots_r26 | Hier et aujourd'hui | Bamako | 1128 | M |
-| griots_r27 | Les Mianka | Bamako | 1166 | M |
-| griots_r28 | Le mariage d'hier et d'aujourd'hui | Bamako | 810 | F |
-| griots_r29 | L' histoire de Dabo | Bamako | 774 | M |
-| griots_r30 | Les valeurs du Mali | Bamako | 968 | M |
-|**TOTAL**||| ***28971*** ||
-||
+### Installation
+- Install a revised version of [DABA](https://github.com/maslinych/daba)
 
-### Street Interviews
-Along side the griots' narrations, a smaller sample of individuals were interviewd about the importance of bambara in the technology.
-
-| Recording ID | Utt. Count | Spkr. Gender | Status |
-|:------------:|:-------:|:------------:|:------:|
-| intrvw_r1 | 55 | F | V |
-| intrvw_r2 | X | X | X |
-| intrvw_r3 | 24 | M | V |
-| intrvw_r4 | 25 | M | V |
-| intrvw_r5 | 31 | M | V |
-| intrvw_r6 | 20 | M | V |
-| intrvw_r7 | X | X | X |
-| intrvw_r8 | X | X | X |
-| intrvw_r9 | X | X | X |
-| intrvw_r10 | X | X | X |
-| intrvw_r11 | X | X | X |
-| intrvw_r12 | X | X | X |
-| intrvw_r13 | 25 | M | V |
-| intrvw_r14 | X | X | X |
-| intrvw_r15 | X | X | X |
-| intrvw_r16 | X | X | X |
-| intrvw_r17 | X | X | X |
-| intrvw_r18 | X | X | X |
-| intrvw_r19 | X | X | X |
-| intrvw_r20 | 17 | M | V |
-| intrvw_r21 | 137 | M | V |
-| intrvw_r22 | 142 | F | V |
-| **TOTAL** | ***476*** | - | - |
-||
-
-### jelipkg toolkit
-<code>jelipkg</code> is sub-package that serves as an entry point to the corpus. It is a python package that allows you to browse, and download the corpus for your own convenience, you can download the textual data either in raw text format or json format.
-
-#### Installation
-#### Quickstart
-#### Documentation
+```bash
+$ pip install -U https://github.com/s7d11/daba/releases/download/v0.0.1-alpha/daba-0.9.2.tar.gz
+```
+
+- Install `jelipkg`
+
+```sh
+```
+
+### Quickstart
+
+- Launching the interactive shell
+
+```bash
+$ jelipkg
+```
+
+- Choose an option
+
+```
+Welcome to jelipkg v0.0.1
+
+Type browse, download, help, exit
+jeli> browse
+```
+
+- Select a recording
+
+```
+jeli> Select a recording ID:
+    >   griots_r01
+        griots_r02
+        griots_r03
+        griots_r04
+        griots_r05
+        ...
+```
+
+- Choose a browsing option
+
+```
+jeli> Choose browsing option:
+    >   Recording overview
+        Detailed view of recording
+```
+
+- Output
+
+```
+Recording                               griots_r1
+Theme:                     L'histoire d'une fille
+Speaker:                                        M
+Utterances:                                   982
+Duration:                                  3277.0
+Tokens:                                     12289
+Types:                                       1080
+jeli> Download griots_r1? (y/N)
+```
+
+### Documentation
+Type one of the following commands:  
+- **browse** -> Interactively browse the list of recordings  
+- **download** -> Directly download a recording from the dataset  
+- **help** -> Display the help message  
+- **exit** -> Exit the `jelipkg` console  
+
+### Bugs
+- [Bugs](https://github.com/robotsmali-ai/jeli-asr/issues)
+
+### License
+- [MIT License](./jeli/LICENSE)
+
+### Future features
+- Direct CLI (one command) capability
+- Multi-recording download
 
 **IMPORTANT**: Because of the size of the dataset, it is recommended to download one recording/interview at a time if your network is unreliable.
 
@@ -91,6 +110,18 @@ Along side the griots' narrations, a smaller sample of individuals were intervie
 **Inquiries & Collaboration**: `research <at> robotsmali.org`
 
 ## Reference
+```
+@misc{griotsdataset2022,
+  author                = {Sebastien Diarra and Michael Leventhal and Mouktar Traore and Alou Dembele},
+  title                 = {RobotsMali Griots Recording},
+  howpublished          = {\url{https://github.com/robotsmali-ai/jeli-asr/}},
+  year                  = 2022
+}
+```
+
+## Known Issues
+- griots_r24_1: Bambara missing
+- **Some recordings need FRENCH adjustment**
 
 ## License
 This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
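
Review note on the `jelipkg` description and the recording overview in this README: the following is a minimal sketch, not the package's actual code, of how utterances could be exported as raw text or JSON, and how token and type counts like those in the overview might be derived. The `Utterance` class, its field names, the function names, the whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """One transcribed clip (hypothetical structure, not jelipkg's own)."""
    start_ms: int
    end_ms: int
    text: str


def to_raw_text(utterances: List[Utterance]) -> str:
    """Raw text export: one utterance per line."""
    return "\n".join(u.text for u in utterances)


def to_json(utterances: List[Utterance]) -> str:
    """JSON export: a list of objects that keeps the timing information."""
    # ensure_ascii=False keeps Bambara letters such as 'ɔ' readable.
    return json.dumps([asdict(u) for u in utterances], ensure_ascii=False, indent=2)


def token_type_counts(utterances: List[Utterance]) -> Tuple[int, int]:
    """Count tokens (all words) and types (distinct words).

    Assumes simple whitespace tokenization and case folding.
    """
    tokens = [tok.lower() for u in utterances for tok in u.text.split()]
    return len(tokens), len(set(tokens))


# Illustrative placeholder data, not taken from the corpus.
sample = [
    Utterance(0, 1200, "i ni ce"),
    Utterance(1200, 2500, "i ni sogoma"),
]
print(to_raw_text(sample))
print(to_json(sample))
print(token_type_counts(sample))
```

Keeping the export functions separate from the counting function means either output format can be produced without recomputing statistics.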
@@ -0,0 +1,105 @@
+# Jeli ASR & Corpus
+
+## What is Jeli-ASR
+Jeli-ASR is a multidimensional package developed to empower the use of the Bambara language. It started as an initiative to develop the Bambara language and its cultural values. The package consists of an ASR model under ongoing development, a mini corpus of griot narrations in [audio](https://zenodo.org/record/6997806), their transcriptions in ***eaf***, the [ELAN format](https://archive.mpi.nl/tla/elan/download), and a tool to download and extract the dataset.
+
+## ASR - Model
+[TODO]
+
+## Corpus
+The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset on the [Data-Card (coming)](). Refer to the following list of recordings and the general meta information about the recordings:
+
+### Griots Narrations
+
+| Recording ID | Theme | Dialect | Utterance Count | Spkr. Gender |
+|:------------:|:-----:|:-------:|:---------------:|:------------:|
+| griots_r1 | L'histoire d'une fille | Bamako | 980 | M |
+| griots_r2 | L'histoire d'un grand marabout | Ségou | 1030 | M |
+| griots_r3 | Les forgerons | Bamako | 805 | M |
+| griots_r4 | Les Noms Authentiques | Bamako | 764 | M |
+| griots_r5 | Les Coulibaly | Bamako | 981 | M |
+| griots_r6 | Les Diarra | Ségou | 1122 | M |
+| griots_r7 | L'histoire du roi Razaly | Bamako | 1407 | M |
+| griots_r8 | L'histoire des fils d'Abraham | Bamako | 1126 | F |
+| griots_r9 | Les ''Niamala'' hommes de caste |  Bamako | 821 | M |
+| griots_r10 | L'éducation d'hier et d'aujourd'hui | Bamako | 1078 | F |
+| griots_r11 | Garba Mama | Bamako | 970 | M |
+| griots_r12 | La Bataille de Kaana | Bamako | 997 | M |
+| griots_r13 | Diokala | Bamako | 964 | M |
+| griots_r14 | Nos ancetres | Malinké Siby | 1136 | M |
+| griots_r15 | L'histoire d'El Hadj Oumar Tall | Bamako | 844 | M |
+| griots_r16 | Les Massassi du Karta 'Bɔ' | Bamako | 941 | M |
+| griots_r17 | Histoire de Samory |  Malinké kangaba | 773 | M |
+| griots_r18 | Le griot | Malinké de kangaba | 809 | M |
+| griots_r19 | La vie d'avant en milieu Bamanan | Bamako | 611 | F |
+| griots_r20 | Les Maabo | Ségou | 1102 | M |
+| griots_r21 | L'histoire de Djonkoloni | Bamako | 859 | M |
+| griots_r22 | Various | Malinké de Siby | 926 | F |
+| griots_r23 | L'histoire de Bɔ | Ségou | 1319 | M |
+| griots_r24 | L'éducation d'hier et d'aujourd'hui | Bamako | 942 | F |
+| griots_r25 | L'histoire de la jeune fille Niamakolo | Bamako | 828 | F |
+| griots_r26 | Hier et aujourd'hui | Bamako | 1128 | M |
+| griots_r27 | Les Mianka | Bamako | 1166 | M |
+| griots_r28 | Le mariage d'hier et d'aujourd'hui | Bamako | 810 | F |
+| griots_r29 | L' histoire de Dabo | Bamako | 774 | M |
+| griots_r30 | Les valeurs du Mali | Bamako | 968 | M |
+|**TOTAL**||| ***28971*** ||
+||
+
+### Street Interviews
+Alongside the griots' narrations, a smaller sample of individuals was interviewed about the importance of Bambara in technology.
+
+**N.B.**: Not all of these recordings have been transcribed.
+
+| Recording ID | Utt. Count | Spkr. Gender | Status |
+|:------------:|:-------:|:------------:|:------:|
+| intrvw_r1 | 55 | F | V |
+| intrvw_r2 | X | X | X |
+| intrvw_r3 | 24 | M | V |
+| intrvw_r4 | 25 | M | V |
+| intrvw_r5 | 31 | M | V |
+| intrvw_r6 | 20 | M | V |
+| intrvw_r7 | X | X | X |
+| intrvw_r8 | X | X | X |
+| intrvw_r9 | X | X | X |
+| intrvw_r10 | X | F | V |
+| intrvw_r11 | X | M | V |
+| intrvw_r12 | X | M | V |
+| intrvw_r13 | 25 | M | V |
+| intrvw_r14 | X | X | X |
+| intrvw_r15 | X | X | X |
+| intrvw_r16 | X | X | X |
+| intrvw_r17 | X | X | X |
+| intrvw_r18 | X | X | X |
+| intrvw_r19 | X | X | X |
+| intrvw_r20 | 17 | M | V |
+| intrvw_r21 | 137 | M | V |
+| intrvw_r22 | 142 | F | V |
+| **TOTAL** | ***476*** | - | - |
+||
+
+### jelipkg toolkit
+<code>jelipkg</code> is a sub-package that serves as an entry point to the corpus. It is a Python package that lets you browse and download the corpus at your convenience. The textual data can be downloaded in either raw text or JSON format.
+
+#### Installation
+
+```bash
+$ pip install -U https://github.com/s7d11/daba/releases/download/v0.0.1-alpha/daba-0.9.2.tar.gz
+```
+
+#### Quickstart
+#### Documentation
+
+**IMPORTANT**: Because of the size of the dataset, it is recommended to download one recording/interview at a time if your network is unreliable.
+
+## Contact & People
+**Principal Investigator**: Michael Leventhal, `mleventhal <at> robotsmali.org`  
+**Manager**: Sebastien Diarra, `sdiarra <at> robotsmali.org`  
+**Inquiries & Collaboration**: `research <at> robotsmali.org`
+
+## Reference
+
+## Errors & Bugs
+
+## License
+This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
@@ -0,0 +1,51 @@
+args==0.1.0
+asgiref==3.5.2
+asttokens==2.0.8
+backcall==0.2.0
+certifi==2022.6.15
+charset-normalizer==2.1.0
+cycler==0.11.0
+decorator==5.1.1
+et-xmlfile==1.1.0
+executing==0.10.0
+fonttools==4.36.0
+funcparserlib==1.0.0
+idna==3.3
+inquirerpy==0.3.4
+intervaltree==3.1.0
+ipython==8.4.0
+jedi==0.18.1
+kiwisolver==1.4.4
+matplotlib==3.5.3
+matplotlib-inline==0.1.3
+numpy==1.23.2
+openpyxl==3.0.10
+packaging==21.3
+pandas==1.4.3
+parso==0.8.3
+pexpect==4.8.0
+pfzy==0.3.4
+pickleshare==0.7.5
+Pillow==9.2.0
+prompt-toolkit==3.0.30
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pydub==0.25.1
+Pygments==2.13.0
+pympi-ling==1.70.2
+pyparsing==3.0.9
+python-dateutil==2.8.2
+PyTrie==0.4.0
+pytz==2022.2.1
+regex==2022.9.13
+requests==2.28.1
+six==1.16.0
+sortedcontainers==2.4.0
+sqlparse==0.4.2
+stack-data==0.4.0
+traitlets==5.3.0
+types-requests==2.28.10
+types-urllib3==1.26.24
+urllib3==1.26.11
+wcwidth==0.2.5
+wget==3.2
@@ -0,0 +1,19 @@
+Copyright (c) 2022 RobotsMali
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,9 @@
+# TODO
+
+## CLI Integration
+Options:
+- Some statistical information on the text
+- Display Recording meta (jelipkg -i(--info) <id>)
+   - detail flag -> displays breakdown per file
+- Display recording list (jelipkg -l(--list))
+- Get Recording_ID (jelipkg get <id> -o(--output=eaf|json|txt) --audio)
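
Review note: the options listed in this TODO could be wired up with `argparse` roughly as follows. This is only a sketch under stated assumptions. The flag names `-i/--info`, `-l/--list`, `get`, `-o/--output`, and `--audio` come from the TODO; the `build_parser` name, the `--detail` spelling, and the `eaf` default for `--output` are assumptions, and nothing here reflects the actual `jelipkg` implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a one-command CLI mirroring the options in the TODO (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="jelipkg",
        description="Browse and download recordings from the Griots corpus.",
    )
    parser.add_argument("-l", "--list", action="store_true",
                        help="display the list of recordings")
    parser.add_argument("-i", "--info", metavar="ID",
                        help="display metadata for one recording")
    # Assumed spelling for the TODO's "detail flag".
    parser.add_argument("--detail", action="store_true",
                        help="with --info, show a breakdown per file")

    subcommands = parser.add_subparsers(dest="command")
    get = subcommands.add_parser("get", help="download a recording")
    get.add_argument("recording_id", help="for example griots_r1")
    # The default value is an assumption; the TODO only lists the choices.
    get.add_argument("-o", "--output", choices=["eaf", "json", "txt"],
                     default="eaf", help="transcription format")
    get.add_argument("--audio", action="store_true",
                     help="also download the audio")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["get", "griots_r1", "-o", "json", "--audio"])
print(args)
```

Restricting `--output` with `choices` lets `argparse` reject unsupported formats before any download starts.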