Commit 81e5bc3

Development branch with repo update
1 parent 22880fd commit 81e5bc3

220 files changed, +990168 -77 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 *.wav
 *.WAV
 *.py
+cleanup.sh

README.md

Lines changed: 107 additions & 76 deletions
@@ -1,87 +1,106 @@
-# Jeli ASR & Corpus
+# Jeli ASR & Dataset
 
 ## What is Jeli-ASR
-Jeli-ASR is a multidimentional package that was developed with the aim to empower the usage of the Bambara Language. Starting in an initiative to the develop the Bambara Language, and its cultural values. The package is consisted of an ASR model under ongoing development, and a mini corpus of griots narration in [audio](https://zenodo.org/record/6997806), its transcription in eaf which is [ELAN format](https://archive.mpi.nl/tla/elan/download), and a package tool that can yield the transcription in raw text format or json.
+This is a multidimensional open-source package consisting of a dataset and an ASR model. The dataset consists of the transcriptions of 30 hours of griots' stories and narrations, and their translations. The corresponding [audio](https://zenodo.org/record/7094702) is hosted on Zenodo. The ASR model is an ongoing attempt at an automatic speech recognition model for Bambara.
+
+## Dataset
+The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset in the [Data-Card](./docs/DataCard.pdf). It contains about 28k utterances and clips (and counting). There are two speech sub-datasets: Griots Narrations and Street Interviews.
+
+### Griots Narrations
+These are recordings of 30 griots (23 Males / 7 Females) talking about various subjects in a controlled environment. *The subjects are culture-oriented*.
+
+### Street Interviews
+Alongside the griots' narrations, a smaller sample of individuals was interviewed about the importance of Bambara in technology. These interviews were conducted on the street, with background noise.
+
+**N.B**: Not all of these audios have been transcribed.
 
 ## ASR - Model
-[TODO]
+### Kaldi
+### Wav2Vec
+### Espnet
 
-## Corpus
-The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset on the [Data-Card](). Refer to the following list of recordings and the general meta information about the recordings:
+<!-- ### Keras Transformer -->
 
-### Griots Narrations
+## jelipkg toolkit (Jeli => Griot in Bambara)
+<code>jelipkg</code> is a sub-package that serves as an entry point to the corpus. It is a Python package that lets you browse and download the corpus at your convenience. You can download the textual data in either raw text or JSON format, and the audio either in batch format or as clips (utterances).
 
-| Recording ID | Theme | Dialect | Utterance Count | Spkr. Gender |
-|:------------:|:-----:|:-------:|:---------------:|:------------:|
-| griots_r1 | L'histoire d'une fille | Bamako | 980 | M |
-| griots_r2 | L'histoire d'un grand marabo | Ségou | 1030 | M |
-| griots_r3 | Les forgérons | Bamako | 805 | M |
-| griots_r4 | Les Noms Authentiques | Bamako | 764 | M |
-| griots_r5 | Les Coulibaly | Bamako | 981 | M |
-| griots_r6 | Les Diarra | Ségou | 1122 | M |
-| griots_r7 | L'histoire du roi Razaly | Bamako | 1407 | M |
-| griots_r8 | L'histoire des fils d'Abraham | Bamako | 1126 | F |
-| griots_r9 | Les ''Niamala'' hommes de caste | Bamako | 821 | M |
-| griots_r10 | L'éducaion d'hier et d'aujourd'hui | Bamako | 1078 | F |
-| griots_r11 | Garba Mama | Bamako | 970 | M |
-| griots_r12 | La Bataille de Kaana | Bamako | 997 | M |
-| griots_r13 | Diokala | Bamako | 964 | M |
-| griots_r14 | Nos ancetres | Malinké Siby | 1136 | M |
-| griots_r15 | L'histoire d'El Hadj Oumar Tall | Bamako | 844 | M |
-| griots_r16 | Les Massassi du Karta 'Bɔ' | Bamako | 941 | M |
-| griots_r17 | Histoire de Samory | Malinké kangaba | 773 | M |
-| griots_r18 | Le griot | Malinké de kangaba | 809 | M |
-| griots_r19 | La vie d'avant en milieu Bamanan | Bamako | 611 | F |
-| griots_r20 | Les Maabo | Ségou | 1102 | M |
-| griots_r21 | L'histoire de Djonkoloni | Bamako | 859 | M |
-| griots_r22 | Various | Malinké de Siby | 926 | F |
-| griots_r23 | L'histoire de Bɔ | Ségou | 1319 | M |
-| griots_r24 | L'éducaion d'hier et d'aujourd'hui | Bamako | 942 | F |
-| griots_r25 | L'hisoire de la jeune fille Niamakolo | Bamako | 828 | F |
-| griots_r26 | Hier et aujourd'hui | Bamako | 1128 | M |
-| griots_r27 | Les Mianka | Bamako | 1166 | M |
-| griots_r28 | Le mariage d'hier et d'aujourd'hui | Bamako | 810 | F |
-| griots_r29 | L' histoire de Dabo | Bamako | 774 | M |
-| griots_r30 | Les valeurs du Mali | Bamako | 968 | M |
-|**TOTAL**||| ***28971*** ||
-||
+### Installation
+- Install a revised version of [DABA](https://github.com/maslinych/daba)
 
-### Street Interviews
-Along side the griots' narrations, a smaller sample of individuals were interviewd about the importance of bambara in the technology.
-
-| Recording ID | Utt. Count | Spkr. Gender | Status |
-|:------------:|:-------:|:------------:|:------:|
-| intrvw_r1 | 55 | F | V |
-| intrvw_r2 | X | X | X |
-| intrvw_r3 | 24 | M | V |
-| intrvw_r4 | 25 | M | V |
-| intrvw_r5 | 31 | M | V |
-| intrvw_r6 | 20 | M | V |
-| intrvw_r7 | X | X | X |
-| intrvw_r8 | X | X | X |
-| intrvw_r9 | X | X | X |
-| intrvw_r10 | X | X | X |
-| intrvw_r11 | X | X | X |
-| intrvw_r12 | X | X | X |
-| intrvw_r13 | 25 | M | V |
-| intrvw_r14 | X | X | X |
-| intrvw_r15 | X | X | X |
-| intrvw_r16 | X | X | X |
-| intrvw_r17 | X | X | X |
-| intrvw_r18 | X | X | X |
-| intrvw_r19 | X | X | X |
-| intrvw_r20 | 17 | M | V |
-| intrvw_r21 | 137 | M | V |
-| intrvw_r22 | 142 | F | V |
-| **TOTAL** | ***476*** | - | - |
-||
-
-### jelipkg toolkit
-<code>jelipkg</code> is sub-package that serves as an entry point to the corpus. It is a python package that allows you to browse, and download the corpus for your own convenience, you can download the textual data either in raw text format or json format.
-
-#### Installation
-#### Quickstart
-#### Documentation
+```bash
+$ pip install -U https://github.com/s7d11/daba/releases/download/v0.0.1-alpha/daba-0.9.2.tar.gz
+```
+
+- Install `jelipkg`
+
+```sh
+```
+
+### Quickstart
+
+- Launch the interactive shell
+
+```bash
+$ jelipkg
+```
+
+- Choose an option
+
+```
+Welcome to jelipkg v0.0.1
+
+Type browse, download, help, exit
+jeli> browse
+```
+
+- Select a recording
+
+```
+jeli> Select a recording ID:
+> griots_r01
+griots_r02
+griots_r03
+griots_r04
+griots_r05
+...
+```
+
+- Choose a browsing option
+```
+jeli> Choose browsing option:
+> Recording overview
+Detailed view of recording
+```
+
+- Output
+
+```
+Recording griots_r1
+Theme: L'histoire d'une fille
+Speaker: M
+Utterances: 982
+Duration: 3277.0
+Tokens: 12289
+Types: 1080
+jeli> Download griots_r1? (y/N)
+```
+
+### Documentation
+Type one of the following commands:
+- **browse** -> Interactively browse the list of recordings
+- **download** -> Directly download a recording from the dataset
+- **help** -> Display the help message
+- **exit** -> Exit the `jelipkg` console
+
+### Bugs
+- [Bugs](https://github.com/robotsmali-ai/jeli-asr/issues)
+
+### License
+- [MIT License](./jeli/LICENSE)
+
+### Future features
+- Direct CLI (one command) capability
+- Multi-recording download
 
 **IMPORTANT**: It is recommended to download one recording/interview at a time, if you have an unreliable network due to the size of the dataset.
 
@@ -91,6 +110,18 @@ Along side the griots' narrations, a smaller sample of individuals were intervie
 **inquiries & Collaboration**: `research <at> robotsmali.org`
 
 ## Reference
+```
+@misc{griotsdataset2022,
+author = {Sebastien Diarra and Michael Leventhal and Mouktar Traore and Alou Dembele},
+title = {RobotsMali Griots Recording},
+howpublished = {\url{https://github.com/robotsmali-ai/jeli-asr/}},
+year = 2022
+}
+```
+
+## Known Issues
+- griots_r24_1: Bambara missing
+- **Some recordings need FRENCH adjustment**
 
 ## License
 This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
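
The recording overview shown in the README's Quickstart can be turned into a few derived statistics. The sketch below is illustrative only: `parse_overview` is a hypothetical helper, not part of `jelipkg`, and it assumes the overview is printed as `Key: value` lines and that `Duration` is in seconds (the README does not state the unit).

```python
def parse_overview(text: str) -> dict[str, str]:
    """Collect 'Key: value' lines into a dict; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Overview text copied from the Quickstart output in the README.
OVERVIEW = """\
Recording griots_r1
Theme: L'histoire d'une fille
Speaker: M
Utterances: 982
Duration: 3277.0
Tokens: 12289
Types: 1080
"""

fields = parse_overview(OVERVIEW)
utterances = int(fields["Utterances"])
tokens = int(fields["Tokens"])
types = int(fields["Types"])
duration = float(fields["Duration"])  # assumption: seconds

print(f"tokens per utterance:  {tokens / utterances:.2f}")
print(f"type/token ratio:      {types / tokens:.3f}")
print(f"seconds per utterance: {duration / utterances:.2f}")
```

Parsing on the first colon only keeps values such as the theme intact even if they contain punctuation of their own.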

docs/README.old.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+Jeli ASR & Corpus
+
+## What is Jeli-ASR
+Jeli-ASR is a multidimensional package developed to empower the use of the Bambara language. It started as an initiative to develop the Bambara language and its cultural values. The package consists of an ASR model under ongoing development, a mini corpus of griots' narrations in [audio](https://zenodo.org/record/6997806), its transcription in ***eaf***, which is the [ELAN format](https://archive.mpi.nl/tla/elan/download), and a tool to download and extract the dataset.
+
+## ASR - Model
+[TODO]
+
+## Corpus
+The Griots corpus is a speech corpus containing both audio and its accompanying transcribed text. You can find the intent, the approaches, a detailed look, and a thorough explanation of the dataset on the [Data-Card (coming)](). Refer to the following list of recordings and the general meta information about the recordings:
+
+### Griots Narrations
+
+| Recording ID | Theme | Dialect | Utterance Count | Spkr. Gender |
+|:------------:|:-----:|:-------:|:---------------:|:------------:|
+| griots_r1 | L'histoire d'une fille | Bamako | 980 | M |
+| griots_r2 | L'histoire d'un grand marabo | Ségou | 1030 | M |
+| griots_r3 | Les forgérons | Bamako | 805 | M |
+| griots_r4 | Les Noms Authentiques | Bamako | 764 | M |
+| griots_r5 | Les Coulibaly | Bamako | 981 | M |
+| griots_r6 | Les Diarra | Ségou | 1122 | M |
+| griots_r7 | L'histoire du roi Razaly | Bamako | 1407 | M |
+| griots_r8 | L'histoire des fils d'Abraham | Bamako | 1126 | F |
+| griots_r9 | Les ''Niamala'' hommes de caste | Bamako | 821 | M |
+| griots_r10 | L'éducaion d'hier et d'aujourd'hui | Bamako | 1078 | F |
+| griots_r11 | Garba Mama | Bamako | 970 | M |
+| griots_r12 | La Bataille de Kaana | Bamako | 997 | M |
+| griots_r13 | Diokala | Bamako | 964 | M |
+| griots_r14 | Nos ancetres | Malinké Siby | 1136 | M |
+| griots_r15 | L'histoire d'El Hadj Oumar Tall | Bamako | 844 | M |
+| griots_r16 | Les Massassi du Karta 'Bɔ' | Bamako | 941 | M |
+| griots_r17 | Histoire de Samory | Malinké kangaba | 773 | M |
+| griots_r18 | Le griot | Malinké de kangaba | 809 | M |
+| griots_r19 | La vie d'avant en milieu Bamanan | Bamako | 611 | F |
+| griots_r20 | Les Maabo | Ségou | 1102 | M |
+| griots_r21 | L'histoire de Djonkoloni | Bamako | 859 | M |
+| griots_r22 | Various | Malinké de Siby | 926 | F |
+| griots_r23 | L'histoire de Bɔ | Ségou | 1319 | M |
+| griots_r24 | L'éducaion d'hier et d'aujourd'hui | Bamako | 942 | F |
+| griots_r25 | L'hisoire de la jeune fille Niamakolo | Bamako | 828 | F |
+| griots_r26 | Hier et aujourd'hui | Bamako | 1128 | M |
+| griots_r27 | Les Mianka | Bamako | 1166 | M |
+| griots_r28 | Le mariage d'hier et d'aujourd'hui | Bamako | 810 | F |
+| griots_r29 | L' histoire de Dabo | Bamako | 774 | M |
+| griots_r30 | Les valeurs du Mali | Bamako | 968 | M |
+|**TOTAL**||| ***28971*** ||
+||
+
+### Street Interviews
+Alongside the griots' narrations, a smaller sample of individuals was interviewed about the importance of Bambara in technology.
+
+**N.B**: Not all of these audios have been transcribed.
+
+| Recording ID | Utt. Count | Spkr. Gender | Status |
+|:------------:|:-------:|:------------:|:------:|
+| intrvw_r1 | 55 | F | V |
+| intrvw_r2 | X | X | X |
+| intrvw_r3 | 24 | M | V |
+| intrvw_r4 | 25 | M | V |
+| intrvw_r5 | 31 | M | V |
+| intrvw_r6 | 20 | M | V |
+| intrvw_r7 | X | X | X |
+| intrvw_r8 | X | X | X |
+| intrvw_r9 | X | X | X |
+| intrvw_r10 | X | F | V |
+| intrvw_r11 | X | M | V |
+| intrvw_r12 | X | M | V |
+| intrvw_r13 | 25 | M | V |
+| intrvw_r14 | X | X | X |
+| intrvw_r15 | X | X | X |
+| intrvw_r16 | X | X | X |
+| intrvw_r17 | X | X | X |
+| intrvw_r18 | X | X | X |
+| intrvw_r19 | X | X | X |
+| intrvw_r20 | 17 | M | V |
+| intrvw_r21 | 137 | M | V |
+| intrvw_r22 | 142 | F | V |
+| **TOTAL** | ***476*** | - | - |
+||
+
+### jelipkg toolkit
+<code>jelipkg</code> is a sub-package that serves as an entry point to the corpus. It is a Python package that lets you browse and download the corpus at your convenience. You can download the textual data in either raw text or JSON format.
+
+#### Installation
+
+```bash
+$ pip install -U https://github.com/s7d11/daba/releases/download/v0.0.1-alpha/daba-0.9.2.tar.gz
+```
+
+#### Quickstart
+#### Documentation
+
+**IMPORTANT**: It is recommended to download one recording/interview at a time, if you have an unreliable network due to the size of the dataset.
+
+## Contact & People
+**Principal Investigator**: Michael Leventhal, `mleventhal <at> robotsmali.org`
+**Manager**: Sebastien Diarra, `sdiarra <at> robotsmali.org`
+**inquiries & Collaboration**: `research <at> robotsmali.org`
+
+## Reference
+
+## Errors & Bugs
+
+## License
+This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
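
The per-recording tables in this archived README lend themselves to a quick consistency check on the stated totals. The following is a minimal sketch, not code from the repository: `column_values` is a hypothetical helper, the table is a three-row excerpt of the Street Interviews table, and reading `X` as "no count available yet" is an assumption.

```python
def column_values(table: str, column: str) -> list[str]:
    """Return the cells of one named column from a pipe-delimited Markdown table."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |:---:| alignment row
    index = header.index(column)
    return [row[index] for row in body]


# Excerpt of the Street Interviews table (first three rows only).
TABLE = """
| Recording ID | Utt. Count | Spkr. Gender | Status |
|:------------:|:-------:|:------------:|:------:|
| intrvw_r1 | 55 | F | V |
| intrvw_r2 | X | X | X |
| intrvw_r3 | 24 | M | V |
"""

values = column_values(TABLE, "Utt. Count")
counts = [int(v) for v in values if v.isdigit()]  # skip the 'X' placeholders

print(f"recordings with a count: {len(counts)} of {len(values)}")
print(f"utterances in excerpt:   {sum(counts)}")
```

Running the same helper over a full table and comparing the sum with its **TOTAL** row is one way to catch a count that was edited without updating the total.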

docs/requirements.txt

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+args==0.1.0
+asgiref==3.5.2
+asttokens==2.0.8
+backcall==0.2.0
+certifi==2022.6.15
+charset-normalizer==2.1.0
+cycler==0.11.0
+decorator==5.1.1
+et-xmlfile==1.1.0
+executing==0.10.0
+fonttools==4.36.0
+funcparserlib==1.0.0
+idna==3.3
+inquirerpy==0.3.4
+intervaltree==3.1.0
+ipython==8.4.0
+jedi==0.18.1
+kiwisolver==1.4.4
+matplotlib==3.5.3
+matplotlib-inline==0.1.3
+numpy==1.23.2
+openpyxl==3.0.10
+packaging==21.3
+pandas==1.4.3
+parso==0.8.3
+pexpect==4.8.0
+pfzy==0.3.4
+pickleshare==0.7.5
+Pillow==9.2.0
+prompt-toolkit==3.0.30
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pydub==0.25.1
+Pygments==2.13.0
+pympi-ling==1.70.2
+pyparsing==3.0.9
+python-dateutil==2.8.2
+PyTrie==0.4.0
+pytz==2022.2.1
+regex==2022.9.13
+requests==2.28.1
+six==1.16.0
+sortedcontainers==2.4.0
+sqlparse==0.4.2
+stack-data==0.4.0
+traitlets==5.3.0
+types-requests==2.28.10
+types-urllib3==1.26.24
+urllib3==1.26.11
+wcwidth==0.2.5
+wget==3.2

jeli/LICENSE

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+Copyright (c) 2022 RobotsMali
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

jeli/TODO.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# TODO
+
+## CLI Integration
+Options:
+- Some statistical information on the text
+- Display Recording meta (jelipkg -i(--info) <id>)
+- detail flag -> displays breakdown per file
+- Display recording list (jelipkg -l(--list))
+- Get Recording_ID (jelipkg get <id> -o(--output=eaf|json|txt) --audio)
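
The options listed in this TODO map fairly directly onto Python's standard `argparse` module. The sketch below shows one possible shape for that command line; it is not the project's implementation. The flag spellings `-i/--info`, `-l/--list`, `get <id>`, `-o/--output` and `--audio` come from the TODO, while the name `--detail`, the `eaf` default, and the function `build_parser` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the CLI options proposed in jeli/TODO.md."""
    parser = argparse.ArgumentParser(prog="jelipkg")
    parser.add_argument("-i", "--info", metavar="ID",
                        help="display the metadata of one recording")
    # Hypothetical spelling: the TODO only says "detail flag".
    parser.add_argument("--detail", action="store_true",
                        help="with --info, show a breakdown per file")
    parser.add_argument("-l", "--list", action="store_true",
                        help="display the list of recordings")

    subcommands = parser.add_subparsers(dest="command")
    get = subcommands.add_parser("get", help="fetch one recording")
    get.add_argument("recording_id", metavar="ID")
    # The choices come from the TODO; the default value is an assumption.
    get.add_argument("-o", "--output", choices=["eaf", "json", "txt"],
                     default="eaf", help="transcription format")
    get.add_argument("--audio", action="store_true",
                     help="also fetch the audio")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["get", "griots_r1", "-o", "json", "--audio"])
print(args.command, args.recording_id, args.output, args.audio)
```

Keeping `get` as a subcommand while `--info` and `--list` stay top-level flags follows the TODO's own notation; a design with three subcommands would work equally well.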
