-
Notifications
You must be signed in to change notification settings - Fork 3
The CALLHOME Egyptian Arabic Speech Translation Corpus
License
noisychannel/ARZ_callhome_corpus
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The CALLHOME Egyptian Arabic Speech Translation Corpus ================================================================== This directory contains the translations generated at JHU for the following LDC datasets 1. CALLHOME Egyptian Arabic (1996) [LDC97T19] : train, dev, eval 2. NIST HUB5 evaluation dataset (1997) [LDC2002T38] : h5 3. CALLHOME Egyptian Arabic - supplements [LDC2002T39] : sup Four translations were to be generated for each reference segment (in ECA). These are (with number of segments): - ECA ’96 train : 20861 - ECA ’96 dev : 6415 - ECA ’96 test : 3044 - 97 eval (H5) : 2800 - ECA supplement : 2722 The files are in the TAB separated format (TSV) For each partition {train, dev, eval, h5, sup}, the source file is callhome_{train,dev,eval,h5,sup}.ar The four reference translations are in separate files and match line-by-line the source segments callhome_{train,dev,eval,h5,sup}.en.{0,1,2,3} For some segments, extra (more than 4) translations were generated. These are stored in an overflow file. callhome_{train,dev,eval,h5,sup}.en.over ------------ ASR output ------------ We supply the ASR output from two systems for all of the datasets listed above. The ASR output is in the form of PLF (lattices), ASR 1-best output and the lattice-oracle Note that these ASR systems were tuned on the dev set and even though we provide the dev-ASR output, it should not be used as a test set. WER results for the two systems (SGMM-12, DNN-9) are : ---------------------------------------------- ASR one-best | ---------------------------------------------- | dev | dev2 | sup | h5 | -------------|-------|-------|-------|-------| SAT+SGMM | 58.06 | 56.46 | 63.29 | 61.41 | DNN-Ensemble | 50.31 | 49.54 | 58.15 | 55.76 | ---------------------------------------------- ---------------------------------------------- ASR lattice oracle | ---------------------------------------------- | dev | dev2 | sup | h5 | -------------|-------|-------|-------|-------| SAT+SGMM | 33.33 | 33.17 | 41.67 | 38.40 | DNN-Ensemble | 23.56 | 23.34 | 32.22 | 28.16 | ---------------------------------------------- For a complete description of this corpus, and for citation in your own published research, please cite the following paper. A copy can be found in the doc/ directory. @inproceedings{Kumar2014translations, Title = {Translations of the CALLHOME Egyptian Arabic Corpus for Conversational Speech Translation}, Author = {Gaurav Kumar and Yuan Cao and Ryan Cotterell and Chris Callison-Burch and Daniel Povey and Sanjeev Khudanpur}, year = {2014}, Booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)}, Address = {Lake Tahoe, US}, Month = {December} } Due to licensing restrictions, we cannot include the LDC Egyptian Arabic transcripts with this dataset. We have, however, provided scripts that will construct our data splits. Modify the Makefile provided to point to the data directories for the ECA LDC corpora. Then run: make You should end up with the processed transcripts in the corpora directory; ls corpora/*.ar
About
The CALLHOME Egyptian Arabic Speech Translation Corpus
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published