Skip to content

noisychannel/ARZ_callhome_corpus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The CALLHOME Egyptian Arabic Speech Translation Corpus
==================================================================

This directory contains the translations generated at JHU for the following LDC datasets
1. CALLHOME Egyptian Arabic (1996) [LDC97T19] : train, dev, eval
2. NIST HUB5 evaluation dataset (1997) [LDC2002T38] : h5
3. CALLHOME Egyptian Arabic - supplements [LDC2002T39] : sup

Four translations were to be generated for each reference segment (in ECA). These are (with number of segments):
- ECA ’96 train : 20861
- ECA ’96 dev : 6415
- ECA ’96 test : 3044
- 97 eval (H5) : 2800
- ECA supplement : 2722 

The files are in the TAB separated format (TSV)
For each partition {train, dev, eval, h5, sup}, the source file is 
callhome_{train,dev,eval,h5,sup}.ar

The four reference translations are in separate files and match line-by-line the source segments
callhome_{train,dev,eval,h5,sup}.en.{0,1,2,3}

For some segments, extra (more than 4) translations were generated. These are stored in an overflow file.
callhome_{train,dev,eval,h5,sup}.en.over

------------
ASR output
------------
We supply the ASR output from two systems for all of the datasets listed above.
The ASR output is in the form of PLF (lattices), ASR 1-best output and the lattice-oracle
Note that these ASR systems were tuned on the dev set and even though we provide the
dev-ASR output, it should not be used as a test set.

WER results for the two systems (SGMM-12, DNN-9) are :

----------------------------------------------
              ASR one-best                   |
----------------------------------------------
             |  dev  | dev2  |  sup  |  h5   |
-------------|-------|-------|-------|-------|
SAT+SGMM     | 58.06 | 56.46 | 63.29 | 61.41 |
DNN-Ensemble | 50.31 | 49.54 | 58.15 | 55.76 |
----------------------------------------------

----------------------------------------------
              ASR lattice oracle             |
----------------------------------------------
             |  dev  | dev2  |  sup  |  h5   |
-------------|-------|-------|-------|-------|
SAT+SGMM     | 33.33 | 33.17 | 41.67 | 38.40 |
DNN-Ensemble | 23.56 | 23.34 | 32.22 | 28.16 |
----------------------------------------------

For a complete description of this corpus, and for citation in your own published research, please
cite the following paper. A copy can be found in the doc/ directory.

    @inproceedings{Kumar2014translations,
      Title = {Translations of the CALLHOME Egyptian Arabic Corpus for Conversational Speech Translation},
      Author = {Gaurav Kumar and Yuan Cao and Ryan Cotterell and Chris Callison-Burch and Daniel Povey and Sanjeev Khudanpur},
      year = {2014},
      Booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
      Address = {Lake Tahoe, US},
      Month = {December}
    }

Due to licensing restrictions, we cannot include the LDC Egyptian Arabic transcripts with this dataset. We
have, however, provided scripts that will construct our data splits. Modify the Makefile provided to
point to the data directories for the ECA LDC corpora. Then run:

    make

You should end up with the processed transcripts in the corpora directory;

    ls corpora/*.ar

About

The CALLHOME Egyptian Arabic Speech Translation Corpus

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages