An ambiguous subtitles dataset for visual scene-aware machine translation

ku-nlp/VISA

VISA

VISA is a dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features:

  • The parallel sentences are subtitles from movies and TV episodes
  • The source subtitles are ambiguous, which means they have multiple possible translations with different meanings
  • We divide the dataset into Polysemy and Omission according to the cause of ambiguity

Examples:

Polysemy:

放せ! --> Let me go!

[Video clip: let_me_go]

Omission:

銃を持ってる。 --> I have a gun.

[Video clip: I_carry_a_gun]

Splits:

Split    | Train  | Validation | Test
Polysemy | 18,666 | 1,000      | 1,000
Omission | 17,214 | 1,000      | 1,000
Combined | 35,880 | 2,000      | 2,000

Usage:

The JSON files map each video file name to its parallel subtitle pair.

JSON file structure:

video_file_name: {  
    { "ja": Japanese_subtitle },  
    { "en": English_subtitle }  
}  
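
As an illustration of reading such a file, here is a minimal Python sketch. It is not part of the dataset tooling. The schematic above does not pin down the exact nesting, so the sketch assumes each entry is either a single object with "ja" and "en" keys or a list of one-key objects, and accepts both. The function names `load_pairs` and `normalize_entry` are hypothetical.

```python
import json


def normalize_entry(entry):
    """Return (ja, en) from one entry, accepting either assumed layout."""
    if isinstance(entry, dict):
        # Assumed layout 1: {"ja": ..., "en": ...}
        merged = entry
    else:
        # Assumed layout 2: [{"ja": ...}, {"en": ...}]
        merged = {}
        for part in entry:
            merged.update(part)
    return merged["ja"], merged["en"]


def load_pairs(json_path):
    """Map each video file name to its (Japanese, English) subtitle pair."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return {video: normalize_entry(entry) for video, entry in data.items()}
```

Calling `load_pairs` with the path to one of the downloaded JSON files returns a dictionary keyed by video file name, which can then be joined with the corresponding clips.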

Note:

Please note that by downloading the dataset, you agree to the following conditions:

  • Do not re-distribute the dataset without our permission.
  • The dataset can only be used for research purposes. Any other use is explicitly prohibited.

Downloadable Features:

If you are interested in the video features of VISA, you can download them from the following links:

Citation:

If you find this dataset helpful, please cite our publication "VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation":

@inproceedings{li-etal-2022-visa,
    title = "{VISA}: An Ambiguous Subtitles Dataset for Visual Scene-aware Machine Translation",
    author = "Li, Yihang  and
      Shimizu, Shuichiro  and
      Gu, Weiqi  and
      Chu, Chenhui  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.725",
    pages = "6735--6743",
}

Contact:

If you have any questions about this dataset, please contact [email protected].

License:

GNU General Public License v3.0
