Ellipsis and Elided Elements in Natural Language: The Hoosier Ellipsis Corpus

Created: Damir Cavar, 2023-06-07

Last change: Damir Cavar, 2024-08-17

Ellipsis and related phenomena, in which words in sentences and utterances are elided or omitted, are of great interest from the perspective of theoretical linguistics and of the cognitive language faculty. As a general starting point, we recommend The Oxford Handbook of Ellipsis and the numerous research articles, books, and dissertations discussed in its sections. Further relevant articles are listed in the publications below and on the websites of the various ellipsis corpus projects.

There are various reasons why we are working on ellipsis and other word-omitting phenomena. Some of those are:

We will soon provide research reports here with quantified data related to these claims. The claims are strong, but in our experience the limited use of phrase structure and dependency parsers is significantly related to their failure to process Dark Matter in Language. Semantic and pragmatic approaches could certainly be tried to reconstruct omitted linguistic content; we focus instead on syntactic and pattern-based methods with neural and symbolic algorithms, modeling the fast and slow processing of elided linguistic content by the human language faculty.

Our goals are ambitious:

  • to develop a data set that provides enough corpus material to sufficiently document and represent the various manifestations of ellipsis, and to provide enough data for the engineering of NLP solutions
  • to engineer NLP components (parsers and language models) that can reconstruct the omitted words and provide theoretically adequate and computationally useful syntactic parse trees (Phrase Structure Trees and Dependency Trees) for constructions containing ellipsis
  • to cover all of this for English, Ukrainian, Russian, Chinese, Japanese, Korean, Spanish, German...
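
The reconstruction task named in the goals above can be illustrated with a minimal Python sketch. The record layout used here (`tokens` and `gaps`), the class name `EllipsisExample`, and the function `reconstruct` are hypothetical and chosen only for illustration; they are not the Hoosier Ellipsis Corpus data format and not the project's implementation. The example sentence is a constructed gapping case, not corpus data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EllipsisExample:
    """Hypothetical record: a tokenized sentence plus its elided material."""
    tokens: List[str]                  # the sentence as written, with words omitted
    gaps: List[Tuple[int, List[str]]]  # (insertion index into tokens, elided words)


def reconstruct(example: EllipsisExample) -> List[str]:
    """Return the token list with the elided words restored at each gap."""
    restored = list(example.tokens)
    # Insert from the rightmost gap first so that earlier indices stay valid.
    for index, words in sorted(example.gaps, key=lambda gap: gap[0], reverse=True):
        restored[index:index] = words
    return restored


# Constructed gapping example: the verb is omitted in the second conjunct.
example = EllipsisExample(
    tokens="Mary ordered tea and John coffee .".split(),
    gaps=[(5, ["ordered"])],
)
print(" ".join(reconstruct(example)))
```

An NLP component of the kind described above would have to predict the contents of `gaps` from `tokens` alone; the sketch only shows how a predicted gap maps back to a full sentence that a standard parser could then process.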

The corpus format is documented here: The Hoosier Ellipsis Corpus - Data Format

Online Resources

We identified the following resources online:

If you have more links or if you want to share your data sets, please send us a note, dcavar at iu.edu.

The Hoosier Ellipsis Corpus (THEC)

The main corpus code and data links can be found at:

Publications

Presentations

This presentation covers ellipsis constructions in Arabic and three types of experiments, using Logistic Regression, BERT-type classifiers and guessers, and GPT-4 (ChatGPT) Large Language Models (LLMs), to guess whether sentences contain ellipsis, where the ellipsis is located, and what the elided words are: