dataset

Name	Last commit message	Last commit date
acl	Update readme and scripts	Jul 26, 2020
emnlp/emnlp_2019	Update readme and scripts	Jul 26, 2020
interspeech	Update readme and scripts	Jul 26, 2020
neurips	Update readme and scripts	Jul 26, 2020
README.md	Update readme and scripts	Jul 26, 2020

README.md

This is a dummy directory that previews the directory structure of the dataset. A sample processed JSON object is available here.

Download the dataset here.

The current release of the dataset has documents processed from the following conference proceedings.

NeurIPS	EMNLP	ACL	InterSpeech
-	-	-	-
2019	2019	2019	2019
2018	-	2018	2018
2017	-	2017	2017
2016	-	2016	-
2015	-	2015	-

Note: A few files from certain proceedings were dropped from the dataset due to parsing errors.

The dataset contains the following fields extracted from each document.

Fields extracted semantically using an NLTK and spaCy pipeline:
- entities: Named Entity Recognition is performed on the document text to extract text spans as entities and tag each with an entity_type.
- tags: Part-of-speech-tagged tokens extracted from the document text.
- parser: Dependency parse relations between text spans of the document.
- noun_chunks: Base noun phrases that have a noun as their head.
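
As a rough illustration of how these fields might be consumed, the sketch below loads one processed JSON object and counts entity types and part-of-speech tags. Only the four field names come from this README. The nested layout (lists of pairs and triples), the sample values, and the `summarize` helper are assumptions made for the example, so check them against the sample JSON object before relying on them.

```python
import json
from collections import Counter

# Hypothetical record: the field names come from the README, but the nested
# layout and all values below are assumptions made purely for illustration.
SAMPLE_JSON = """
{
  "entities": [["Stanford", "ORG"], ["2019", "DATE"], ["Google", "ORG"]],
  "tags": [["We", "PRON"], ["propose", "VERB"], ["a", "DET"], ["model", "NOUN"]],
  "parser": [["We", "nsubj", "propose"], ["model", "dobj", "propose"]],
  "noun_chunks": ["We", "a model"]
}
"""


def summarize(record):
    """Count entity types and POS tags in one processed document."""
    entity_types = Counter(label for _text, label in record.get("entities", []))
    pos_tags = Counter(tag for _token, tag in record.get("tags", []))
    return {
        "entity_types": dict(entity_types),
        "pos_tags": dict(pos_tags),
        "num_dependencies": len(record.get("parser", [])),
        "num_noun_chunks": len(record.get("noun_chunks", [])),
    }


record = json.loads(SAMPLE_JSON)
summary = summarize(record)
print(summary)
```

If the real objects store these fields differently (for example as dictionaries rather than pairs), only the two generator expressions inside `summarize` would need to change.
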

Metadata fields extracted using PyPDF2:

filename	metadata	numPages
title	author	subject
creator	producer	keywords
creationdate	moddate	trapped
ptexfullbanner	raw_text	-
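
Several of the metadata field names above (title, author, creationdate, moddate, ptexfullbanner, and so on) look like lowercased versions of standard PDF document-information keys such as `/Title`, `/CreationDate`, and `/PTEX.Fullbanner`. The sketch below shows one way such a mapping could be done. It is an assumption about the naming scheme, not the dataset's confirmed processing code, and `normalize_key`, `normalize_metadata`, and the sample dictionary are hypothetical.

```python
def normalize_key(pdf_key):
    """Turn a PDF info key such as '/CreationDate' into 'creationdate'."""
    # Assumed scheme: drop the leading slash, drop dots, lowercase the rest.
    return pdf_key.lstrip("/").replace(".", "").lower()


def normalize_metadata(info):
    """Apply normalize_key to every key of a document-information dictionary."""
    return {normalize_key(key): value for key, value in info.items()}


# Hypothetical document-information dictionary, shaped like the key/value
# pairs a PDF reader typically exposes. The values are made up.
sample_info = {
    "/Title": "An Example Paper",
    "/Author": "A. Author",
    "/CreationDate": "D:20190101000000Z",
    "/PTEX.Fullbanner": "This is pdfTeX",
}

normalized = normalize_metadata(sample_info)
print(normalized)
```

Fields such as filename, numPages, and raw_text do not correspond to document-information keys and would have to come from elsewhere in the extraction step.
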