Dataset

This is a dummy directory that previews the directory structure of the dataset. A sample processed JSON object is available here.

Download the dataset here.

Data

The current release of the dataset has documents processed from the following conference proceedings.

NeurIPS   EMNLP   ACL     InterSpeech
-         -       -       -
2019      2019    2019    2019
2018      -       2018    2018
2017      -       2017    2017
2016      -       2016    -
2015      -       2015    -

Note: A few files from certain proceedings were dropped from the dataset due to parsing errors.

Features

The dataset contains the following fields extracted from each document.

  • Semantically extracted fields, produced with an NLTK and spaCy pipeline.
    • entities: Named Entity Recognition is performed on the document text to extract text spans as entities and tag each with an entity_type.
    • tags: Part-of-speech tagged tokens extracted from the document text.
    • parser: Dependency parse relations between text spans of the document.
    • noun_chunks: Base noun phrases that have a noun as their head.
  • Metadata fields extracted using PyPDF2:
    filename          metadata    numPages
    title             author      subject
    creator           producer    keywords
    creationdate      moddate     trapped
    ptexfullbanner    raw_text    -
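
As a rough illustration of working with the fields listed above, the sketch below checks which of the documented field names are present in a processed JSON object. It assumes that every field is a top-level key of the object, which this README does not state. The `missing_fields` helper and the sample record are hypothetical and are not part of the dataset tooling.

```python
import json

# Field names as listed in this README. Treating them all as
# top-level keys of the JSON object is an assumption.
SEMANTIC_FIELDS = ["entities", "tags", "parser", "noun_chunks"]
METADATA_FIELDS = [
    "filename", "metadata", "numPages",
    "title", "author", "subject",
    "creator", "producer", "keywords",
    "creationdate", "moddate", "trapped",
    "ptexfullbanner", "raw_text",
]


def missing_fields(record):
    """Return the documented field names that are absent from a record."""
    expected = SEMANTIC_FIELDS + METADATA_FIELDS
    return [name for name in expected if name not in record]


# Hypothetical hand-written record; the values are placeholders,
# not content taken from the dataset.
sample_json = (
    '{"filename": "paper.pdf", "numPages": 9, '
    '"raw_text": "...", "entities": []}'
)
record = json.loads(sample_json)
missing = missing_fields(record)
print(missing)
```

To run the same check against a real record, replace `sample_json` with the contents of a downloaded file; if the actual objects nest the metadata fields under a separate key, the lookup would need to be adjusted accordingly.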