Releases: stanfordnlp/CoreNLP

v4.3.1

22 Oct 11:28
@J38 J38

Fixes

  • Fixes a character offset issue with StatTok
  • Fixes a path issue with the default Hungarian properties
  • Adds Hungarian and Italian to the demo
  • Fixes an umlaut issue

v4.3.0

06 Oct 10:54
@J38 J38

Overview

This release adds new European languages, improvements to the parsers and tokenizers, and other miscellaneous fixes.

Enhancements

  • Hungarian pipeline
  • Italian pipeline
  • Improvements to English tokenizer
  • Better memory usage by dependency parser

Fixes

  • Issue with umlaut handling in German (#1184)

v4.2.2

14 May 21:36
@J38 J38

This release includes some small fixes to version 4.2.1.

It includes:

  • demo fixes for 4.2.2, resolving cache issues with demo resources
  • small fix to RegexNERSequenceClassifier issue allowing AnswerAnnotation to be overwritten

v4.2.1

05 May 20:58

Fix server links that used http instead of https
#1146

Improve MWE expressions in the enhanced dependency conversion
1ef9ef9

Add the ability for the command line semgrex processor to handle multiple calls in one process
c9d50ef

Fix interaction between discarding tokens in ssplit and assigning NER tags
a803bc3

Slightly reduce the size of the SR parser models
#1142

Various QuoteAnnotator bug fixes
#1135
#1134
#1121
#1118
9f1b015
#1147

Switch to a newer istack implementation
#1133

Upgrade to a newer protobuf
#1150

Add a conllu output format to some of the segmenter code, useful for testing with the official test scripts
c70ddec

Fix Turkish locale enums
#1126
stanfordnlp/stanza#580

Use StringBuilder instead of StringBuffer where possible
#1010
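
The Turkish locale fix listed above (#1126) is not described in detail in these notes. A common cause of this class of bug in Java, assumed here rather than confirmed by the release, is calling `toUpperCase()` or `toLowerCase()` with the default locale before an enum lookup: under a Turkish locale, `i` uppercases to `İ` (U+0130), so the result no longer matches the constant name. The sketch below uses a hypothetical `Flag` enum and hypothetical helper names, not CoreNLP code, and pins the conversion to `Locale.ROOT`.

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    // Hypothetical enum standing in for any enum that is looked up by name.
    public enum Flag { TITLE, INFINITIVE }

    // Locale-sensitive conversion: the result depends on the locale passed in.
    public static String fragileUpper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    // Locale-independent lookup: Locale.ROOT gives the same mapping everywhere.
    public static Flag parse(String name) {
        return Flag.valueOf(name.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules the 'i' becomes U+0130, so Flag.valueOf would throw.
        System.out.println(fragileUpper("title", turkish).equals("TITLE")); // false
        // The ROOT-based lookup succeeds regardless of the JVM's default locale.
        System.out.println(parse("title")); // TITLE
    }
}
```

The same reasoning applies to `toLowerCase`, where `I` maps to the dotless `ı` (U+0131) under Turkish rules.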

v4.2.0

17 Nov 10:17
@J38 J38

Overview

This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.

Enhancements

  • Upgrade libraries (EJML, JUnit, JFlex)
  • Add character offsets to Tregex responses from server
  • Improve cleaning of treebanks for English models
  • Speed up loading of Wikidict annotator
  • New utility for tagging CoNLL-U files in place
  • Command line tool for processing TokensRegex

Fixes

  • Output single token NER entities in inline XML output format
  • Add currency symbol part of speech training data
  • Fix issues with tree binarizing

Stanford CoreNLP 4.0.0

04 May 02:47
@J38 J38

Overview

The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

Enhancements

  • UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
  • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
  • Have WhitespaceTokenizer support same newline processing as PTBTokenizer
  • New mwt annotator for handling multiword tokens in French, German, and Spanish.
  • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
  • Add French NER
  • New Chinese segmentation based on CTB9
  • Improved handling of double codepoint characters
  • Easier syntax for specifying language specific pipelines and NER pipeline properties
  • Improved CoNLL-U processing
  • Improved speed and memory performance for CRF training
  • Tregex support in CoreSentence
  • Updated library dependencies
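
The notes do not say what changed for double codepoint characters. As background only, the sketch below shows why such characters need care in Java: a character outside the Basic Multilingual Plane occupies two UTF-16 `char` values, so `String.length()` and naive character offsets disagree with the code point count. The class and method names are hypothetical and use only the standard library, not CoreNLP.

```java
public class CodePointDemo {
    // Number of Unicode code points, as opposed to UTF-16 code units.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Convert a code point index into the corresponding char (UTF-16) offset.
    public static int charOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 written as a surrogate pair, then 'b'.
        String text = "a\uD835\uDD38b";
        System.out.println(text.length());         // 4 UTF-16 code units
        System.out.println(codePointLength(text)); // 3 code points
        System.out.println(charOffset(text, 2));   // 3: 'b' starts at char offset 3
    }
}
```

A tokenizer that reports character offsets has to pick one of these two conventions and apply it consistently, otherwise tokens after a surrogate pair appear shifted by one.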

Fixes

  • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
  • NPE in EntityMentionsAnnotator during language check
  • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
  • NPE in NERCombinerAnnotator in certain configurations of models on/off
  • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
  • Apply named entity granularity change prior to coref mention detection
  • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
  • Incorrect handling of reading in German treebank files
  • SR parser crashes when given bad training input
  • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
  • Fix ancient bug in printing constituency tree with multiple roots.
  • Fix parser failing on the word "STOP", which it treated as a special word