Releases: stanfordnlp/CoreNLP
v4.3.1
v4.3.0
v4.2.2
v4.2.1
- Fix the server serving some links as http instead of https (#1146)
- Improve MWE expressions in the enhanced dependency conversion (1ef9ef9)
- Add the ability for the command-line semgrex processor to handle multiple calls in one process (c9d50ef)
- Fix interaction between discarding tokens in ssplit and assigning NER tags (a803bc3)
- Reduce the size of the SR parser models (not a huge amount, but some) (#1142)
- Various QuoteAnnotator bug fixes (#1135, #1134, #1121, #1118, 9f1b015, #1147)
- Switch to newer istack implementation (#1133)
- Upgrade to newer protobuf (#1150)
- Add a CoNLL-U output format to some of the segmenter code, useful for testing with the official test scripts (c70ddec)
- Fix Turkish locale enums (#1126, stanfordnlp/stanza#580)
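The Turkish locale fix above reflects a well-known JDK pitfall rather than anything CoreNLP-specific: `String.toUpperCase()` uses the default locale, and in a Turkish locale `i` uppercases to dotted capital `İ` (U+0130), so enum lookups built from uppercased strings fail. A minimal plain-JDK sketch of the pitfall and the `Locale.ROOT` fix (the enum name here is illustrative, not from CoreNLP):

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    enum Tag { INIT }

    public static void main(String[] args) {
        // In a Turkish locale, 'i' uppercases to dotted capital İ (U+0130),
        // so an Enum.valueOf lookup on the uppercased string fails.
        String turkish = "init".toUpperCase(new Locale("tr"));
        System.out.println(turkish);                // İNİT, not INIT

        // Locale-independent case conversion keeps enum lookups stable:
        Tag tag = Tag.valueOf("init".toUpperCase(Locale.ROOT));
        System.out.println(tag);                    // INIT
    }
}
```

The usual fix, as here, is to pass `Locale.ROOT` (or `Locale.ENGLISH`) to every case conversion whose result feeds an identifier lookup.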
- Use StringBuilder instead of StringBuffer where possible (#1010)
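The StringBuilder change above is standard JDK advice: `StringBuffer` synchronizes every method, which is pure overhead when the buffer never escapes a single thread. The two classes share the same API, so the swap is mechanical:

```java
public class BuilderDemo {
    public static void main(String[] args) {
        // StringBuffer: every append acquires a lock, wasted on one thread.
        StringBuffer buf = new StringBuffer();
        // StringBuilder: identical API, no synchronization.
        StringBuilder sb = new StringBuilder();
        for (String tok : new String[]{"a", "b", "c"}) {
            buf.append(tok).append(' ');
            sb.append(tok).append(' ');
        }
        System.out.println(sb.toString().trim());                 // a b c
        System.out.println(sb.toString().equals(buf.toString())); // true
    }
}
```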
v4.2.0
Overview
This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.
Enhancements
- Upgrade libraries (EJML, JUnit, JFlex)
- Add character offsets to Tregex responses from server
- Improve cleaning of treebanks for English models
- Speed up loading of Wikidict annotator
- New utility for tagging CoNLL-U files in place
- Command line tool for processing TokensRegex
Fixes
- Output single token NER entities in inline XML output format
- Add currency symbol part of speech training data
- Fix issues with tree binarizing
Stanford CoreNLP 4.0.0
Overview
The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.
Enhancements
- UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
- Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
- Have WhitespaceTokenizer support same newline processing as PTBTokenizer
- New mwt annotator for handling multiword tokens in French, German, and Spanish.
- New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
- Add French NER
- New Chinese segmentation based on CTB9
- Improved handling of double codepoint characters
- Easier syntax for specifying language specific pipelines and NER pipeline properties
- Improved CoNLL-U processing
- Improved speed and memory performance for CRF training
- Tregex support in CoreSentence
- Updated library dependencies
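The "double codepoint characters" item refers to characters outside the Basic Multilingual Plane, which Java stores as two `char` units (a surrogate pair); code that counts `char`s instead of code points computes wrong lengths and offsets. A plain-JDK illustration of the underlying issue (not CoreNLP code):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1D538 (𝔸, MATHEMATICAL DOUBLE-STRUCK CAPITAL A) takes two
        // char units in Java (a surrogate pair), followed by a plain 'B'.
        String s = "\uD835\uDD38B";
        System.out.println(s.length());                      // 3 char units
        System.out.println(s.codePointCount(0, s.length())); // 2 characters

        // Iterating char by char would split the pair; iterate by code point:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // U+1D538, U+0042
    }
}
```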
Fixes
- NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
- NPE in EntityMentionsAnnotator during language check
- NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
- NPE in NERCombinerAnnotator in certain configurations of models on/off
- Incorrect handling of eolonly option in ArabicSegmenterAnnotator
- Apply named entity granularity change prior to coref mention detection
- Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
- Incorrect handling of reading in German treebank files
- SR parser crashes when given bad training input
- New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack that special-cased 'Alex.' for 'Alex. Brown'
- Fix ancient bug in printing constituency trees with multiple roots
- Fix the parser failing on the word "STOP", which it treated as a special word